2018-12-06 11:33:34

by Gautham R Shenoy

Subject: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

From: "Gautham R. Shenoy" <[email protected]>

Currently running DLPAR offline/online operations in a loop on a
POWER9 system with SMT=off results in the following crash:

[ 223.321032] cpu 112 (hwid 112) Ready to die...
[ 223.355963] Querying DEAD? cpu 113 (113) shows 2
[ 223.356233] cpu 114 (hwid 114) Ready to die...
[ 223.356236] cpu 113 (hwid 113) Ready to die...
[ 223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50
[ 223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1]
[ 223.356244] LE SMP NR_CPUS=2048 NUMA pSeries
[ 223.356247] Modules linked in:
[ 223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39
[ 223.356258] NIP: 000000001faf6d50 LR: 000000001faf6d50 CTR: 000000001ec6d06c
[ 223.356259] REGS: c00000001e5cbd30 TRAP: 0700 Not tainted (4.20.0-rc3)
[ 223.356260] MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000020
[ 223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033
GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 00000267e6a273c3
GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 000000001faf6d30
GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 0000000000000000
GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 000000001eea3f40
GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 c0000000013d5300
GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 c0000000013d6888
GPR24: 0000000000000072 c0000000013d688c 0000000000000002 c0000000013d688c
GPR28: c0000000019cecf0 0000000000000390 0000000000000000 000000001faf6ca0
[ 223.356281] NIP [000000001faf6d50] 0x1faf6d50
[ 223.356281] LR [000000001faf6d50] 0x1faf6d50
[ 223.356282] Call Trace:
[ 223.356283] Instruction dump:
[ 223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 223.356290] ---[ end trace f46a4e046b564d1f ]---

This is due to multiple offlined CPUs (CPU 113 and CPU 114 above)
concurrently (within 3us) trying to make the rtas-call with the
"stop-self" token, something that is prohibited by the PAPR.

The concurrent calls can happen as follows.

o In dlpar_offline_cpu() we prod an offline CPU X (offline due to
SMT=off) and loop for 25 tries in pseries_cpu_die() querying if
the target CPU X has been stopped in RTAS. After 25 tries, we
print the message "Querying DEAD? cpu X (X) shows 2" and return
to dlpar_offline_cpu(). Note that at this point CPU X has not yet
called rtas with the "stop-self" token, but can do so anytime now.

o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y
promptly wakes up and calls RTAS with the "stop-self" token.

o Before RTAS can stop CPU Y, CPU X also calls RTAS with the
"stop-self" token.

The problem occurs due to the small number of tries (25) in
pseries_cpu_die(), which covers only 90% of the cases. For the
remaining 10% of the cases, it was observed that we would need to loop
for a few hundred iterations before the target CPU calls rtas with the
"stop-self" token. Experiments show that each try takes roughly ~25us.

In this patch we fix the problem by increasing the number of tries
from 25 to 4000 (which roughly corresponds to 100ms) before bailing
out and declaring that we have failed to observe the target CPU call
rtas-stop-self. This provides sufficient serialization to ensure that
there are no concurrent rtas calls with the "stop-self" token.

Reported-by: Michael Bringmann <[email protected]>
Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 2f8e621..c913c44 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,14 +214,25 @@ static void pseries_cpu_die(unsigned int cpu)
msleep(1);
}
} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+ int max_tries = 4000; /* Roughly corresponds to 100ms */
+ u64 tb_before = mftb();

- for (tries = 0; tries < 25; tries++) {
+ for (tries = 0; tries < max_tries; tries++) {
cpu_status = smp_query_cpu_stopped(pcpu);
if (cpu_status == QCSS_STOPPED ||
cpu_status == QCSS_HARDWARE_ERROR)
break;
cpu_relax();
}
+
+ if (tries == max_tries) {
+ u64 time_elapsed_us = div_u64(mftb() - tb_before,
+ tb_ticks_per_usec);
+
+ pr_warn("Offlined CPU %d isn't stopped by RTAS after %llu us\n",
+ cpu, time_elapsed_us);
+ WARN_ON(1);
+ }
}

if (cpu_status != 0) {
--
1.8.3.1



2018-12-06 17:29:29

by Thiago Jung Bauermann

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop


Hello Gautham,

Gautham R. Shenoy <[email protected]> writes:

> From: "Gautham R. Shenoy" <[email protected]>
>
> Currently running DLPAR offline/online operations in a loop on a
> POWER9 system with SMT=off results in the following crash:
>
> [ 223.321032] cpu 112 (hwid 112) Ready to die...
> [ 223.355963] Querying DEAD? cpu 113 (113) shows 2
> [ 223.356233] cpu 114 (hwid 114) Ready to die...
> [ 223.356236] cpu 113 (hwid 113) Ready to die...
> [ 223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50
> [ 223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1]
> [ 223.356244] LE SMP NR_CPUS=2048 NUMA pSeries
> [ 223.356247] Modules linked in:
> [ 223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39
> [ 223.356258] NIP: 000000001faf6d50 LR: 000000001faf6d50 CTR: 000000001ec6d06c
> [ 223.356259] REGS: c00000001e5cbd30 TRAP: 0700 Not tainted (4.20.0-rc3)
> [ 223.356260] MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000020
> [ 223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033
> GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 00000267e6a273c3
> GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 000000001faf6d30
> GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 0000000000000000
> GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 000000001eea3f40
> GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 c0000000013d5300
> GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 c0000000013d6888
> GPR24: 0000000000000072 c0000000013d688c 0000000000000002 c0000000013d688c
> GPR28: c0000000019cecf0 0000000000000390 0000000000000000 000000001faf6ca0
> [ 223.356281] NIP [000000001faf6d50] 0x1faf6d50
> [ 223.356281] LR [000000001faf6d50] 0x1faf6d50
> [ 223.356282] Call Trace:
> [ 223.356283] Instruction dump:
> [ 223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [ 223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [ 223.356290] ---[ end trace f46a4e046b564d1f ]---
>
> This is due to multiple offlined CPUs (CPU 113 and CPU 114 above)
> concurrently (within 3us) trying to make the rtas-call with the
> "stop-self" token, something that is prohibited by the PAPR.
>
> The concurrent calls can happen as follows.
>
> o In dlpar_offline_cpu() we prod an offline CPU X (offline due to
> SMT=off) and loop for 25 tries in pseries_cpu_die() querying if
> the target CPU X has been stopped in RTAS. After 25 tries, we
> prints the message "Querying DEAD? cpu X (X) shows 2" and return
> to dlpar_offline_cpu(). Note that at this point CPU X has not yet
> called rtas with the "stop-self" token, but can do so anytime now.
>
> o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y
> promptly wakes up and calls RTAS with the "stop-self" token.
>
> o Before RTAS can stop CPU Y, CPU X also calls RTAS with the
> "stop-self" token.
>
> The problem occurs due to the short number of tries (25) in
> pseries_cpu_die() which covers 90% of the cases. For the remaining 10%
> of the cases, it was observed that we would need to loop for a few
> hundred iterations before the target CPU calls rtas with "stop-self"
> token.Experiments show that each try takes roughly ~25us.
>
> In this patch we fix the problem by increasing the number of tries
> from 25 to 4000 (which roughly corresponds to 100ms) before bailing
> out and declaring that we have failed to observe the target CPU call
> rtas-stop-self. This provides sufficient serialization to ensure that
> there no concurrent rtas calls with "stop-self" token.
>
>
>
> Reported-by: Michael Bringmann <[email protected]>
> Signed-off-by: Gautham R. Shenoy <[email protected]>
> ---
> arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 2f8e621..c913c44 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -214,14 +214,25 @@ static void pseries_cpu_die(unsigned int cpu)
> msleep(1);
> }
> } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
> + int max_tries = 4000; /* Roughly corresponds to 100ms */
> + u64 tb_before = mftb();
>
> - for (tries = 0; tries < 25; tries++) {
> + for (tries = 0; tries < max_tries; tries++) {
> cpu_status = smp_query_cpu_stopped(pcpu);
> if (cpu_status == QCSS_STOPPED ||
> cpu_status == QCSS_HARDWARE_ERROR)
> break;
> cpu_relax();
> }
> +
> + if (tries == max_tries) {
> + u64 time_elapsed_us = div_u64(mftb() - tb_before,
> + tb_ticks_per_usec);
> +
> + pr_warn("Offlined CPU %d isn't stopped by RTAS after %llu us\n",
> + cpu, time_elapsed_us);
> + WARN_ON(1);
> + }
> }
>
> if (cpu_status != 0) {

I posted a similar patch last year, but I wasn't able to arrive at a
root cause analysis like you did:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html

One thing I realized after I posted the patch was that in my case, the
CPU was crashing inside RTAS. From the NIP and LR in the trace above it
looks like it's crashing in RTAS in your case as well.

Michael Ellerman had two comments on my patch:

1. Regardless of the underlying bug, the kernel shouldn't crash so we
need a patch making it more resilient to this failure.

2. The wait loop should use udelay() so that the loop will actually take
a set amount of wall time, rather than just cycles.

Regarding 1. if the problem is that the kernel is causing RTAS to crash
because it calls it in a way that's unsupported, then I don't see how we
can make the kernel more resilient. We have to make the kernel respect
RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail
gracefully in these conditions).

Regarding 2. I implemented a new version of my patch (posted below) but
I was never able to test it because I couldn't access a system where the
problem was reproducible anymore.
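
For reference, the kind of wall-clock-bounded loop that comment 2 asks
for would look roughly like the fragment below. The macro names and
constants are only examples of mine; the patch at the end of this mail
does the equivalent with spin_event_timeout() and udelay(100):

	/*
	 * Illustrative fragment only, not the actual patch: poll
	 * query-cpu-stopped-state for a bounded amount of wall time
	 * instead of a fixed number of cpu_relax() iterations, whose
	 * duration depends on CPU frequency. With a 100us delay,
	 * 25 tries bound the wait to roughly 2.5ms.
	 */
	#define QCSS_POLL_DELAY_US	100	/* assumed polling interval */
	#define QCSS_MAX_TRIES		25	/* ~2.5ms total wall time */

	for (tries = 0; tries < QCSS_MAX_TRIES; tries++) {
		cpu_status = smp_query_cpu_stopped(pcpu);
		if (cpu_status == QCSS_STOPPED ||
		    cpu_status == QCSS_HARDWARE_ERROR)
			break;
		udelay(QCSS_POLL_DELAY_US);
	}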

There's also a race between the CPU driving the unplug and the CPU being
unplugged which I think is not easy for the CPU being unplugged to win,
which makes the busy loop in pseries_cpu_die() a bit fragile. I describe
the race in the patch description.

My solution to make the race less tight is to have the CPU driving the
unplug start the busy loop only after the CPU being unplugged is in the
CPU_STATE_OFFLINE state. At that point, we know that it either is about
to call RTAS or it already has.

Do you think this makes sense? If you do, would you mind testing my
patch? You can change the timeouts and delays if you want. To be honest
they're just guesses on my part...



From 4174cd0939fadeda93e49ed2e94ba84131c922d0 Mon Sep 17 00:00:00 2001
From: Thiago Jung Bauermann <[email protected]>
Date: Thu, 6 Dec 2018 15:16:34 -0200
Subject: [PATCH] powerpc/pseries: Only wait for dying CPU after call to
rtas_stop_self()

When testing DLPAR CPU add/remove on a system under stress,
pseries_cpu_die() doesn't wait long enough for a CPU to die and the kernel
ends up crashing:

[ 446.143648] cpu 152 (hwid 152) Ready to die...
[ 446.464057] cpu 153 (hwid 153) Ready to die...
[ 446.473525] cpu 154 (hwid 154) Ready to die...
[ 446.474077] cpu 155 (hwid 155) Ready to die...
[ 446.483529] cpu 156 (hwid 156) Ready to die...
[ 446.493532] cpu 157 (hwid 157) Ready to die...
[ 446.494078] cpu 158 (hwid 158) Ready to die...
[ 446.503527] cpu 159 (hwid 159) Ready to die...
[ 446.664534] cpu 144 (hwid 144) Ready to die...
[ 446.964113] cpu 145 (hwid 145) Ready to die...
[ 446.973525] cpu 146 (hwid 146) Ready to die...
[ 446.974094] cpu 147 (hwid 147) Ready to die...
[ 446.983944] cpu 148 (hwid 148) Ready to die...
[ 446.984062] cpu 149 (hwid 149) Ready to die...
[ 446.993518] cpu 150 (hwid 150) Ready to die...
[ 446.993543] Querying DEAD? cpu 150 (150) shows 2
[ 446.994098] cpu 151 (hwid 151) Ready to die...
[ 447.133726] cpu 136 (hwid 136) Ready to die...
[ 447.403532] cpu 137 (hwid 137) Ready to die...
[ 447.403772] cpu 138 (hwid 138) Ready to die...
[ 447.403839] cpu 139 (hwid 139) Ready to die...
[ 447.403887] cpu 140 (hwid 140) Ready to die...
[ 447.403937] cpu 141 (hwid 141) Ready to die...
[ 447.403979] cpu 142 (hwid 142) Ready to die...
[ 447.404038] cpu 143 (hwid 143) Ready to die...
[ 447.513546] cpu 128 (hwid 128) Ready to die...
[ 447.693533] cpu 129 (hwid 129) Ready to die...
[ 447.693999] cpu 130 (hwid 130) Ready to die...
[ 447.703530] cpu 131 (hwid 131) Ready to die...
[ 447.704087] Querying DEAD? cpu 132 (132) shows 2
[ 447.704102] cpu 132 (hwid 132) Ready to die...
[ 447.713534] cpu 133 (hwid 133) Ready to die...
[ 447.714064] Querying DEAD? cpu 134 (134) shows 2
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
pc: 000000001ec3072c
lr: 000000001ec2fee0
sp: 1faf6bd0
msr: 8000000102801000
dar: 212d6c1a2a20c
dsisr: 42000000
current = 0xc000000474c6d600
paca = 0xc000000007b6b600 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue

This was reproduced in v4.12-rc3 as well, but I don't have a crash log
handy for that version right now. Sorry.

This is a race between one CPU stopping and another one calling
pseries_cpu_die() to wait for it to stop. That function does a short busy
loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify
that it is stopped.

As can be seen in the dmesg right before or after the "Querying DEAD?"
messages, if pseries_cpu_die() waited a little longer it would have seen
the CPU in the stopped state.

I see two cases that can be causing this race:

1. It's possible that CPU 134 was inactive at the time it was unplugged. In
that case, dlpar_offline_cpu() calls H_PROD on the CPU and immediately
calls pseries_cpu_die(). Meanwhile, the prodded CPU activates and starts
the process of stopping itself. It's possible that the busy loop is not
long enough to allow for the CPU to wake up and complete the stopping
process.

2. If CPU 134 was online at the time it was unplugged, it would have gone
through the new CPU hotplug state machine in kernel/cpu.c that was
introduced in v4.6 to get itself stopped. It's possible that the busy
loop in pseries_cpu_die() was long enough for the older hotplug code but
not for the new hotplug state machine.

Either way, the solution is to start querying whether the CPU is stopped
only after it has had a chance to call rtas_stop_self(). Since
pseries_mach_cpu_die() sets the CPU current state to offline almost
immediately before the RTAS call, we use that as a signal that it is either
already stopped or very close to that point, and we can start the busy
loop.

Signed-off-by: Thiago Jung Bauermann <[email protected]>
---
arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 2f8e62163602..fad3ceb74623 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,13 +214,21 @@ static void pseries_cpu_die(unsigned int cpu)
msleep(1);
}
} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+ /*
+ * If the current state is not offline yet, it means that the
+ * dying CPU (which is in pseries_mach_cpu_die) didn't have a
+ * chance to call rtas_stop_self yet and therefore it's too
+ * early to query if the CPU is stopped.
+ */
+ spin_event_timeout(get_cpu_current_state(cpu) == CPU_STATE_OFFLINE,
+ 100000, 100);

for (tries = 0; tries < 25; tries++) {
cpu_status = smp_query_cpu_stopped(pcpu);
if (cpu_status == QCSS_STOPPED ||
cpu_status == QCSS_HARDWARE_ERROR)
break;
- cpu_relax();
+ udelay(100);
}
}



2018-12-07 10:44:11

by Gautham R Shenoy

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

Hi Thiago,

On Thu, Dec 06, 2018 at 03:28:17PM -0200, Thiago Jung Bauermann wrote:

[..snip..]

>
>
> I posted a similar patch last year, but I wasn't able to arrive at a
> root cause analysis like you did:
>
>
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html

Ah! Nice. So this is a known problem.

>
> One thing I realized after I posted the patch was that in my case, the
> CPU was crashing inside RTAS. From the NIP and LR in the trace above it
> looks like it's crashing in RTAS in your case as well.
>
> Michael Ellerman had two comments on my patch:
>
> 1. Regardless of the underlying bug, the kernel shouldn't crash so we
> need a patch making it more resilient to this failure.
>
> 2. The wait loop should use udelay() so that the loop will actually take
> a set amount of wall time, rather than just cycles.
>
> Regarding 1. if the problem is that the kernel is causing RTAS to crash
> because it calls it in a way that's unsupported, then I don't see how we
> can make the kernel more resilient. We have to make the kernel respect
> RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail
> gracefuly in these conditions).


I agree that the Kernel has to respect RTAS's restriction. The PAPR
v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:

"The stop-self service needs to be serialized with calls to the
stop-self, start-cpu, and set-power-level services. The OS must
be able to call RTAS services on other processors while the
processor is stopped or being stopped"

Thus the onus is on the OS to ensure that there are no concurrent rtas
calls with the "stop-self" token.

>
> Regarding 2. I implemented a new version of my patch (posted below) but
> I was never able to test it because I couldn't access a system where the
> problem was reproducible anymore.
>
> There's also a race between the CPU driving the unplug and the CPU being
> unplugged which I think is not easy for the CPU being unplugged to win,
> which makes the busy loop in pseries_cpu_die() a bit fragile. I describe
> the race in the patch description.
>
> My solution to make the race less tight is to make the CPU driving the
> unplug to only start the busy loop only after the CPU being unplugged is
> in the CPU_STATE_OFFLINE state. At that point, we know that it either is
> about to call RTAS or it already has.

Ah, yes, this is a good optimization. Though, I think we ought to
unconditionally wait until the target CPU has woken up from CEDE and
changed its state to CPU_STATE_OFFLINE. After all, if PROD had failed,
we would have caught it in dlpar_offline_cpu() itself.

>
> Do you think this makes sense? If you do, would you mind testing my
> patch? You can change the timeouts and delays if you want. To be honest
> they're just guesses on my part...

Sure. I will test the patch and report back.

--
Thanks and Regards
gautham.


2018-12-07 12:05:08

by Gautham R Shenoy

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
> Hi Thiago,
>
>
> Sure. I will test the patch and report back.

I added the following debug patch on top of your patch, and after an
hour's run, the system crashed. Appending the log at the end.

I suppose we still need to increase the number of tries, since we loop
for only 2.5ms before giving up.

----------------------------x8-----------------------------------------
From 9e9e1e459cdc63540a73575af359a7bc0d43e455 Mon Sep 17 00:00:00 2001
From: "Gautham R. Shenoy" <[email protected]>
Date: Fri, 7 Dec 2018 16:47:25 +0530
Subject: [PATCH] Add debug on top of Thiago's patch

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/platforms/pseries/hotplug-cpu.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index fad3ceb..2545aab 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,6 +214,9 @@ static void pseries_cpu_die(unsigned int cpu)
msleep(1);
}
} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+ u64 tb_before = mftb();
+ u64 tb_after;
+
/*
* If the current state is not offline yet, it means that the
* dying CPU (which is in pseries_mach_cpu_die) didn't have a
@@ -230,6 +233,11 @@ static void pseries_cpu_die(unsigned int cpu)
break;
udelay(100);
}
+
+ tb_after = mftb();
+
+ printk("[DEBUG] Waited for CPU %d to enter rtas: tries=%d, time=%llu\n", cpu, tries,
+ div_u64(tb_after - tb_before, tb_ticks_per_usec));
}

if (cpu_status != 0) {
--
1.9.4
----------------------------x8-----------------------------------------

cpu 112 (hwid 112) Ready to die...
[DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
cpu 113 (hwid 113) Ready to die...
[DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
cpu 114 (hwid 114) Ready to die...
[DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
cpu 115 (hwid 115) Ready to die...
[DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
cpu 116 (hwid 116) Ready to die...
[DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
cpu 117 (hwid 117) Ready to die...
[DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
cpu 118 (hwid 118) Ready to die...
[DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
cpu 119 (hwid 119) Ready to die...
[DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
cpu 104 (hwid 104) Ready to die...
[DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40

******* RTAS CALL BUFFER CORRUPTION *******
393: rtas32_call_buff_ptr=
0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
4800 0008 .... .... .... .... .... .... [H...........0nur]
394: rtas64_map_buff_ptr=
0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
0000 0000 0000 0000 0000 0000 0000 0000 [................]
0000 0000 0000 0005 0000 0000 0000 0001 [................]
0000 0000 1A00 0000 0000 0000 0000 0000 [................]
0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
0000 0000 .... .... .... .... .... .... [......c.........]
cpu 105 (hwid 105) Ready to die...
Bad kernel stack pointer 1fafb6c0 at 0
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
NIP: 0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
REGS: c00000001e63bd30 TRAP: 0700 Not tainted (4.20.0-rc5-thiago+)
MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000010
CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
NIP [0000000000000000] (null)
LR [0000000000000000] (null)
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
---[ end trace 1aa3b4936949457e ]---
Bad kernel stack pointer 1fafb4b0 at 1ec15004
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
rcu: (detected by 72, t=10866 jiffies, g=180529, q=2526)
Sending NMI from CPU 72 to CPUs 88:
CPU 88 didn't respond to backtrace IPI, inspecting paca.
irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
Call Trace:
[c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
[c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
[c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
[c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
[c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
[c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
[c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
[c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
[c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
[c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
[c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
[c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
[c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
[c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
[c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
[c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
[c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70
rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
rcu: RCU grace-period kthread stack dump:
rcu_sched I 0 11 2 0x00000808
Call Trace:
[c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
[c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
[c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
[c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
[c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
[c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
[c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
[c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64






2018-12-08 02:42:01

by Thiago Jung Bauermann

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop


Gautham R Shenoy <[email protected]> writes:
> On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
>> Sure. I will test the patch and report back.
>
> I added the following debug patch on top of your patch, and after an
> hour's run, the system crashed. Appending the log at the end.

Thank you very much for testing! Your debug patch was very helpful as
well.

> I suppose we still need to increase the number of tries since we wait
> only for 2.5ms looping before giving up.

Do you think it would have helped? In the debug output you posted I
would have expected the following message to show up if the loop
finished too early, and it didn't:

"Querying DEAD? cpu %i (%i) shows %i\n"

So I don't think increasing the loop length would have made a
difference. In fact, the call to smp_query_cpu_stopped() always
succeeded on the first iteration.

I think there is something else going on which we don't fully understand
yet. From your other email:

> I agree that the Kernel has to respect RTAS's restriction. The PAPR
> v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
>
> "The stop-self service needs to be serialized with calls to the
> stop-self, start-cpu, and set-power-level services. The OS must
> be able to call RTAS services on other processors while the
> processor is stopped or being stopped"
>
> Thus the onus is on the OS to ensure that there are no concurrent rtas
> calls with "stop-self" token.

As you say perhaps there's another call to stop-self, start-cpu or
set-power-level being made concurrently. I don't currently see how more
than one stop-self or start-cpu call could be in flight at the same time
given that there are a number of locks being grabbed during CPU hotplug
and unplug. OTOH the CPU that actually calls stop-self doesn't seem to
grab any locks itself so it's a possibility.

As for set-power-level, it's only used in the case of PCI hotplug from
what I can see, and that isn't part of the picture in this case, right?

We could address this problem directly by adding another lock separate
from rtas.lock to serialize just these calls. The challenge is with
stop-self, because the CPU calling it will never return to release the
lock. Is it possible to grab a lock (or down a semaphore) in the CPU
calling stop-self and then release the lock (or up the semaphore) in the
CPU running pseries_cpu_die()?
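
Something like the minimal sketch below is what I have in mind. The
names are hypothetical, and it glosses over the fact that the dying CPU
runs with interrupts disabled and can't sleep, so a plain down() may not
be usable as-is there:

	/*
	 * Hypothetical sketch, not actual kernel code: a dedicated
	 * semaphore taken by the dying CPU just before stop-self and
	 * released on its behalf by the CPU waiting in
	 * pseries_cpu_die() once query-cpu-stopped-state reports the
	 * target as stopped.
	 */
	static DEFINE_SEMAPHORE(rtas_stop_self_sem);

	/* On the dying CPU, in the pseries_mach_cpu_die() path: */
	down(&rtas_stop_self_sem);	/* a non-sleeping variant may be needed here */
	rtas_stop_self();		/* does not return */

	/* On the CPU driving the unplug, in pseries_cpu_die(), after
	 * smp_query_cpu_stopped(pcpu) returns QCSS_STOPPED or
	 * QCSS_HARDWARE_ERROR for the dying CPU: */
	up(&rtas_stop_self_sem);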

> > There's also a race between the CPU driving the unplug and the CPU
> > being unplugged which I think is not easy for the CPU being
> > unplugged to win, which makes the busy loop in pseries_cpu_die() a
> > bit fragile. I describe the race in the patch description.
> >
> > My solution to make the race less tight is to make the CPU driving
> > the unplug to only start the busy loop only after the CPU being
> > unplugged is in the CPU_STATE_OFFLINE state. At that point, we know
> > that it either is about to call RTAS or it already has.
>
> Ah, yes this is good optimization. Though, I think we ought to
> unconditionally wait until the target CPU has woken up from CEDE and
> changed its state to CPU_STATE_OFFLINE. After if PROD failed, then we
> would have caught it in dlpar_offline_cpu() itself.

I recently saw a QEMU-implemented hcall (H_LOGICAL_CI_LOAD) return
success when it had been given an invalid memory address to load from,
so my confidence in the error reporting of hcalls is a bit shaken. :-)

In that case the CPU would wait forever for the CPU state to change. If
you believe 100 ms is too short a timeout, we could make it 500 ms or
even 1s. What do you think?

> cpu 112 (hwid 112) Ready to die...
> [DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
> cpu 113 (hwid 113) Ready to die...
> [DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
> cpu 114 (hwid 114) Ready to die...
> [DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
> cpu 115 (hwid 115) Ready to die...
> [DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
> cpu 116 (hwid 116) Ready to die...
> [DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
> cpu 117 (hwid 117) Ready to die...
> [DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
> cpu 118 (hwid 118) Ready to die...
> [DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
> cpu 119 (hwid 119) Ready to die...
> [DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
> cpu 104 (hwid 104) Ready to die...
> [DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40

Interesting, so 1.2 ms can pass before the dying CPU actually gets close
to making the stop-self call. And even in those cases the retry loop is
succeeding on the first try! So this shows that changing the code to
wait for the CPU_STATE_OFFLINE state is worth it.

> ******* RTAS CALL BUFFER CORRUPTION *******
> 393: rtas32_call_buff_ptr=
> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
> 0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
> 2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
> 0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
> 4800 0008 .... .... .... .... .... .... [H...........0nur]
> 394: rtas64_map_buff_ptr=
> 0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
> 0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
> 0000 0000 0000 0000 0000 0000 0000 0000 [................]
> 0000 0000 0000 0005 0000 0000 0000 0001 [................]
> 0000 0000 1A00 0000 0000 0000 0000 0000 [................]
> 0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
> 0000 0000 .... .... .... .... .... .... [......c.........]

Ah! I never saw this error message. So indeed the kernel is causing RTAS
to blow up. Perhaps it would be useful to instrument more RTAS calls
(especially start-cpu and set-power-level) to see if it's one of them
that is being called at the time this corruption happens.

> cpu 105 (hwid 105) Ready to die...
> Bad kernel stack pointer 1fafb6c0 at 0
> Oops: Bad kernel stack pointer, sig: 6 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in:
> CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
> NIP: 0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
> REGS: c00000001e63bd30 TRAP: 0700 Not tainted (4.20.0-rc5-thiago+)
> MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000010
> CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
> GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
> GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
> GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
> GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
> GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
> GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
> GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
> GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
> NIP [0000000000000000] (null)
> LR [0000000000000000] (null)
> Call Trace:
> Instruction dump:
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
> ---[ end trace 1aa3b4936949457e ]---

Ok, so at about the time CPU 105 makes the stop-self call there is this
RTAS call buffer corruption and this bad kernel stack pointer in CPU 105.
We need to understand better what is causing this.

> Bad kernel stack pointer 1fafb4b0 at 1ec15004
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> rcu: 88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
> rcu: (detected by 72, t=10866 jiffies, g=180529, q=2526)
> Sending NMI from CPU 72 to CPUs 88:
> CPU 88 didn't respond to backtrace IPI, inspecting paca.
> irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
> Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
> Call Trace:
> [c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
> [c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
> [c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
> [c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
> [c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
> [c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
> [c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
> [c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
> [c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
> [c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
> [c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
> [c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
> [c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
> [c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
> [c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
> [c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
> [c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70

So CPU 88 is the one driving the hot unplug and waiting for CPU 105 to
die. But it is stuck inside RTAS. Perhaps because of the call buffer
corruption?

> rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
> rcu: RCU grace-period kthread stack dump:
> rcu_sched I 0 11 2 0x00000808
> Call Trace:
> [c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
> [c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
> [c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
> [c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
> [c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
> [c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
> [c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
> [c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64

I don't know what to make of CPU 72. :-) Perhaps it's the one making
the other "rogue" RTAS call interfering with stop-self in CPU 105?

It must be some RTAS call made with rtas_call_unlocked, because CPU 88
is holding the RTAS lock.

-- Thiago Jung Bauermann
IBM Linux Technology Center


2018-12-10 20:47:18

by Michael Bringmann

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

I have asked Scott Mayes to take a look at one of these crashes from
the phyp side. I will let you know if he finds anything notable.

Michael

On 12/07/2018 08:40 PM, Thiago Jung Bauermann wrote:
>
> Gautham R Shenoy <[email protected]> writes:
>> On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
>>> Sure. I will test the patch and report back.
>>
>> I added the following debug patch on top of your patch, and after an
>> hour's run, the system crashed. Appending the log at the end.
>
> Thank you very much for testing! Your debug patch was very helpful as
> well.
>
>> I suppose we still need to increase the number of tries since we wait
>> only for 2.5ms looping before giving up.
>
> Do you think it would have helped? In the debug output you posted I
> would have expected the following message to show up if the loop
> finished too early, and it didn't:
>
> "Querying DEAD? cpu %i (%i) shows %i\n"
>
> So I don't think increasing the loop length would have made a
> difference. In fact, the call to smp_query_cpu_stopped() always
> succeeded on the first iteration.
>
> I think there is something else going on which we don't fully understand
> yet. From your other email:
>
>> I agree that the Kernel has to respect RTAS's restriction. The PAPR
>> v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
>>
>> "The stop-self service needs to be serialized with calls to the
>> stop-self, start-cpu, and set-power-level services. The OS must
>> be able to call RTAS services on other processors while the
>> processor is stopped or being stopped"
>>
>> Thus the onus is on the OS to ensure that there are no concurrent rtas
>> calls with "stop-self" token.
>
> As you say perhaps there's another call to stop-self, start-cpu or
> set-power-level being made concurrently. I don't currently see how more
> than one stop-self or start-cpu call could be in flight at the same time
> given that there are a number of locks being grabbed during CPU hotplug
> and unplug. OTOH the CPU that actually calls stop-self doesn't seem to
> grab any locks itself so it's a possibility.
>
> As for set-power-level, it's only used in the case of PCI hotplug from
> what I can see, and that isn't part of the picture in this case, right?
>
> We could address this problem directly by adding another lock separate
> from rtas.lock to serialize just these calls. The challenge is with
> stop-self, because the CPU calling it will never return to release the
> lock. Is it possible to grab a lock (or down a semaphore) in the CPU
> calling stop-self and then release the lock (or up the semaphore) in the
> CPU running pseries_cpu_die()?
>
>>> There's also a race between the CPU driving the unplug and the CPU
>>> being unplugged which I think is not easy for the CPU being
>>> unplugged to win, which makes the busy loop in pseries_cpu_die() a
>>> bit fragile. I describe the race in the patch description.
>>>
>>> My solution to make the race less tight is to make the CPU driving
>>> the unplug to only start the busy loop only after the CPU being
>>> unplugged is in the CPU_STATE_OFFLINE state. At that point, we know
>>> that it either is about to call RTAS or it already has.
>>
>> Ah, yes this is good optimization. Though, I think we ought to
>> unconditionally wait until the target CPU has woken up from CEDE and
>> changed its state to CPU_STATE_OFFLINE. After if PROD failed, then we
>> would have caught it in dlpar_offline_cpu() itself.
>
> I recently saw a QEMU-implemented hcall (H_LOGICAL_CI_LOAD) return
> success when it had been given an invalid memory address to load from,
> so my confidence in the error reporting of hcalls is a bit shaken. :-)
>
> In that case the CPU would wait forever for the CPU state to change. If
> you believe 100 ms is too short a timeout, we could make it 500 ms or
> even 1s. What do you think?
>
>> cpu 112 (hwid 112) Ready to die...
>> [DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
>> cpu 113 (hwid 113) Ready to die...
>> [DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
>> cpu 114 (hwid 114) Ready to die...
>> [DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
>> cpu 115 (hwid 115) Ready to die...
>> [DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
>> cpu 116 (hwid 116) Ready to die...
>> [DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
>> cpu 117 (hwid 117) Ready to die...
>> [DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
>> cpu 118 (hwid 118) Ready to die...
>> [DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
>> cpu 119 (hwid 119) Ready to die...
>> [DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
>> cpu 104 (hwid 104) Ready to die...
>> [DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40
>
> Interesting, so 1.2 ms can pass before the dying CPU actually gets close
> to making the stop-self call. And even in those cases the retry loop is
> succeeding on the first try! So this shows that changing the code to
> wait for the CPU_STATE_OFFLINE state is worth it.
>
>> ******* RTAS CALL BUFFER CORRUPTION *******
>> 393: rtas32_call_buff_ptr=
>> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>> 0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
>> 2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
>> 0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
>> 4800 0008 .... .... .... .... .... .... [H...........0nur]
>> 394: rtas64_map_buff_ptr=
>> 0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
>> 0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
>> 0000 0000 0000 0000 0000 0000 0000 0000 [................]
>> 0000 0000 0000 0005 0000 0000 0000 0001 [................]
>> 0000 0000 1A00 0000 0000 0000 0000 0000 [................]
>> 0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
>> 0000 0000 .... .... .... .... .... .... [......c.........]
>
> Ah! I never saw this error message. So indeed the kernel is causing RTAS
> to blow up. Perhaps it would be useful to instrument more RTAS calls
> (especially start-cpu and set-power-level) to see if it's one of them
> that is being called at the time this corruption happens.
>
>> cpu 105 (hwid 105) Ready to die...
>> Bad kernel stack pointer 1fafb6c0 at 0
>> Oops: Bad kernel stack pointer, sig: 6 [#1]
>> LE SMP NR_CPUS=2048 NUMA pSeries
>> Modules linked in:
>> CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
>> NIP: 0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
>> REGS: c00000001e63bd30 TRAP: 0700 Not tainted (4.20.0-rc5-thiago+)
>> MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000010
>> CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
>> GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
>> GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
>> GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
>> GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
>> GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
>> GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
>> GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
>> GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
>> NIP [0000000000000000] (null)
>> LR [0000000000000000] (null)
>> Call Trace:
>> Instruction dump:
>> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
>> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
>> ---[ end trace 1aa3b4936949457e ]---
>
> Ok, so at about the time CPU 105 makes the stop-self call there is this
> RTAS call buffer corruption and this bad kernel stack pointer in CPU 105.
> We need to understand better what is causing this.
>
>> Bad kernel stack pointer 1fafb4b0 at 1ec15004
>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> rcu: 88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
>> rcu: (detected by 72, t=10866 jiffies, g=180529, q=2526)
>> Sending NMI from CPU 72 to CPUs 88:
>> CPU 88 didn't respond to backtrace IPI, inspecting paca.
>> irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
>> Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
>> Call Trace:
>> [c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
>> [c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
>> [c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
>> [c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
>> [c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
>> [c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
>> [c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
>> [c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
>> [c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
>> [c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
>> [c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
>> [c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
>> [c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
>> [c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
>> [c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
>> [c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
>> [c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70
>
> So CPU 88 is the one driving the hot unplug and waiting for CPU 105 to
> die. But it is stuck inside RTAS. Perhaps because of the call buffer
> corruption?
>
>> rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
>> rcu: RCU grace-period kthread stack dump:
>> rcu_sched I 0 11 2 0x00000808
>> Call Trace:
>> [c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
>> [c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
>> [c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
>> [c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
>> [c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
>> [c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
>> [c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
>> [c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64
>
> I don't know what to make of CPU 72. :-) Perhaps it's the one making
> the other "rogue" RTAS call interfering with stop-self in CPU 105?
>
> It must be some RTAS call made with rtas_call_unlocked, because CPU 88
> is holding the RTAS lock.
>
> -- Thiago Jung Bauermann
> IBM Linux Technology Center
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]


2018-12-10 23:34:29

by Thiago Jung Bauermann

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop


Hello Michael,

Michael Bringmann <[email protected]> writes:

> I have asked Scott Mayes to take a look at one of these crashes from
> the phyp side. I will let you know if he finds anything notable.

Thanks! It might make sense to test whether booting with
cede_offline=off makes the bug go away.

One suspicion I have is regarding the code handling CPU_STATE_INACTIVE.
From what I understand, it is a powerpc-specific CPU state and from the
perspective of the generic CPU hotplug state machine, inactive CPUs are
already fully offline. Which means that the locking performed by the
generic code state machine doesn't apply to transitioning CPUs from
INACTIVE to OFFLINE state. Perhaps the bug is that there is more than
one CPU making that transition at the same time? That would cause two
CPUs to call RTAS stop-self.

I haven't checked whether this is really possible or not, though. It's
just a conjecture.
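
If it would help, a minimal (and hypothetical, names made up) way to
test the conjecture would be to count how many CPUs are in the window
just before the stop-self call and warn if there is ever more than one:

	/* Debug-only sketch: detect two CPUs racing into stop-self. */
	static atomic_t in_stop_self_window = ATOMIC_INIT(0);

	/* On the dying CPU, immediately before calling stop-self: */
	WARN_ON(atomic_inc_return(&in_stop_self_window) > 1);
	rtas_stop_self();	/* does not return */

	/* In pseries_cpu_die(), once the dying CPU is seen as
	 * QCSS_STOPPED or QCSS_HARDWARE_ERROR: */
	atomic_dec(&in_stop_self_window);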

--
Thiago Jung Bauermann
IBM Linux Technology Center


2018-12-11 22:12:50

by Michael Bringmann

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

Note from Scott Mayes on latest crash:

Michael,

Since the partition crashed, I was able to get the last 0.2 seconds' worth of RTAS call trace leading up to the crash.

As best I could tell from that bit of trace, the removal of a processor involved the following steps:
-- Call to stop-self for a given thread
-- Repeated calls to query-cpu-stopped-state (which eventually indicated the thread was stopped)
-- Call to get-sensor-state for the thread to check its entity-state (9003) sensor which returned 'dr-entity-present'
-- Call to set-indicator to set the isolation-state (9001) indicator to ISOLATE state
-- Call to set-indicator to set the allocation-state (9003) indicator to UNUSABLE state

I noticed one example of thread x28 getting through all of these steps just fine, but for thread x20, although the
query-cpu-stopped-state call returned 0 status (STOPPED), a subsequent call to set-indicator to ISOLATE
failed. This failure was near the end of the trace, but was not the very last RTAS call made in the trace.
The set-indicator failure reported to Linux was a -9001 (Valid outstanding translation) which was mapped
from a 0x502 (Invalid thread state) return code from PHYP's H_SET_DR_STATE h-call.

On 12/10/2018 02:31 PM, Thiago Jung Bauermann wrote:
>
> Hello Michael,
>
> Michael Bringmann <[email protected]> writes:
>
>> I have asked Scott Mayes to take a look at one of these crashes from
>> the phyp side. I will let you know if he finds anything notable.
>
> Thanks! It might make sense to test whether booting with
> cede_offline=off makes the bug go away.

Scott is looking at the system. I will try once he is finished.

>
> One suspicion I have is regarding the code handling CPU_STATE_INACTIVE.
> From what I understand, it is a powerpc-specific CPU state and from the
> perspective of the generic CPU hotplug state machine, inactive CPUs are
> already fully offline. Which means that the locking performed by the
> generic code state machine doesn't apply to transitioning CPUs from
> INACTIVE to OFFLINE state. Perhaps the bug is that there is more than
> one CPU making that transition at the same time? That would cause two
> CPUs to call RTAS stop-self.
>
> I haven't checked whether this is really possible or not, though. It's
> just a conjecture.

Michael

>
> --
> Thiago Jung Bauermann
> IBM Linux Technology Center
>
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]


2019-01-09 06:09:53

by Gautham R Shenoy

Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

Hello Thiago,

Wish you a happy 2019!

On Sat, Dec 08, 2018 at 12:40:52AM -0200, Thiago Jung Bauermann wrote:
>
> Gautham R Shenoy <[email protected]> writes:
> > On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
> >> Sure. I will test the patch and report back.
> >
> > I added the following debug patch on top of your patch, and after an
> > hour's run, the system crashed. Appending the log at the end.
>
> Thank you very much for testing! Your debug patch was very helpful as
> well.
>
> > I suppose we still need to increase the number of tries since we wait
> > only for 2.5ms looping before giving up.
>
> Do you think it would have helped? In the debug output you posted I
> would have expected the following message to show up if the loop
> finished too early, and it didn't:
>
> "Querying DEAD? cpu %i (%i) shows %i\n"
>
> So I don't think increasing the loop length would have made a
> difference. In fact, the call to smp_query_cpu_stopped() always
> succeeded on the first iteration.

I did some testing during the holidays. Here are the observations:

1) With just your patch (without any additional debug patch), if I run
DLPAR on/off operations on a system that has SMT=off, I am able to
see a crash involving RTAS stack corruption within an hour's time.

2) With the debug patch (appended below), which has additional debug to
capture the callers of stop-self, start-cpu and set-power-level, the
system is able to perform DLPAR on/off operations on a system with
SMT=off for three days. And then, it crashed with the dead CPU showing
a "Bad kernel stack pointer". From this log, I can clearly
see that there were no concurrent calls to stop-self, start-cpu, or
set-power-level. The only concurrent RTAS calls were the dying CPU
calling "stop-self", and the CPU running the DLPAR operation calling
"query-cpu-stopped-state". The crash signature is appended below as
well.

3) Modifying your patch to remove the udelay and increase the loop
count from 25 to 1000 doesn't improve the situation. We are still able
to see the crash.

4) With my patch, even without any additional debug, I was able to
observe the system run the tests successfully for over a week (I
started the tests before the Christmas weekend, and forgot to turn it
off!)

It appears that there is a narrow race window involving rtas-stop-self
and rtas-query-cpu-stopped-state calls that can be observed with your
patch. Adding any printk's seems to reduce the probability of hitting
this race window. It might be worthwhile to check with the RTAS
folks to see if they suspect something here.

The crash log with this debug patch is as follows:

[DEBUG] CPU 32 waiting for CPU 130 to enter rtas
cpu 130 (hwid 130) Ready to die...
[DEBUG] CPU 32 noticed CPU 130 enter rtas: tries=0, time=539
[DEBUG] CPU 32 waiting for CPU 131 to enter rtas
cpu 131 (hwid 131) Ready to die...
[DEBUG] CPU 32 noticed CPU 131 enter rtas: tries=0, time=137
[DEBUG] CPU 32 waiting for CPU 132 to enter rtas
cpu 132 (hwid 132) Ready to die...
[DEBUG] CPU 32 noticed CPU 132 enter rtas: tries=0, time=1238
[DEBUG] CPU 32 waiting for CPU 133 to enter rtas
cpu 133 (hwid 133) Ready to die...
[DEBUG] CPU 32 noticed CPU 133 enter rtas: tries=1, time=1259
[DEBUG] CPU 32 waiting for CPU 134 to enter rtas
cpu 134 (hwid 134) Ready to die...
[DEBUG] CPU 32 noticed CPU 134 enter rtas: tries=0, time=1141
[DEBUG] CPU 32 waiting for CPU 135 to enter rtas
cpu 135 (hwid 135) Ready to die...
[DEBUG] CPU 32 noticed CPU 135 enter rtas: tries=0, time=636
cpu 120 (hwid 120) Ready to die...
[DEBUG] CPU 32 waiting for CPU 120 to enter rtas
[DEBUG] CPU 32 noticed CPU 120 enter rtas: tries=0, time=53
[DEBUG] CPU 32 waiting for CPU 121 to enter rtas
cpu 121 (hwid 121) Ready to die...
Bad kernel stack pointer 1fafb400 at 1fafb4b0
Bad kernel stack pointer 1fafb4b0 at 1ec15270
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 121 PID: 0 Comm: swapper/121 Not tainted 4.20.0-thiago-original-with-debug+ #57
NIP: 000000001ec15270 LR: 000000001ec1526c CTR: 000000001ecd26c4
REGS: c00000001e577d30 TRAP: 0300 Not tainted (4.20.0-thiago-original-with-debug+)
MSR: 8000000000001000 <SF,ME> CR: 48000000 XER: 00000020
CFAR: 000000001ecd27a8 DAR: ffffffffffff8108 DSISR: 40000000 IRQMASK: 8000000000009033
GPR00: 000000001ec1526c 000000001fafb4b0 0000000000000000 0000000000000000
GPR04: 0000000001a40968 00000000000000e0 000000000000dfe8 0000000000000018
GPR08: 000000001f82ae00 000000001ed025d4 000000001f827850 0000000000000000
GPR12: 0000000001b6d998 c00000001eb4e080 c0000003a1bdbf90 000000001eea3020
GPR16: 0000000000000000 c0000006fdc45100 c00000000004c8b0 c0000000013d5300
GPR20: c0000006fdc45300 0000000000000008 c0000000019d2cf8 c0000000013d6888
GPR24: 0000000000000079 c0000000013d688c 0000000000000002 c0000000013d688c
GPR28: c0000000019cecf0 00000000000003c8 0000000000000000 000000001fafb4b0
NIP [000000001ec15270] 0x1ec15270
LR [000000001ec1526c] 0x1ec1526c
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace dd774c4912df6c89 ]---


The debug patch applied on top of your patch is as follows:

-----------------------x8-------------------------------------------
From e5cad5ff5b61e003810723e549f44a3cc0cef356 Mon Sep 17 00:00:00 2001
From: "Gautham R. Shenoy" <[email protected]>
Date: Fri, 7 Dec 2018 16:47:25 +0530
Subject: [PATCH] Add debug to Thiago's patch

Also, add debug for callers of rtas-start-cpu and rtas-set-power-level

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/kernel/rtas.c | 1 +
arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 ++++++++++
arch/powerpc/platforms/pseries/smp.c | 2 ++
3 files changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index de35bd8f..9ecf957 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -576,6 +576,7 @@ int rtas_set_power_level(int powerdomain, int level, int *setlevel)
return -ENOENT;

do {
+ printk("CPU %d calling rtas-set-power-level\n", smp_processor_id());
rc = rtas_call(token, 2, 2, setlevel, powerdomain, level);
} while (rtas_busy_delay(rc));

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index fad3ceb..3dab7c3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,6 +214,10 @@ static void pseries_cpu_die(unsigned int cpu)
msleep(1);
}
} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+ u64 tb_before = mftb();
+ u64 tb_after;
+
+ printk("[DEBUG] CPU %d waiting for CPU %d to enter rtas\n", smp_processor_id(), cpu);
/*
* If the current state is not offline yet, it means that the
* dying CPU (which is in pseries_mach_cpu_die) didn't have a
@@ -230,6 +234,12 @@ static void pseries_cpu_die(unsigned int cpu)
break;
udelay(100);
}
+
+ tb_after = mftb();
+
+ printk("[DEBUG] CPU %d noticed CPU %d enter rtas: tries=%d, time=%llu\n",
+ smp_processor_id(), cpu, tries,
+ div_u64(tb_after - tb_before, tb_ticks_per_usec));
}

if (cpu_status != 0) {
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 3df4612..d4624c4 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -123,6 +123,8 @@ static inline int smp_startup_cpu(unsigned int lcpu)
if (start_cpu == RTAS_UNKNOWN_SERVICE)
return 1;

+ printk("CPU %d calling start-cpu for CPU %d\n",
+ smp_processor_id(), lcpu);
status = rtas_call(start_cpu, 3, 1, NULL, pcpu, start_here, pcpu);
if (status != 0) {
printk(KERN_ERR "start-cpu failed: %i\n", status);
--
1.9.4


--
Thanks and Regards
gautham.


2019-01-14 18:14:28

by Michael Bringmann

[permalink] [raw]
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

On 1/9/19 12:08 AM, Gautham R Shenoy wrote:

> I did some testing during the holidays. Here are the observations:
>
> 1) With just your patch (without any additional debug patch), if I run
> DLPAR on/off operations on a system that has SMT=off, I am able to
> see a crash involving RTAS stack corruption within an hour's time.
>
> 2) With the debug patch (appended below) which has additional debug to
> capture the callers of stop-self, start-cpu, set-power-levels, the
> system is able to perform DLPAR on/off operations on a system with
> SMT=off for three days. And then, it crashed with the dead CPU showing
> a "Bad kernel stack pointer". From this log, I can clearly
> see that there were no concurrent calls to stop-self, start-cpu,
> set-power-levels. The only concurrent RTAS calls were the dying CPU
> calling "stop-self", and the CPU running the DLPAR operation calling
> "query-cpu-stopped-state". The crash signature is appended below as
> well.
>
> 3) Modifying your patch to remove the udelay and increase the loop
> count from 25 to 1000 doesn't improve the situation. We are still able
> to see the crash.
>
> 4) With my patch, even without any additional debug, I was able to
> observe the system run the tests successfully for over a week (I
> started the tests before the Christmas weekend, and forgot to turn it
> off!)

So does this mean that the problem is fixed with your patch?

>
> It appears that there is a narrow race window involving rtas-stop-self
> and rtas-query-cpu-stopped-state calls that can be observed with your
> patch. Adding any printk's seems to reduce the probability of hitting
> this race window. It might be worth the while to check with RTAS
> folks, if they suspect something here.

What would the RTAS folks be looking at here? The 'narrow race window'
is with respect to a patch that it sounds like we should not be using.

Thanks.
Michael

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]


2019-01-17 09:09:41

by Gautham R Shenoy

[permalink] [raw]
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

Hello Michael,

On Mon, Jan 14, 2019 at 12:11:44PM -0600, Michael Bringmann wrote:
> On 1/9/19 12:08 AM, Gautham R Shenoy wrote:
>
> > I did some testing during the holidays. Here are the observations:
> >
> > 1) With just your patch (without any additional debug patch), if I run
> > DLPAR on/off operations on a system that has SMT=off, I am able to
> > see a crash involving RTAS stack corruption within an hour's time.
> >
> > 2) With the debug patch (appended below) which has additional debug to
> > capture the callers of stop-self, start-cpu, set-power-levels, the
> > system is able to perform DLPAR on/off operations on a system with
> > SMT=off for three days. And then, it crashed with the dead CPU showing
> > a "Bad kernel stack pointer". From this log, I can clearly
> > see that there were no concurrent calls to stop-self, start-cpu,
> > set-power-levels. The only concurrent RTAS calls were the dying CPU
> > calling "stop-self", and the CPU running the DLPAR operation calling
> > "query-cpu-stopped-state". The crash signature is appended below as
> > well.
> >
> > 3) Modifying your patch to remove the udelay and increase the loop
> > count from 25 to 1000 doesn't improve the situation. We are still able
> > to see the crash.
> >
> > 4) With my patch, even without any additional debug, I was able to
> > observe the system run the tests successfully for over a week (I
> > started the tests before the Christmas weekend, and forgot to turn it
> > off!)
>
> So does this mean that the problem is fixed with your patch?

No. On the contrary, I think my patch is unable to exploit the possible
race window. From a technical point of view, Thiago's patch does the
right things:

- It waits for the target CPU to come out of CEDE and set its state
to CPU_STATE_OFFLINE.

- Only then does it make the "query-cpu-stopped-state" rtas call in
  a loop, while giving sufficient delay between successive
  queries. This reduces unnecessary rtas calls.

In my patch, I don't do any of this, but simply keep making the
"query-cpu-stopped-state" call in a loop for 4000 iterations.

That said, if I incorporate the wait for the target CPU to set its
state to CPU_STATE_OFFLINE into my patch and only then make the
"query-cpu-stopped-state" rtas call, then I too am able to get the
crash with the "RTAS CALL BUFFER CORRUPTION" message.

I think that incorporating the wait for the target CPU to set its state
to CPU_STATE_OFFLINE increases the probability that the
"query-cpu-stopped-state" and "stop-self" rtas calls get made more
or less at the same time. However, since concurrent invocations of
these rtas calls are allowed by the PAPR, it should not result in a
"RTAS CALL BUFFER CORRUPTION". Am I missing something here?

>
> >
> > It appears that there is a narrow race window involving rtas-stop-self
> > and rtas-query-cpu-stopped-state calls that can be observed with your
> > patch. Adding any printk's seems to reduce the probability of hitting
> > this race window. It might be worth the while to check with RTAS
> > folks, if they suspect something here.
>
> What would the RTAS folks be looking at here? The 'narrow race window'
> is with respect to a patch that it sounds like we should not be
> using.

IMHO, if the race-window exists, it would be better to confirm and fix
it, given that we have a patch that can exploit it consistently.

>
> Thanks.
> Michael
>
> --
> Michael W. Bringmann
> Linux Technology Center
> IBM Corporation
> Tie-Line 363-5196
> External: (512) 286-5196
> Cell: (512) 466-0650
> [email protected]

--
Thanks and Regards
gautham.


2022-07-07 10:00:37

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop



On 06/12/2018 at 12:31, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <[email protected]>
>
> Currently running DLPAR offline/online operations in a loop on a
> POWER9 system with SMT=off results in the following crash:
>
> [ 223.321032] cpu 112 (hwid 112) Ready to die...
> [ 223.355963] Querying DEAD? cpu 113 (113) shows 2
> [ 223.356233] cpu 114 (hwid 114) Ready to die...
> [ 223.356236] cpu 113 (hwid 113) Ready to die...
> [ 223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50
> [ 223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1]
> [ 223.356244] LE SMP NR_CPUS=2048 NUMA pSeries
> [ 223.356247] Modules linked in:
> [ 223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39
> [ 223.356258] NIP: 000000001faf6d50 LR: 000000001faf6d50 CTR: 000000001ec6d06c
> [ 223.356259] REGS: c00000001e5cbd30 TRAP: 0700 Not tainted (4.20.0-rc3)
> [ 223.356260] MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000020
> [ 223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033
> GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 00000267e6a273c3
> GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 000000001faf6d30
> GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 0000000000000000
> GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 000000001eea3f40
> GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 c0000000013d5300
> GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 c0000000013d6888
> GPR24: 0000000000000072 c0000000013d688c 0000000000000002 c0000000013d688c
> GPR28: c0000000019cecf0 0000000000000390 0000000000000000 000000001faf6ca0
> [ 223.356281] NIP [000000001faf6d50] 0x1faf6d50
> [ 223.356281] LR [000000001faf6d50] 0x1faf6d50
> [ 223.356282] Call Trace:
> [ 223.356283] Instruction dump:
> [ 223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [ 223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [ 223.356290] ---[ end trace f46a4e046b564d1f ]---
>
> This is due to multiple offlined CPUs (CPU 113 and CPU 114 above)
> concurrently (within 3us) trying to make the rtas-call with the
> "stop-self" token, something that is prohibited by the PAPR.
>
> The concurrent calls can happen as follows.
>
> o In dlpar_offline_cpu() we prod an offline CPU X (offline due to
> SMT=off) and loop for 25 tries in pseries_cpu_die() querying if
> the target CPU X has been stopped in RTAS. After 25 tries, we
> prints the message "Querying DEAD? cpu X (X) shows 2" and return
> to dlpar_offline_cpu(). Note that at this point CPU X has not yet
> called rtas with the "stop-self" token, but can do so anytime now.
>
> o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y
> promptly wakes up and calls RTAS with the "stop-self" token.
>
> o Before RTAS can stop CPU Y, CPU X also calls RTAS with the
> "stop-self" token.
>
> The problem occurs due to the short number of tries (25) in
> pseries_cpu_die() which covers 90% of the cases. For the remaining 10%
> of the cases, it was observed that we would need to loop for a few
> hundred iterations before the target CPU calls rtas with "stop-self"
> token.Experiments show that each try takes roughly ~25us.
>
> In this patch we fix the problem by increasing the number of tries
> from 25 to 4000 (which roughly corresponds to 100ms) before bailing
> out and declaring that we have failed to observe the target CPU call
> rtas-stop-self. This provides sufficient serialization to ensure that
> there no concurrent rtas calls with "stop-self" token.
>
> Reported-by: Michael Bringmann <[email protected]>
> Signed-off-by: Gautham R. Shenoy <[email protected]>

This patch badly conflicts with 801980f64979
("powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death")

Is it still applicable? If so, please rebase and resubmit.

Thanks
Christophe

> ---
> arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 2f8e621..c913c44 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -214,14 +214,25 @@ static void pseries_cpu_die(unsigned int cpu)
> msleep(1);
> }
> } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
> + int max_tries = 4000; /* Roughly corresponds to 100ms */
> + u64 tb_before = mftb();
>
> - for (tries = 0; tries < 25; tries++) {
> + for (tries = 0; tries < max_tries; tries++) {
> cpu_status = smp_query_cpu_stopped(pcpu);
> if (cpu_status == QCSS_STOPPED ||
> cpu_status == QCSS_HARDWARE_ERROR)
> break;
> cpu_relax();
> }
> +
> + if (tries == max_tries) {
> + u64 time_elapsed_us = div_u64(mftb() - tb_before,
> + tb_ticks_per_usec);
> +
> + pr_warn("Offlined CPU %d isn't stopped by RTAS after %llu us\n",
> + cpu, time_elapsed_us);
> + WARN_ON(1);
> + }
> }
>
> if (cpu_status != 0) {