2021-04-09 05:57:46

by Dongli Zhang

[permalink] [raw]
Subject: [PATCH v2 1/1] kernel/cpu: to track which CPUHP callback is failed

During bootup or cpu hotplug, the cpuhp_up_callbacks() or
cpuhp_down_callbacks() call many CPUHP callbacks (e.g., perf, mm,
workqueue, RCU, kvmclock and more) for each cpu to online/offline. It may
roll back to its previous state if any of callbacks is failed. As a result,
the user will not be able to know which callback is failed and usually the
only symptom is cpu online/offline failure.

This patch is to print more debug log to help user narrow down where is the
root cause.

Below is the example that how the patch helps narrow down the root cause
for the issue fixed by commit d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64
can't be online/hotpluged").

We will have below dynamic debug log once we add
dyndbg="file kernel/cpu.c +p" to kernel command line and when issue is
reproduced.

"CPUHP up callback failure (-12) for cpu 64 at kvmclock:setup_percpu (66)"

Cc: Joe Jin <[email protected]>
Signed-off-by: Dongli Zhang <[email protected]>
---
Changed since v1 RFC:
- use pr_debug() but not pr_err_once() (suggested by Qais Yousef)
- print log for cpuhp_down_callbacks() as well (suggested by Qais Yousef)

kernel/cpu.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 1b6302ecbabe..bcd4dd7de9c3 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -621,6 +621,10 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
st->state++;
ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
if (ret) {
+ pr_debug("CPUHP up callback failure (%d) for cpu %u at %s (%d)\n",
+ ret, cpu, cpuhp_get_step(st->state)->name,
+ st->state);
+
if (can_rollback_cpu(st)) {
st->target = prev_state;
undo_cpu_up(cpu, st);
@@ -990,6 +994,10 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
for (; st->state > target; st->state--) {
ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
if (ret) {
+ pr_debug("CPUHP down callback failure (%d) for cpu %u at %s (%d)\n",
+ ret, cpu, cpuhp_get_step(st->state)->name,
+ st->state);
+
st->target = prev_state;
if (st->state < prev_state)
undo_cpu_down(cpu, st);
--
2.17.1


2021-04-13 17:49:47

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] kernel/cpu: to track which CPUHP callback is failed

On 04/08/21 22:53, Dongli Zhang wrote:
> During bootup or cpu hotplug, the cpuhp_up_callbacks() or
> cpuhp_down_callbacks() call many CPUHP callbacks (e.g., perf, mm,
> workqueue, RCU, kvmclock and more) for each cpu to online/offline. It may
> roll back to its previous state if any of callbacks is failed. As a result,
> the user will not be able to know which callback is failed and usually the
> only symptom is cpu online/offline failure.
>
> This patch is to print more debug log to help user narrow down where is the
> root cause.
>
> Below is the example that how the patch helps narrow down the root cause
> for the issue fixed by commit d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64
> can't be online/hotpluged").
>
> We will have below dynamic debug log once we add
> dyndbg="file kernel/cpu.c +p" to kernel command line and when issue is
> reproduced.

You can also enable it at runtime

echo "file kernel/cpu.c +p" > /sys/kernel/debug/dynamic_debug/control

>
> "CPUHP up callback failure (-12) for cpu 64 at kvmclock:setup_percpu (66)"
>
> Cc: Joe Jin <[email protected]>
> Signed-off-by: Dongli Zhang <[email protected]>
> ---

I don't see the harm in adding the debug if some find it useful.

FWIW

Reviewed-by: Qais Yousef <[email protected]>

Cheers

--
Qais Yousef

> Changed since v1 RFC:
> - use pr_debug() but not pr_err_once() (suggested by Qais Yousef)
> - print log for cpuhp_down_callbacks() as well (suggested by Qais Yousef)
>
> kernel/cpu.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 1b6302ecbabe..bcd4dd7de9c3 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -621,6 +621,10 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
> st->state++;
> ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
> if (ret) {
> + pr_debug("CPUHP up callback failure (%d) for cpu %u at %s (%d)\n",
> + ret, cpu, cpuhp_get_step(st->state)->name,
> + st->state);
> +
> if (can_rollback_cpu(st)) {
> st->target = prev_state;
> undo_cpu_up(cpu, st);
> @@ -990,6 +994,10 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
> for (; st->state > target; st->state--) {
> ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
> if (ret) {
> + pr_debug("CPUHP down callback failure (%d) for cpu %u at %s (%d)\n",
> + ret, cpu, cpuhp_get_step(st->state)->name,
> + st->state);
> +
> st->target = prev_state;
> if (st->state < prev_state)
> undo_cpu_down(cpu, st);
> --
> 2.17.1
>

2021-08-10 16:47:42

by tip-bot2 for Tony Luck

[permalink] [raw]
Subject: [tip: smp/core] cpu/hotplug: Add debug printks for hotplug callback failures

The following commit has been merged into the smp/core branch of tip:

Commit-ID: ebca71a8c96f0af2ba482489ecc64d88979cd825
Gitweb: https://git.kernel.org/tip/ebca71a8c96f0af2ba482489ecc64d88979cd825
Author: Dongli Zhang <[email protected]>
AuthorDate: Thu, 08 Apr 2021 22:53:16 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Tue, 10 Aug 2021 18:31:32 +02:00

cpu/hotplug: Add debug printks for hotplug callback failures

CPU hotplug callbacks can fail and cause a rollback to the previous
state. These failures are silent and therefore hard to debug.

Add pr_debug() to the up and down paths which provide information about the
error code, the CPU and the failed state. The debug printks can be enabled
via kernel command line or sysfs.

[ tglx: Adopt to current mainline, massage printk and changelog ]

Signed-off-by: Dongli Zhang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
kernel/cpu.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 7ef28e1..192e43a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -692,6 +692,10 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,

ret = cpuhp_invoke_callback_range(true, cpu, st, target);
if (ret) {
+ pr_debug("CPU UP failed (%d) CPU %u state %s (%d)\n",
+ ret, cpu, cpuhp_get_step(st->state)->name,
+ st->state);
+
cpuhp_reset_state(st, prev_state);
if (can_rollback_cpu(st))
WARN_ON(cpuhp_invoke_callback_range(false, cpu, st,
@@ -1091,6 +1095,9 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,

ret = cpuhp_invoke_callback_range(false, cpu, st, target);
if (ret) {
+ pr_debug("CPU DOWN failed (%d) CPU %u state %s (%d)\n",
+ ret, cpu, cpuhp_get_step(st->state)->name,
+ st->state);

cpuhp_reset_state(st, prev_state);