2018-12-05 03:41:16

by Doug Anderson

[permalink] [raw]
Subject: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

This series was originally part of the series ("serial: Finish kgdb on
qcom_geni; fix many lockdep splats w/ kgdb") but it made sense to
split it up.

It's believed that dropping into kgdb should be more robust once these
patches are applied.

Repost of v6 adds CC's and also tags already received.

Changes in v6:
- Moved smp_call_function_single_async() error check to patch 3.

Changes in v5:
- Add a comment about get_irq_regs().
- get_cpu() => raw_smp_processor_id() in kgdb_roundup_cpus().
- for_each_cpu() => for_each_online_cpu()
- Error check smp_call_function_single_async()

Changes in v4:
- Removed smp_mb() calls.
- Also clear out .debuggerinfo.
- Also clear out .debuggerinfo and .task for the master.
- Remove clearing out in kdb_stub for offline CPUs; it's now redundant.

Changes in v3:
- No separate init call.
- Don't round up the CPU that is doing the rounding up.
- Add "#ifdef CONFIG_SMP" to match the rest of the file.
- Updated desc saying we don't solve the "failed to roundup" case.
- Document the ignored parameter.
- Don't round up a CPU that failed rounding up before new for v3.
- Don't back trace on a cpu that didn't round up new for v3.

Changes in v2:
- Removing irq flags separated from fixing lockdep splat.
- Don't use smp_call_function (Daniel).

Douglas Anderson (4):
kgdb: Remove irq flags from roundup
kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
kgdb: Don't round up a CPU that failed rounding up before
kdb: Don't back trace on a cpu that didn't round up

arch/arc/kernel/kgdb.c | 10 +----
arch/arm/kernel/kgdb.c | 12 ------
arch/arm64/kernel/kgdb.c | 12 ------
arch/hexagon/kernel/kgdb.c | 32 ----------------
arch/mips/kernel/kgdb.c | 9 +----
arch/powerpc/kernel/kgdb.c | 6 +--
arch/sh/kernel/kgdb.c | 12 ------
arch/sparc/kernel/smp_64.c | 2 +-
arch/x86/kernel/kgdb.c | 9 +----
include/linux/kgdb.h | 22 +++++++----
kernel/debug/debug_core.c | 65 ++++++++++++++++++++++++++++++++-
kernel/debug/debug_core.h | 1 +
kernel/debug/kdb/kdb_bt.c | 11 +++++-
kernel/debug/kdb/kdb_debugger.c | 7 ----
14 files changed, 98 insertions(+), 112 deletions(-)

--
2.20.0.rc1.387.gf8505762e3-goog



2018-12-05 03:39:57

by Doug Anderson

[permalink] [raw]
Subject: [REPOST PATCH v6 4/4] kdb: Don't back trace on a cpu that didn't round up

If you have a CPU that fails to round up and then run 'btc' you'll end
up crashing in kdb becaue we dereferenced NULL. Let's add a check.
It's wise to also set the task to NULL when leaving the debugger so
that if we fail to round up on a later entry into the debugger we
won't backtrace a stale task.

Signed-off-by: Douglas Anderson <[email protected]>
Acked-by: Daniel Thompson <[email protected]>
---

Changes in v6: None
Changes in v5: None
Changes in v4:
- Also clear out .debuggerinfo.
- Also clear out .debuggerinfo and .task for the master.
- Remove clearing out in kdb_stub for offline CPUs; it's now redundant.

Changes in v3:
- Don't back trace on a cpu that didn't round up new for v3.

Changes in v2: None

kernel/debug/debug_core.c | 4 ++++
kernel/debug/kdb/kdb_bt.c | 11 ++++++++++-
kernel/debug/kdb/kdb_debugger.c | 7 -------
3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 1fb8b239e567..5cc608de6883 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -592,6 +592,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
arch_kgdb_ops.correct_hw_break();
if (trace_on)
tracing_on();
+ kgdb_info[cpu].debuggerinfo = NULL;
+ kgdb_info[cpu].task = NULL;
kgdb_info[cpu].exception_state &=
~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
kgdb_info[cpu].enter_kgdb--;
@@ -724,6 +726,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
if (trace_on)
tracing_on();

+ kgdb_info[cpu].debuggerinfo = NULL;
+ kgdb_info[cpu].task = NULL;
kgdb_info[cpu].exception_state &=
~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
kgdb_info[cpu].enter_kgdb--;
diff --git a/kernel/debug/kdb/kdb_bt.c b/kernel/debug/kdb/kdb_bt.c
index 7921ae4fca8d..7e2379aa0a1e 100644
--- a/kernel/debug/kdb/kdb_bt.c
+++ b/kernel/debug/kdb/kdb_bt.c
@@ -186,7 +186,16 @@ kdb_bt(int argc, const char **argv)
kdb_printf("btc: cpu status: ");
kdb_parse("cpu\n");
for_each_online_cpu(cpu) {
- sprintf(buf, "btt 0x%px\n", KDB_TSK(cpu));
+ void *kdb_tsk = KDB_TSK(cpu);
+
+ /* If a CPU failed to round up we could be here */
+ if (!kdb_tsk) {
+ kdb_printf("WARNING: no task for cpu %ld\n",
+ cpu);
+ continue;
+ }
+
+ sprintf(buf, "btt 0x%px\n", kdb_tsk);
kdb_parse(buf);
touch_nmi_watchdog();
}
diff --git a/kernel/debug/kdb/kdb_debugger.c b/kernel/debug/kdb/kdb_debugger.c
index 15e1a7af5dd0..53a0df6e4d92 100644
--- a/kernel/debug/kdb/kdb_debugger.c
+++ b/kernel/debug/kdb/kdb_debugger.c
@@ -118,13 +118,6 @@ int kdb_stub(struct kgdb_state *ks)
kdb_bp_remove();
KDB_STATE_CLEAR(DOING_SS);
KDB_STATE_SET(PAGER);
- /* zero out any offline cpu data */
- for_each_present_cpu(i) {
- if (!cpu_online(i)) {
- kgdb_info[i].debuggerinfo = NULL;
- kgdb_info[i].task = NULL;
- }
- }
if (ks->err_code == DIE_OOPS || reason == KDB_REASON_OOPS) {
ks->pass_exception = 1;
KDB_FLAG_SET(CATASTROPHIC);
--
2.20.0.rc1.387.gf8505762e3-goog


2018-12-05 03:39:58

by Doug Anderson

[permalink] [raw]
Subject: [REPOST PATCH v6 1/4] kgdb: Remove irq flags from roundup

The function kgdb_roundup_cpus() was passed a parameter that was
documented as:

> the flags that will be used when restoring the interrupts. There is
> local_irq_save() call before kgdb_roundup_cpus().

Nobody used those flags. Anyone who wanted to temporarily turn on
interrupts just did local_irq_enable() and local_irq_disable() without
looking at them. So we can definitely remove the flags.

Signed-off-by: Douglas Anderson <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Acked-by: Will Deacon <[email protected]>
---

Changes in v6: None
Changes in v5: None
Changes in v4: None
Changes in v3: None
Changes in v2:
- Removing irq flags separated from fixing lockdep splat.

arch/arc/kernel/kgdb.c | 2 +-
arch/arm/kernel/kgdb.c | 2 +-
arch/arm64/kernel/kgdb.c | 2 +-
arch/hexagon/kernel/kgdb.c | 9 ++-------
arch/mips/kernel/kgdb.c | 2 +-
arch/powerpc/kernel/kgdb.c | 2 +-
arch/sh/kernel/kgdb.c | 2 +-
arch/sparc/kernel/smp_64.c | 2 +-
arch/x86/kernel/kgdb.c | 9 ++-------
include/linux/kgdb.h | 9 ++-------
kernel/debug/debug_core.c | 2 +-
11 files changed, 14 insertions(+), 29 deletions(-)

diff --git a/arch/arc/kernel/kgdb.c b/arch/arc/kernel/kgdb.c
index 9a3c34af2ae8..0932851028e0 100644
--- a/arch/arc/kernel/kgdb.c
+++ b/arch/arc/kernel/kgdb.c
@@ -197,7 +197,7 @@ static void kgdb_call_nmi_hook(void *ignored)
kgdb_nmicallback(raw_smp_processor_id(), NULL);
}

-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
local_irq_enable();
smp_call_function(kgdb_call_nmi_hook, NULL, 0);
diff --git a/arch/arm/kernel/kgdb.c b/arch/arm/kernel/kgdb.c
index caa0dbe3dc61..f21077b077be 100644
--- a/arch/arm/kernel/kgdb.c
+++ b/arch/arm/kernel/kgdb.c
@@ -175,7 +175,7 @@ static void kgdb_call_nmi_hook(void *ignored)
kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
}

-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
local_irq_enable();
smp_call_function(kgdb_call_nmi_hook, NULL, 0);
diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c
index a20de58061a8..12c339ff6e75 100644
--- a/arch/arm64/kernel/kgdb.c
+++ b/arch/arm64/kernel/kgdb.c
@@ -289,7 +289,7 @@ static void kgdb_call_nmi_hook(void *ignored)
kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
}

-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
local_irq_enable();
smp_call_function(kgdb_call_nmi_hook, NULL, 0);
diff --git a/arch/hexagon/kernel/kgdb.c b/arch/hexagon/kernel/kgdb.c
index 16c24b22d0b2..012e0e230ac2 100644
--- a/arch/hexagon/kernel/kgdb.c
+++ b/arch/hexagon/kernel/kgdb.c
@@ -119,17 +119,12 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long pc)

/**
* kgdb_roundup_cpus - Get other CPUs into a holding pattern
- * @flags: Current IRQ state
*
* On SMP systems, we need to get the attention of the other CPUs
* and get them be in a known state. This should do what is needed
* to get the other CPUs to call kgdb_wait(). Note that on some arches,
* the NMI approach is not used for rounding up all the CPUs. For example,
- * in case of MIPS, smp_call_function() is used to roundup CPUs. In
- * this case, we have to make sure that interrupts are enabled before
- * calling smp_call_function(). The argument to this function is
- * the flags that will be used when restoring the interrupts. There is
- * local_irq_save() call before kgdb_roundup_cpus().
+ * in case of MIPS, smp_call_function() is used to roundup CPUs.
*
* On non-SMP systems, this is not called.
*/
@@ -139,7 +134,7 @@ static void hexagon_kgdb_nmi_hook(void *ignored)
kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
}

-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
local_irq_enable();
smp_call_function(hexagon_kgdb_nmi_hook, NULL, 0);
diff --git a/arch/mips/kernel/kgdb.c b/arch/mips/kernel/kgdb.c
index eb6c0d582626..2b05effc17b4 100644
--- a/arch/mips/kernel/kgdb.c
+++ b/arch/mips/kernel/kgdb.c
@@ -219,7 +219,7 @@ static void kgdb_call_nmi_hook(void *ignored)
set_fs(old_fs);
}

-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
local_irq_enable();
smp_call_function(kgdb_call_nmi_hook, NULL, 0);
diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
index 59c578f865aa..b0e804844be0 100644
--- a/arch/powerpc/kernel/kgdb.c
+++ b/arch/powerpc/kernel/kgdb.c
@@ -124,7 +124,7 @@ static int kgdb_call_nmi_hook(struct pt_regs *regs)
}

#ifdef CONFIG_SMP
-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
smp_send_debugger_break();
}
diff --git a/arch/sh/kernel/kgdb.c b/arch/sh/kernel/kgdb.c
index 4f04c6638a4d..cc57630f6bf2 100644
--- a/arch/sh/kernel/kgdb.c
+++ b/arch/sh/kernel/kgdb.c
@@ -319,7 +319,7 @@ static void kgdb_call_nmi_hook(void *ignored)
kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
}

-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
local_irq_enable();
smp_call_function(kgdb_call_nmi_hook, NULL, 0);
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index 4792e08ad36b..f45d876983f1 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1014,7 +1014,7 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
}

#ifdef CONFIG_KGDB
-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
smp_cross_call(&xcall_kgdb_capture, 0, 0, 0);
}
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 8e36f249646e..ac6291a4178d 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -422,21 +422,16 @@ static void kgdb_disable_hw_debug(struct pt_regs *regs)
#ifdef CONFIG_SMP
/**
* kgdb_roundup_cpus - Get other CPUs into a holding pattern
- * @flags: Current IRQ state
*
* On SMP systems, we need to get the attention of the other CPUs
* and get them be in a known state. This should do what is needed
* to get the other CPUs to call kgdb_wait(). Note that on some arches,
* the NMI approach is not used for rounding up all the CPUs. For example,
- * in case of MIPS, smp_call_function() is used to roundup CPUs. In
- * this case, we have to make sure that interrupts are enabled before
- * calling smp_call_function(). The argument to this function is
- * the flags that will be used when restoring the interrupts. There is
- * local_irq_save() call before kgdb_roundup_cpus().
+ * in case of MIPS, smp_call_function() is used to roundup CPUs.
*
* On non-SMP systems, this is not called.
*/
-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
apic->send_IPI_allbutself(APIC_DM_NMI);
}
diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
index e465bb15912d..05e5b2eb0d32 100644
--- a/include/linux/kgdb.h
+++ b/include/linux/kgdb.h
@@ -178,21 +178,16 @@ kgdb_arch_handle_exception(int vector, int signo, int err_code,

/**
* kgdb_roundup_cpus - Get other CPUs into a holding pattern
- * @flags: Current IRQ state
*
* On SMP systems, we need to get the attention of the other CPUs
* and get them into a known state. This should do what is needed
* to get the other CPUs to call kgdb_wait(). Note that on some arches,
* the NMI approach is not used for rounding up all the CPUs. For example,
- * in case of MIPS, smp_call_function() is used to roundup CPUs. In
- * this case, we have to make sure that interrupts are enabled before
- * calling smp_call_function(). The argument to this function is
- * the flags that will be used when restoring the interrupts. There is
- * local_irq_save() call before kgdb_roundup_cpus().
+ * in case of MIPS, smp_call_function() is used to roundup CPUs.
*
* On non-SMP systems, this is not called.
*/
-extern void kgdb_roundup_cpus(unsigned long flags);
+extern void kgdb_roundup_cpus(void);

/**
* kgdb_arch_set_pc - Generic call back to the program counter
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 65c0f1363788..f3cadda45f07 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -593,7 +593,7 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,

/* Signal the other CPUs to enter kgdb_wait() */
else if ((!kgdb_single_step) && kgdb_do_roundup)
- kgdb_roundup_cpus(flags);
+ kgdb_roundup_cpus();
#endif

/*
--
2.20.0.rc1.387.gf8505762e3-goog


2018-12-05 03:40:25

by Doug Anderson

[permalink] [raw]
Subject: [REPOST PATCH v6 2/4] kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()

When I had lockdep turned on and dropped into kgdb I got a nice splat
on my system. Specifically it hit:
DEBUG_LOCKS_WARN_ON(current->hardirq_context)

Specifically it looked like this:
sysrq: SysRq : DEBUG
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(current->hardirq_context)
WARNING: CPU: 0 PID: 0 at .../kernel/locking/lockdep.c:2875 lockdep_hardirqs_on+0xf0/0x160
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0 #27
pstate: 604003c9 (nZCv DAIF +PAN -UAO)
pc : lockdep_hardirqs_on+0xf0/0x160
...
Call trace:
lockdep_hardirqs_on+0xf0/0x160
trace_hardirqs_on+0x188/0x1ac
kgdb_roundup_cpus+0x14/0x3c
kgdb_cpu_enter+0x53c/0x5cc
kgdb_handle_exception+0x180/0x1d4
kgdb_compiled_brk_fn+0x30/0x3c
brk_handler+0x134/0x178
do_debug_exception+0xfc/0x178
el1_dbg+0x18/0x78
kgdb_breakpoint+0x34/0x58
sysrq_handle_dbg+0x54/0x5c
__handle_sysrq+0x114/0x21c
handle_sysrq+0x30/0x3c
qcom_geni_serial_isr+0x2dc/0x30c
...
...
irq event stamp: ...45
hardirqs last enabled at (...44): [...] __do_softirq+0xd8/0x4e4
hardirqs last disabled at (...45): [...] el1_irq+0x74/0x130
softirqs last enabled at (...42): [...] _local_bh_enable+0x2c/0x34
softirqs last disabled at (...43): [...] irq_exit+0xa8/0x100
---[ end trace adf21f830c46e638 ]---

Looking closely at it, it seems like a really bad idea to be calling
local_irq_enable() in kgdb_roundup_cpus(). If nothing else that seems
like it could violate spinlock semantics and cause a deadlock.

Instead, let's use a private csd alongside
smp_call_function_single_async() to round up the other CPUs. Using
smp_call_function_single_async() doesn't require interrupts to be
enabled so we can remove the offending bit of code.

In order to avoid duplicating this across all the architectures that
use the default kgdb_roundup_cpus(), we'll add a "weak" implementation
to debug_core.c.

Looking at all the people who previously had copies of this code,
there were a few variants. I've attempted to keep the variants
working like they used to. Specifically:
* For arch/arc we passed NULL to kgdb_nmicallback() instead of
get_irq_regs().
* For arch/mips there was a bit of extra code around
kgdb_nmicallback()

NOTE: In this patch we will still get into trouble if we try to round
up a CPU that failed to round up before. We'll try to round it up
again and potentially hang when we try to grab the csd lock. That's
not new behavior but we'll still try to do better in a future patch.

Suggested-by: Daniel Thompson <[email protected]>
Signed-off-by: Douglas Anderson <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Acked-by: Will Deacon <[email protected]>
---

Changes in v6:
- Moved smp_call_function_single_async() error check to patch 3.

Changes in v5:
- Add a comment about get_irq_regs().
- get_cpu() => raw_smp_processor_id() in kgdb_roundup_cpus().
- for_each_cpu() => for_each_online_cpu()
- Error check smp_call_function_single_async()

Changes in v4: None
Changes in v3:
- No separate init call.
- Don't round up the CPU that is doing the rounding up.
- Add "#ifdef CONFIG_SMP" to match the rest of the file.
- Updated desc saying we don't solve the "failed to roundup" case.
- Document the ignored parameter.

Changes in v2:
- Removing irq flags separated from fixing lockdep splat.
- Don't use smp_call_function (Daniel).

arch/arc/kernel/kgdb.c | 10 ++--------
arch/arm/kernel/kgdb.c | 12 -----------
arch/arm64/kernel/kgdb.c | 12 -----------
arch/hexagon/kernel/kgdb.c | 27 -------------------------
arch/mips/kernel/kgdb.c | 9 +--------
arch/powerpc/kernel/kgdb.c | 4 ++--
arch/sh/kernel/kgdb.c | 12 -----------
include/linux/kgdb.h | 15 ++++++++++++--
kernel/debug/debug_core.c | 41 ++++++++++++++++++++++++++++++++++++++
9 files changed, 59 insertions(+), 83 deletions(-)

diff --git a/arch/arc/kernel/kgdb.c b/arch/arc/kernel/kgdb.c
index 0932851028e0..68d9fe4b5aa7 100644
--- a/arch/arc/kernel/kgdb.c
+++ b/arch/arc/kernel/kgdb.c
@@ -192,18 +192,12 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long ip)
instruction_pointer(regs) = ip;
}

-static void kgdb_call_nmi_hook(void *ignored)
+void kgdb_call_nmi_hook(void *ignored)
{
+ /* Default implementation passes get_irq_regs() but we don't */
kgdb_nmicallback(raw_smp_processor_id(), NULL);
}

-void kgdb_roundup_cpus(void)
-{
- local_irq_enable();
- smp_call_function(kgdb_call_nmi_hook, NULL, 0);
- local_irq_disable();
-}
-
struct kgdb_arch arch_kgdb_ops = {
/* breakpoint instruction: TRAP_S 0x3 */
#ifdef CONFIG_CPU_BIG_ENDIAN
diff --git a/arch/arm/kernel/kgdb.c b/arch/arm/kernel/kgdb.c
index f21077b077be..d9a69e941463 100644
--- a/arch/arm/kernel/kgdb.c
+++ b/arch/arm/kernel/kgdb.c
@@ -170,18 +170,6 @@ static struct undef_hook kgdb_compiled_brkpt_hook = {
.fn = kgdb_compiled_brk_fn
};

-static void kgdb_call_nmi_hook(void *ignored)
-{
- kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
-}
-
-void kgdb_roundup_cpus(void)
-{
- local_irq_enable();
- smp_call_function(kgdb_call_nmi_hook, NULL, 0);
- local_irq_disable();
-}
-
static int __kgdb_notify(struct die_args *args, unsigned long cmd)
{
struct pt_regs *regs = args->regs;
diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c
index 12c339ff6e75..da880247c734 100644
--- a/arch/arm64/kernel/kgdb.c
+++ b/arch/arm64/kernel/kgdb.c
@@ -284,18 +284,6 @@ static struct step_hook kgdb_step_hook = {
.fn = kgdb_step_brk_fn
};

-static void kgdb_call_nmi_hook(void *ignored)
-{
- kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
-}
-
-void kgdb_roundup_cpus(void)
-{
- local_irq_enable();
- smp_call_function(kgdb_call_nmi_hook, NULL, 0);
- local_irq_disable();
-}
-
static int __kgdb_notify(struct die_args *args, unsigned long cmd)
{
struct pt_regs *regs = args->regs;
diff --git a/arch/hexagon/kernel/kgdb.c b/arch/hexagon/kernel/kgdb.c
index 012e0e230ac2..b95d12038a4e 100644
--- a/arch/hexagon/kernel/kgdb.c
+++ b/arch/hexagon/kernel/kgdb.c
@@ -115,33 +115,6 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long pc)
instruction_pointer(regs) = pc;
}

-#ifdef CONFIG_SMP
-
-/**
- * kgdb_roundup_cpus - Get other CPUs into a holding pattern
- *
- * On SMP systems, we need to get the attention of the other CPUs
- * and get them be in a known state. This should do what is needed
- * to get the other CPUs to call kgdb_wait(). Note that on some arches,
- * the NMI approach is not used for rounding up all the CPUs. For example,
- * in case of MIPS, smp_call_function() is used to roundup CPUs.
- *
- * On non-SMP systems, this is not called.
- */
-
-static void hexagon_kgdb_nmi_hook(void *ignored)
-{
- kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
-}
-
-void kgdb_roundup_cpus(void)
-{
- local_irq_enable();
- smp_call_function(hexagon_kgdb_nmi_hook, NULL, 0);
- local_irq_disable();
-}
-#endif
-

/* Not yet working */
void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs,
diff --git a/arch/mips/kernel/kgdb.c b/arch/mips/kernel/kgdb.c
index 2b05effc17b4..42f057a6c215 100644
--- a/arch/mips/kernel/kgdb.c
+++ b/arch/mips/kernel/kgdb.c
@@ -207,7 +207,7 @@ void arch_kgdb_breakpoint(void)
".set\treorder");
}

-static void kgdb_call_nmi_hook(void *ignored)
+void kgdb_call_nmi_hook(void *ignored)
{
mm_segment_t old_fs;

@@ -219,13 +219,6 @@ static void kgdb_call_nmi_hook(void *ignored)
set_fs(old_fs);
}

-void kgdb_roundup_cpus(void)
-{
- local_irq_enable();
- smp_call_function(kgdb_call_nmi_hook, NULL, 0);
- local_irq_disable();
-}
-
static int compute_signal(int tt)
{
struct hard_trap_info *ht;
diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
index b0e804844be0..b4ce54d73337 100644
--- a/arch/powerpc/kernel/kgdb.c
+++ b/arch/powerpc/kernel/kgdb.c
@@ -117,7 +117,7 @@ int kgdb_skipexception(int exception, struct pt_regs *regs)
return kgdb_isremovedbreak(regs->nip);
}

-static int kgdb_call_nmi_hook(struct pt_regs *regs)
+static int kgdb_debugger_ipi(struct pt_regs *regs)
{
kgdb_nmicallback(raw_smp_processor_id(), regs);
return 0;
@@ -502,7 +502,7 @@ int kgdb_arch_init(void)
old__debugger_break_match = __debugger_break_match;
old__debugger_fault_handler = __debugger_fault_handler;

- __debugger_ipi = kgdb_call_nmi_hook;
+ __debugger_ipi = kgdb_debugger_ipi;
__debugger = kgdb_debugger;
__debugger_bpt = kgdb_handle_breakpoint;
__debugger_sstep = kgdb_singlestep;
diff --git a/arch/sh/kernel/kgdb.c b/arch/sh/kernel/kgdb.c
index cc57630f6bf2..14e012ad7c57 100644
--- a/arch/sh/kernel/kgdb.c
+++ b/arch/sh/kernel/kgdb.c
@@ -314,18 +314,6 @@ BUILD_TRAP_HANDLER(singlestep)
local_irq_restore(flags);
}

-static void kgdb_call_nmi_hook(void *ignored)
-{
- kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
-}
-
-void kgdb_roundup_cpus(void)
-{
- local_irq_enable();
- smp_call_function(kgdb_call_nmi_hook, NULL, 0);
- local_irq_disable();
-}
-
static int __kgdb_notify(struct die_args *args, unsigned long cmd)
{
int ret;
diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
index 05e5b2eb0d32..24422865cd18 100644
--- a/include/linux/kgdb.h
+++ b/include/linux/kgdb.h
@@ -176,14 +176,25 @@ kgdb_arch_handle_exception(int vector, int signo, int err_code,
char *remcom_out_buffer,
struct pt_regs *regs);

+/**
+ * kgdb_call_nmi_hook - Call kgdb_nmicallback() on the current CPU
+ * @ignored: This parameter is only here to match the prototype.
+ *
+ * If you're using the default implementation of kgdb_roundup_cpus()
+ * this function will be called per CPU. If you don't implement
+ * kgdb_call_nmi_hook() a default will be used.
+ */
+
+extern void kgdb_call_nmi_hook(void *ignored);
+
/**
* kgdb_roundup_cpus - Get other CPUs into a holding pattern
*
* On SMP systems, we need to get the attention of the other CPUs
* and get them into a known state. This should do what is needed
* to get the other CPUs to call kgdb_wait(). Note that on some arches,
- * the NMI approach is not used for rounding up all the CPUs. For example,
- * in case of MIPS, smp_call_function() is used to roundup CPUs.
+ * the NMI approach is not used for rounding up all the CPUs. Normally
+ * those architectures can just not implement this and get the default.
*
* On non-SMP systems, this is not called.
*/
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index f3cadda45f07..10db2833a423 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -55,6 +55,7 @@
#include <linux/mm.h>
#include <linux/vmacache.h>
#include <linux/rcupdate.h>
+#include <linux/irq.h>

#include <asm/cacheflush.h>
#include <asm/byteorder.h>
@@ -220,6 +221,46 @@ int __weak kgdb_skipexception(int exception, struct pt_regs *regs)
return 0;
}

+#ifdef CONFIG_SMP
+
+/*
+ * Default (weak) implementation for kgdb_roundup_cpus
+ */
+
+static DEFINE_PER_CPU(call_single_data_t, kgdb_roundup_csd);
+
+void __weak kgdb_call_nmi_hook(void *ignored)
+{
+ /*
+ * NOTE: get_irq_regs() is supposed to get the registers from
+ * before the IPI interrupt happened and so is supposed to
+ * show where the processor was. In some situations it's
+ * possible we might be called without an IPI, so it might be
+ * safer to figure out how to make kgdb_breakpoint() work
+ * properly here.
+ */
+ kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
+}
+
+void __weak kgdb_roundup_cpus(void)
+{
+ call_single_data_t *csd;
+ int this_cpu = raw_smp_processor_id();
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ /* No need to roundup ourselves */
+ if (cpu == this_cpu)
+ continue;
+
+ csd = &per_cpu(kgdb_roundup_csd, cpu);
+ csd->func = kgdb_call_nmi_hook;
+ smp_call_function_single_async(cpu, csd);
+ }
+}
+
+#endif
+
/*
* Some architectures need cache flushes when we set/clear a
* breakpoint:
--
2.20.0.rc1.387.gf8505762e3-goog


2018-12-05 03:40:57

by Doug Anderson

[permalink] [raw]
Subject: [REPOST PATCH v6 3/4] kgdb: Don't round up a CPU that failed rounding up before

If we're using the default implementation of kgdb_roundup_cpus() that
uses smp_call_function_single_async() we can end up hanging
kgdb_roundup_cpus() if we try to round up a CPU that failed to round
up before.

Specifically smp_call_function_single_async() will try to wait on the
csd lock for the CPU that we're trying to round up. If the previous
round up never finished then that lock could still be held and we'll
just sit there hanging.

There's not a lot of use trying to round up a CPU that failed to round
up before. Let's keep a flag that indicates whether the CPU started
but didn't finish to round up before. If we see that flag set then
we'll skip the next round up.

In general we have a few goals here:
- We never want to end up calling smp_call_function_single_async()
when the csd is still locked. This is accomplished because
flush_smp_call_function_queue() unlocks the csd _before_ invoking
the callback. That means that when kgdb_nmicallback() runs we know
for sure the the csd is no longer locked. Thus when we set
"rounding_up = false" we know for sure that the csd is unlocked.
- If there are no timeouts rounding up we should never skip a round
up.

NOTE #1: In general trying to continue running after failing to round
up CPUs doesn't appear to be supported in the debugger. When I
simulate this I find that kdb reports "Catastrophic error detected"
when I try to continue. I can overrule and continue anyway, but it
should be noted that we may be entering the land of dragons here.
Possibly the "Catastrophic error detected" was added _because_ of the
future failure to round up, but even so this is an area of the code
that hasn't been strongly tested.

NOTE #2: I did a bit of testing before and after this change. I
introduced a 10 second hang in the kernel while holding a spinlock
that I could invoke on a certain CPU with 'taskset -c 3 cat /sys/...".

Before this change if I did:
- Invoke hang
- Enter debugger
- g (which warns about Catastrophic error, g again to go anyway)
- g
- Enter debugger

...I'd hang the rest of the 10 seconds without getting a debugger
prompt. After this change I end up in the debugger the 2nd time after
only 1 second with the standard warning about 'Timed out waiting for
secondary CPUs.'

I'll also note that once the CPU finished waiting I could actually
debug it (aka "btc" worked)

I won't promise that everything works perfectly if the errant CPU
comes back at just the wrong time (like as we're entering or exiting
the debugger) but it certainly seems like an improvement.

NOTE #3: setting 'kgdb_info[cpu].rounding_up = false' is in
kgdb_nmicallback() instead of kgdb_call_nmi_hook() because some
implementations override kgdb_call_nmi_hook(). It shouldn't hurt to
have it in kgdb_nmicallback() in any case.

NOTE #4: this logic is really only needed because there is no API call
like "smp_try_call_function_single_async()" or "smp_csd_is_locked()".
If such an API existed then we'd use it instead, but it seemed a bit
much to add an API like this just for kgdb.

Signed-off-by: Douglas Anderson <[email protected]>
Acked-by: Daniel Thompson <[email protected]>
---

Changes in v6:
- Moved smp_call_function_single_async() error check to patch 3.

Changes in v5: None
Changes in v4:
- Removed smp_mb() calls.

Changes in v3:
- Don't round up a CPU that failed rounding up before new for v3.

Changes in v2: None

kernel/debug/debug_core.c | 20 +++++++++++++++++++-
kernel/debug/debug_core.h | 1 +
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 10db2833a423..1fb8b239e567 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -247,6 +247,7 @@ void __weak kgdb_roundup_cpus(void)
call_single_data_t *csd;
int this_cpu = raw_smp_processor_id();
int cpu;
+ int ret;

for_each_online_cpu(cpu) {
/* No need to roundup ourselves */
@@ -254,8 +255,23 @@ void __weak kgdb_roundup_cpus(void)
continue;

csd = &per_cpu(kgdb_roundup_csd, cpu);
+
+ /*
+ * If it didn't round up last time, don't try again
+ * since smp_call_function_single_async() will block.
+ *
+ * If rounding_up is false then we know that the
+ * previous call must have at least started and that
+ * means smp_call_function_single_async() won't block.
+ */
+ if (kgdb_info[cpu].rounding_up)
+ continue;
+ kgdb_info[cpu].rounding_up = true;
+
csd->func = kgdb_call_nmi_hook;
- smp_call_function_single_async(cpu, csd);
+ ret = smp_call_function_single_async(cpu, csd);
+ if (ret)
+ kgdb_info[cpu].rounding_up = false;
}
}

@@ -788,6 +804,8 @@ int kgdb_nmicallback(int cpu, void *regs)
struct kgdb_state kgdb_var;
struct kgdb_state *ks = &kgdb_var;

+ kgdb_info[cpu].rounding_up = false;
+
memset(ks, 0, sizeof(struct kgdb_state));
ks->cpu = cpu;
ks->linux_regs = regs;
diff --git a/kernel/debug/debug_core.h b/kernel/debug/debug_core.h
index 127d9bc49fb4..b4a7c326d546 100644
--- a/kernel/debug/debug_core.h
+++ b/kernel/debug/debug_core.h
@@ -42,6 +42,7 @@ struct debuggerinfo_struct {
int ret_state;
int irq_depth;
int enter_kgdb;
+ bool rounding_up;
};

extern struct debuggerinfo_struct kgdb_info[];
--
2.20.0.rc1.387.gf8505762e3-goog


2018-12-07 17:43:37

by Catalin Marinas

[permalink] [raw]
Subject: Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> Douglas Anderson (4):
> kgdb: Remove irq flags from roundup
> kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
> kgdb: Don't round up a CPU that failed rounding up before
> kdb: Don't back trace on a cpu that didn't round up

FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
disabled when they shouldn't and it trips over the BUG at
mm/vmalloc.c:1380 (called via do_fork -> copy_process).

Now, I don't think these patches make things worse on arm64 since prior
to them the kgdb boot tests on arm64 were stuck in a loop (RUN
singlestep).

--
Catalin

2018-12-07 18:43:22

by Doug Anderson

[permalink] [raw]
Subject: Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

Hi,

On Fri, Dec 7, 2018 at 9:42 AM Catalin Marinas <[email protected]> wrote:
>
> On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> > Douglas Anderson (4):
> > kgdb: Remove irq flags from roundup
> > kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
> > kgdb: Don't round up a CPU that failed rounding up before
> > kdb: Don't back trace on a cpu that didn't round up
>
> FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
> on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
> disabled when they shouldn't and it trips over the BUG at
> mm/vmalloc.c:1380 (called via do_fork -> copy_process).
>
> Now, I don't think these patches make things worse on arm64 since prior
> to them the kgdb boot tests on arm64 were stuck in a loop (RUN
> singlestep).

Thanks for the report! ...actually, I'd never tried CONFIG_KGDB_TESTS
before. ...so I tried them now:

A) chromeos-4.19 tree on qcom-sdm845 without this series: booted up OK
B) chromeos-4.19 tree on qcom-sdm845 with this series: booted up OK
C) v4.20-rc5-90-g30002dd008ed on rockchip-rk3399 (kevin) with this
series: booted up OK

Example output from B) above:

localhost ~ # dmesg | grep kgdbts
[ 2.139814] KGDB: Registered I/O driver kgdbts
[ 2.144582] kgdbts:RUN plant and detach test
[ 2.165333] kgdbts:RUN sw breakpoint test
[ 2.172990] kgdbts:RUN bad memory access test
[ 2.178640] kgdbts:RUN singlestep test 1000 iterations
[ 2.187765] kgdbts:RUN singlestep [0/1000]
[ 2.559596] kgdbts:RUN singlestep [100/1000]
[ 2.931419] kgdbts:RUN singlestep [200/1000]
[ 3.303474] kgdbts:RUN singlestep [300/1000]
[ 3.675121] kgdbts:RUN singlestep [400/1000]
[ 4.046867] kgdbts:RUN singlestep [500/1000]
[ 4.418920] kgdbts:RUN singlestep [600/1000]
[ 4.790824] kgdbts:RUN singlestep [700/1000]
[ 5.162479] kgdbts:RUN singlestep [800/1000]
[ 5.534103] kgdbts:RUN singlestep [900/1000]
[ 5.902299] kgdbts:RUN do_fork for 100 breakpoints
[ 8.463900] KGDB: Unregistered I/O driver kgdbts, debugger disabled

...so I guess I'm a little confused. Either I have a different config
than you do or something is special about your machine?


NOTE: In general I've never considered "single step" as reliable in
kgdb. I mostly use kgdb as "after the fact" crash debugging to
analyze local variables / memory / other tasks. If it worked that'd
actually be kinda great, but at least when I started using kgdb years
ago I learned that it didn't work and stopped trying...


-Doug

2018-12-10 16:08:03

by Catalin Marinas

[permalink] [raw]
Subject: Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

Hi Doug,

On Fri, Dec 07, 2018 at 10:40:24AM -0800, Doug Anderson wrote:
> On Fri, Dec 7, 2018 at 9:42 AM Catalin Marinas <[email protected]> wrote:
> > On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> > > Douglas Anderson (4):
> > > kgdb: Remove irq flags from roundup
> > > kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
> > > kgdb: Don't round up a CPU that failed rounding up before
> > > kdb: Don't back trace on a cpu that didn't round up
> >
> > FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
> > on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
> > disabled when they shouldn't and it trips over the BUG at
> > mm/vmalloc.c:1380 (called via do_fork -> copy_process).
> >
> > Now, I don't think these patches make things worse on arm64 since prior
> > to them the kgdb boot tests on arm64 were stuck in a loop (RUN
> > singlestep).
>
> Thanks for the report! ...actually, I'd never tried CONFIG_KGDB_TESTS
> before. ...so I tried them now:
>
> A) chromeos-4.19 tree on qcom-sdm845 without this series: booted up OK
> B) chromeos-4.19 tree on qcom-sdm845 with this series: booted up OK
> C) v4.20-rc5-90-g30002dd008ed on rockchip-rk3399 (kevin) with this
> series: booted up OK
>
> Example output from B) above:
>
> localhost ~ # dmesg | grep kgdbts
> [ 2.139814] KGDB: Registered I/O driver kgdbts
> [ 2.144582] kgdbts:RUN plant and detach test
> [ 2.165333] kgdbts:RUN sw breakpoint test
> [ 2.172990] kgdbts:RUN bad memory access test
> [ 2.178640] kgdbts:RUN singlestep test 1000 iterations
> [ 2.187765] kgdbts:RUN singlestep [0/1000]
> [ 2.559596] kgdbts:RUN singlestep [100/1000]
> [ 2.931419] kgdbts:RUN singlestep [200/1000]
> [ 3.303474] kgdbts:RUN singlestep [300/1000]
> [ 3.675121] kgdbts:RUN singlestep [400/1000]
> [ 4.046867] kgdbts:RUN singlestep [500/1000]
> [ 4.418920] kgdbts:RUN singlestep [600/1000]
> [ 4.790824] kgdbts:RUN singlestep [700/1000]
> [ 5.162479] kgdbts:RUN singlestep [800/1000]
> [ 5.534103] kgdbts:RUN singlestep [900/1000]
> [ 5.902299] kgdbts:RUN do_fork for 100 breakpoints
> [ 8.463900] KGDB: Unregistered I/O driver kgdbts, debugger disabled
>
> ...so I guess I'm a little confused. Either I have a different config
> than you do or something is special about your machine?

I tried it now on a Juno board both as a host and a guest and boots
fine. It must be something that only triggers ThunderX2. Ignore the
report for now, if I find anything interesting I'll let you know.

--
Catalin

2018-12-19 18:36:57

by Daniel Thompson

[permalink] [raw]
Subject: Re: [REPOST PATCH v6 1/4] kgdb: Remove irq flags from roundup

On Tue, Dec 04, 2018 at 07:38:25PM -0800, Douglas Anderson wrote:
> The function kgdb_roundup_cpus() was passed a parameter that was
> documented as:
>
> > the flags that will be used when restoring the interrupts. There is
> > local_irq_save() call before kgdb_roundup_cpus().
>
> Nobody used those flags. Anyone who wanted to temporarily turn on
> interrupts just did local_irq_enable() and local_irq_disable() without
> looking at them. So we can definitely remove the flags.
>
> Signed-off-by: Douglas Anderson <[email protected]>
> Cc: Vineet Gupta <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Richard Kuo <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Paul Burton <[email protected]>
> Cc: James Hogan <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Yoshinori Sato <[email protected]>
> Cc: Rich Felker <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Acked-by: Will Deacon <[email protected]>

I've not heard any objections from the arch/ maintainers so...

Applied! Thanks.


> ---
>
> Changes in v6: None
> Changes in v5: None
> Changes in v4: None
> Changes in v3: None
> Changes in v2:
> - Removing irq flags separated from fixing lockdep splat.
>
> arch/arc/kernel/kgdb.c | 2 +-
> arch/arm/kernel/kgdb.c | 2 +-
> arch/arm64/kernel/kgdb.c | 2 +-
> arch/hexagon/kernel/kgdb.c | 9 ++-------
> arch/mips/kernel/kgdb.c | 2 +-
> arch/powerpc/kernel/kgdb.c | 2 +-
> arch/sh/kernel/kgdb.c | 2 +-
> arch/sparc/kernel/smp_64.c | 2 +-
> arch/x86/kernel/kgdb.c | 9 ++-------
> include/linux/kgdb.h | 9 ++-------
> kernel/debug/debug_core.c | 2 +-
> 11 files changed, 14 insertions(+), 29 deletions(-)
>
> diff --git a/arch/arc/kernel/kgdb.c b/arch/arc/kernel/kgdb.c
> index 9a3c34af2ae8..0932851028e0 100644
> --- a/arch/arc/kernel/kgdb.c
> +++ b/arch/arc/kernel/kgdb.c
> @@ -197,7 +197,7 @@ static void kgdb_call_nmi_hook(void *ignored)
> kgdb_nmicallback(raw_smp_processor_id(), NULL);
> }
>
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> local_irq_enable();
> smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> diff --git a/arch/arm/kernel/kgdb.c b/arch/arm/kernel/kgdb.c
> index caa0dbe3dc61..f21077b077be 100644
> --- a/arch/arm/kernel/kgdb.c
> +++ b/arch/arm/kernel/kgdb.c
> @@ -175,7 +175,7 @@ static void kgdb_call_nmi_hook(void *ignored)
> kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> }
>
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> local_irq_enable();
> smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c
> index a20de58061a8..12c339ff6e75 100644
> --- a/arch/arm64/kernel/kgdb.c
> +++ b/arch/arm64/kernel/kgdb.c
> @@ -289,7 +289,7 @@ static void kgdb_call_nmi_hook(void *ignored)
> kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> }
>
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> local_irq_enable();
> smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> diff --git a/arch/hexagon/kernel/kgdb.c b/arch/hexagon/kernel/kgdb.c
> index 16c24b22d0b2..012e0e230ac2 100644
> --- a/arch/hexagon/kernel/kgdb.c
> +++ b/arch/hexagon/kernel/kgdb.c
> @@ -119,17 +119,12 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long pc)
>
> /**
> * kgdb_roundup_cpus - Get other CPUs into a holding pattern
> - * @flags: Current IRQ state
> *
> * On SMP systems, we need to get the attention of the other CPUs
> * and get them be in a known state. This should do what is needed
> * to get the other CPUs to call kgdb_wait(). Note that on some arches,
> * the NMI approach is not used for rounding up all the CPUs. For example,
> - * in case of MIPS, smp_call_function() is used to roundup CPUs. In
> - * this case, we have to make sure that interrupts are enabled before
> - * calling smp_call_function(). The argument to this function is
> - * the flags that will be used when restoring the interrupts. There is
> - * local_irq_save() call before kgdb_roundup_cpus().
> + * in case of MIPS, smp_call_function() is used to roundup CPUs.
> *
> * On non-SMP systems, this is not called.
> */
> @@ -139,7 +134,7 @@ static void hexagon_kgdb_nmi_hook(void *ignored)
> kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> }
>
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> local_irq_enable();
> smp_call_function(hexagon_kgdb_nmi_hook, NULL, 0);
> diff --git a/arch/mips/kernel/kgdb.c b/arch/mips/kernel/kgdb.c
> index eb6c0d582626..2b05effc17b4 100644
> --- a/arch/mips/kernel/kgdb.c
> +++ b/arch/mips/kernel/kgdb.c
> @@ -219,7 +219,7 @@ static void kgdb_call_nmi_hook(void *ignored)
> set_fs(old_fs);
> }
>
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> local_irq_enable();
> smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
> index 59c578f865aa..b0e804844be0 100644
> --- a/arch/powerpc/kernel/kgdb.c
> +++ b/arch/powerpc/kernel/kgdb.c
> @@ -124,7 +124,7 @@ static int kgdb_call_nmi_hook(struct pt_regs *regs)
> }
>
> #ifdef CONFIG_SMP
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> smp_send_debugger_break();
> }
> diff --git a/arch/sh/kernel/kgdb.c b/arch/sh/kernel/kgdb.c
> index 4f04c6638a4d..cc57630f6bf2 100644
> --- a/arch/sh/kernel/kgdb.c
> +++ b/arch/sh/kernel/kgdb.c
> @@ -319,7 +319,7 @@ static void kgdb_call_nmi_hook(void *ignored)
> kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> }
>
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> local_irq_enable();
> smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index 4792e08ad36b..f45d876983f1 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -1014,7 +1014,7 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
> }
>
> #ifdef CONFIG_KGDB
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> smp_cross_call(&xcall_kgdb_capture, 0, 0, 0);
> }
> diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
> index 8e36f249646e..ac6291a4178d 100644
> --- a/arch/x86/kernel/kgdb.c
> +++ b/arch/x86/kernel/kgdb.c
> @@ -422,21 +422,16 @@ static void kgdb_disable_hw_debug(struct pt_regs *regs)
> #ifdef CONFIG_SMP
> /**
> * kgdb_roundup_cpus - Get other CPUs into a holding pattern
> - * @flags: Current IRQ state
> *
> * On SMP systems, we need to get the attention of the other CPUs
> * and get them be in a known state. This should do what is needed
> * to get the other CPUs to call kgdb_wait(). Note that on some arches,
> * the NMI approach is not used for rounding up all the CPUs. For example,
> - * in case of MIPS, smp_call_function() is used to roundup CPUs. In
> - * this case, we have to make sure that interrupts are enabled before
> - * calling smp_call_function(). The argument to this function is
> - * the flags that will be used when restoring the interrupts. There is
> - * local_irq_save() call before kgdb_roundup_cpus().
> + * in case of MIPS, smp_call_function() is used to roundup CPUs.
> *
> * On non-SMP systems, this is not called.
> */
> -void kgdb_roundup_cpus(unsigned long flags)
> +void kgdb_roundup_cpus(void)
> {
> apic->send_IPI_allbutself(APIC_DM_NMI);
> }
> diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
> index e465bb15912d..05e5b2eb0d32 100644
> --- a/include/linux/kgdb.h
> +++ b/include/linux/kgdb.h
> @@ -178,21 +178,16 @@ kgdb_arch_handle_exception(int vector, int signo, int err_code,
>
> /**
> * kgdb_roundup_cpus - Get other CPUs into a holding pattern
> - * @flags: Current IRQ state
> *
> * On SMP systems, we need to get the attention of the other CPUs
> * and get them into a known state. This should do what is needed
> * to get the other CPUs to call kgdb_wait(). Note that on some arches,
> * the NMI approach is not used for rounding up all the CPUs. For example,
> - * in case of MIPS, smp_call_function() is used to roundup CPUs. In
> - * this case, we have to make sure that interrupts are enabled before
> - * calling smp_call_function(). The argument to this function is
> - * the flags that will be used when restoring the interrupts. There is
> - * local_irq_save() call before kgdb_roundup_cpus().
> + * in case of MIPS, smp_call_function() is used to roundup CPUs.
> *
> * On non-SMP systems, this is not called.
> */
> -extern void kgdb_roundup_cpus(unsigned long flags);
> +extern void kgdb_roundup_cpus(void);
>
> /**
> * kgdb_arch_set_pc - Generic call back to the program counter
> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index 65c0f1363788..f3cadda45f07 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -593,7 +593,7 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
>
> /* Signal the other CPUs to enter kgdb_wait() */
> else if ((!kgdb_single_step) && kgdb_do_roundup)
> - kgdb_roundup_cpus(flags);
> + kgdb_roundup_cpus();
> #endif
>
> /*
> --
> 2.20.0.rc1.387.gf8505762e3-goog
>

2018-12-19 20:32:50

by Daniel Thompson

[permalink] [raw]
Subject: Re: [REPOST PATCH v6 2/4] kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()

On Tue, Dec 04, 2018 at 07:38:26PM -0800, Douglas Anderson wrote:
> When I had lockdep turned on and dropped into kgdb I got a nice splat
> on my system. Specifically it hit:
> DEBUG_LOCKS_WARN_ON(current->hardirq_context)
>
> Specifically it looked like this:
> sysrq: SysRq : DEBUG
> ------------[ cut here ]------------
> DEBUG_LOCKS_WARN_ON(current->hardirq_context)
> WARNING: CPU: 0 PID: 0 at .../kernel/locking/lockdep.c:2875 lockdep_hardirqs_on+0xf0/0x160
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0 #27
> pstate: 604003c9 (nZCv DAIF +PAN -UAO)
> pc : lockdep_hardirqs_on+0xf0/0x160
> ...
> Call trace:
> lockdep_hardirqs_on+0xf0/0x160
> trace_hardirqs_on+0x188/0x1ac
> kgdb_roundup_cpus+0x14/0x3c
> kgdb_cpu_enter+0x53c/0x5cc
> kgdb_handle_exception+0x180/0x1d4
> kgdb_compiled_brk_fn+0x30/0x3c
> brk_handler+0x134/0x178
> do_debug_exception+0xfc/0x178
> el1_dbg+0x18/0x78
> kgdb_breakpoint+0x34/0x58
> sysrq_handle_dbg+0x54/0x5c
> __handle_sysrq+0x114/0x21c
> handle_sysrq+0x30/0x3c
> qcom_geni_serial_isr+0x2dc/0x30c
> ...
> ...
> irq event stamp: ...45
> hardirqs last enabled at (...44): [...] __do_softirq+0xd8/0x4e4
> hardirqs last disabled at (...45): [...] el1_irq+0x74/0x130
> softirqs last enabled at (...42): [...] _local_bh_enable+0x2c/0x34
> softirqs last disabled at (...43): [...] irq_exit+0xa8/0x100
> ---[ end trace adf21f830c46e638 ]---
>
> Looking closely at it, it seems like a really bad idea to be calling
> local_irq_enable() in kgdb_roundup_cpus(). If nothing else that seems
> like it could violate spinlock semantics and cause a deadlock.
>
> Instead, let's use a private csd alongside
> smp_call_function_single_async() to round up the other CPUs. Using
> smp_call_function_single_async() doesn't require interrupts to be
> enabled so we can remove the offending bit of code.
>
> In order to avoid duplicating this across all the architectures that
> use the default kgdb_roundup_cpus(), we'll add a "weak" implementation
> to debug_core.c.
>
> Looking at all the people who previously had copies of this code,
> there were a few variants. I've attempted to keep the variants
> working like they used to. Specifically:
> * For arch/arc we passed NULL to kgdb_nmicallback() instead of
> get_irq_regs().
> * For arch/mips there was a bit of extra code around
> kgdb_nmicallback()
>
> NOTE: In this patch we will still get into trouble if we try to round
> up a CPU that failed to round up before. We'll try to round it up
> again and potentially hang when we try to grab the csd lock. That's
> not new behavior but we'll still try to do better in a future patch.
>
> Suggested-by: Daniel Thompson <[email protected]>
> Signed-off-by: Douglas Anderson <[email protected]>
> Cc: Vineet Gupta <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Richard Kuo <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Paul Burton <[email protected]>
> Cc: James Hogan <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Yoshinori Sato <[email protected]>
> Cc: Rich Felker <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Acked-by: Will Deacon <[email protected]>

I've not heard any objections from the arch/ maintainers so...

Applied! Thanks.


> -
> ---
>
> Changes in v6:
> - Moved smp_call_function_single_async() error check to patch 3.
>
> Changes in v5:
> - Add a comment about get_irq_regs().
> - get_cpu() => raw_smp_processor_id() in kgdb_roundup_cpus().
> - for_each_cpu() => for_each_online_cpu()
> - Error check smp_call_function_single_async()
>
> Changes in v4: None
> Changes in v3:
> - No separate init call.
> - Don't round up the CPU that is doing the rounding up.
> - Add "#ifdef CONFIG_SMP" to match the rest of the file.
> - Updated desc saying we don't solve the "failed to roundup" case.
> - Document the ignored parameter.
>
> Changes in v2:
> - Removing irq flags separated from fixing lockdep splat.
> - Don't use smp_call_function (Daniel).
>
> arch/arc/kernel/kgdb.c | 10 ++--------
> arch/arm/kernel/kgdb.c | 12 -----------
> arch/arm64/kernel/kgdb.c | 12 -----------
> arch/hexagon/kernel/kgdb.c | 27 -------------------------
> arch/mips/kernel/kgdb.c | 9 +--------
> arch/powerpc/kernel/kgdb.c | 4 ++--
> arch/sh/kernel/kgdb.c | 12 -----------
> include/linux/kgdb.h | 15 ++++++++++++--
> kernel/debug/debug_core.c | 41 ++++++++++++++++++++++++++++++++++++++
> 9 files changed, 59 insertions(+), 83 deletions(-)
>
> diff --git a/arch/arc/kernel/kgdb.c b/arch/arc/kernel/kgdb.c
> index 0932851028e0..68d9fe4b5aa7 100644
> --- a/arch/arc/kernel/kgdb.c
> +++ b/arch/arc/kernel/kgdb.c
> @@ -192,18 +192,12 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long ip)
> instruction_pointer(regs) = ip;
> }
>
> -static void kgdb_call_nmi_hook(void *ignored)
> +void kgdb_call_nmi_hook(void *ignored)
> {
> + /* Default implementation passes get_irq_regs() but we don't */
> kgdb_nmicallback(raw_smp_processor_id(), NULL);
> }
>
> -void kgdb_roundup_cpus(void)
> -{
> - local_irq_enable();
> - smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> - local_irq_disable();
> -}
> -
> struct kgdb_arch arch_kgdb_ops = {
> /* breakpoint instruction: TRAP_S 0x3 */
> #ifdef CONFIG_CPU_BIG_ENDIAN
> diff --git a/arch/arm/kernel/kgdb.c b/arch/arm/kernel/kgdb.c
> index f21077b077be..d9a69e941463 100644
> --- a/arch/arm/kernel/kgdb.c
> +++ b/arch/arm/kernel/kgdb.c
> @@ -170,18 +170,6 @@ static struct undef_hook kgdb_compiled_brkpt_hook = {
> .fn = kgdb_compiled_brk_fn
> };
>
> -static void kgdb_call_nmi_hook(void *ignored)
> -{
> - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> -}
> -
> -void kgdb_roundup_cpus(void)
> -{
> - local_irq_enable();
> - smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> - local_irq_disable();
> -}
> -
> static int __kgdb_notify(struct die_args *args, unsigned long cmd)
> {
> struct pt_regs *regs = args->regs;
> diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c
> index 12c339ff6e75..da880247c734 100644
> --- a/arch/arm64/kernel/kgdb.c
> +++ b/arch/arm64/kernel/kgdb.c
> @@ -284,18 +284,6 @@ static struct step_hook kgdb_step_hook = {
> .fn = kgdb_step_brk_fn
> };
>
> -static void kgdb_call_nmi_hook(void *ignored)
> -{
> - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> -}
> -
> -void kgdb_roundup_cpus(void)
> -{
> - local_irq_enable();
> - smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> - local_irq_disable();
> -}
> -
> static int __kgdb_notify(struct die_args *args, unsigned long cmd)
> {
> struct pt_regs *regs = args->regs;
> diff --git a/arch/hexagon/kernel/kgdb.c b/arch/hexagon/kernel/kgdb.c
> index 012e0e230ac2..b95d12038a4e 100644
> --- a/arch/hexagon/kernel/kgdb.c
> +++ b/arch/hexagon/kernel/kgdb.c
> @@ -115,33 +115,6 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long pc)
> instruction_pointer(regs) = pc;
> }
>
> -#ifdef CONFIG_SMP
> -
> -/**
> - * kgdb_roundup_cpus - Get other CPUs into a holding pattern
> - *
> - * On SMP systems, we need to get the attention of the other CPUs
> - * and get them be in a known state. This should do what is needed
> - * to get the other CPUs to call kgdb_wait(). Note that on some arches,
> - * the NMI approach is not used for rounding up all the CPUs. For example,
> - * in case of MIPS, smp_call_function() is used to roundup CPUs.
> - *
> - * On non-SMP systems, this is not called.
> - */
> -
> -static void hexagon_kgdb_nmi_hook(void *ignored)
> -{
> - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> -}
> -
> -void kgdb_roundup_cpus(void)
> -{
> - local_irq_enable();
> - smp_call_function(hexagon_kgdb_nmi_hook, NULL, 0);
> - local_irq_disable();
> -}
> -#endif
> -
>
> /* Not yet working */
> void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs,
> diff --git a/arch/mips/kernel/kgdb.c b/arch/mips/kernel/kgdb.c
> index 2b05effc17b4..42f057a6c215 100644
> --- a/arch/mips/kernel/kgdb.c
> +++ b/arch/mips/kernel/kgdb.c
> @@ -207,7 +207,7 @@ void arch_kgdb_breakpoint(void)
> ".set\treorder");
> }
>
> -static void kgdb_call_nmi_hook(void *ignored)
> +void kgdb_call_nmi_hook(void *ignored)
> {
> mm_segment_t old_fs;
>
> @@ -219,13 +219,6 @@ static void kgdb_call_nmi_hook(void *ignored)
> set_fs(old_fs);
> }
>
> -void kgdb_roundup_cpus(void)
> -{
> - local_irq_enable();
> - smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> - local_irq_disable();
> -}
> -
> static int compute_signal(int tt)
> {
> struct hard_trap_info *ht;
> diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
> index b0e804844be0..b4ce54d73337 100644
> --- a/arch/powerpc/kernel/kgdb.c
> +++ b/arch/powerpc/kernel/kgdb.c
> @@ -117,7 +117,7 @@ int kgdb_skipexception(int exception, struct pt_regs *regs)
> return kgdb_isremovedbreak(regs->nip);
> }
>
> -static int kgdb_call_nmi_hook(struct pt_regs *regs)
> +static int kgdb_debugger_ipi(struct pt_regs *regs)
> {
> kgdb_nmicallback(raw_smp_processor_id(), regs);
> return 0;
> @@ -502,7 +502,7 @@ int kgdb_arch_init(void)
> old__debugger_break_match = __debugger_break_match;
> old__debugger_fault_handler = __debugger_fault_handler;
>
> - __debugger_ipi = kgdb_call_nmi_hook;
> + __debugger_ipi = kgdb_debugger_ipi;
> __debugger = kgdb_debugger;
> __debugger_bpt = kgdb_handle_breakpoint;
> __debugger_sstep = kgdb_singlestep;
> diff --git a/arch/sh/kernel/kgdb.c b/arch/sh/kernel/kgdb.c
> index cc57630f6bf2..14e012ad7c57 100644
> --- a/arch/sh/kernel/kgdb.c
> +++ b/arch/sh/kernel/kgdb.c
> @@ -314,18 +314,6 @@ BUILD_TRAP_HANDLER(singlestep)
> local_irq_restore(flags);
> }
>
> -static void kgdb_call_nmi_hook(void *ignored)
> -{
> - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> -}
> -
> -void kgdb_roundup_cpus(void)
> -{
> - local_irq_enable();
> - smp_call_function(kgdb_call_nmi_hook, NULL, 0);
> - local_irq_disable();
> -}
> -
> static int __kgdb_notify(struct die_args *args, unsigned long cmd)
> {
> int ret;
> diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
> index 05e5b2eb0d32..24422865cd18 100644
> --- a/include/linux/kgdb.h
> +++ b/include/linux/kgdb.h
> @@ -176,14 +176,25 @@ kgdb_arch_handle_exception(int vector, int signo, int err_code,
> char *remcom_out_buffer,
> struct pt_regs *regs);
>
> +/**
> + * kgdb_call_nmi_hook - Call kgdb_nmicallback() on the current CPU
> + * @ignored: This parameter is only here to match the prototype.
> + *
> + * If you're using the default implementation of kgdb_roundup_cpus()
> + * this function will be called per CPU. If you don't implement
> + * kgdb_call_nmi_hook() a default will be used.
> + */
> +
> +extern void kgdb_call_nmi_hook(void *ignored);
> +
> /**
> * kgdb_roundup_cpus - Get other CPUs into a holding pattern
> *
> * On SMP systems, we need to get the attention of the other CPUs
> * and get them into a known state. This should do what is needed
> * to get the other CPUs to call kgdb_wait(). Note that on some arches,
> - * the NMI approach is not used for rounding up all the CPUs. For example,
> - * in case of MIPS, smp_call_function() is used to roundup CPUs.
> + * the NMI approach is not used for rounding up all the CPUs. Normally
> + * those architectures can just not implement this and get the default.
> *
> * On non-SMP systems, this is not called.
> */
> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index f3cadda45f07..10db2833a423 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -55,6 +55,7 @@
> #include <linux/mm.h>
> #include <linux/vmacache.h>
> #include <linux/rcupdate.h>
> +#include <linux/irq.h>
>
> #include <asm/cacheflush.h>
> #include <asm/byteorder.h>
> @@ -220,6 +221,46 @@ int __weak kgdb_skipexception(int exception, struct pt_regs *regs)
> return 0;
> }
>
> +#ifdef CONFIG_SMP
> +
> +/*
> + * Default (weak) implementation for kgdb_roundup_cpus
> + */
> +
> +static DEFINE_PER_CPU(call_single_data_t, kgdb_roundup_csd);
> +
> +void __weak kgdb_call_nmi_hook(void *ignored)
> +{
> + /*
> + * NOTE: get_irq_regs() is supposed to get the registers from
> + * before the IPI interrupt happened and so is supposed to
> + * show where the processor was. In some situations it's
> + * possible we might be called without an IPI, so it might be
> + * safer to figure out how to make kgdb_breakpoint() work
> + * properly here.
> + */
> + kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
> +}
> +
> +void __weak kgdb_roundup_cpus(void)
> +{
> + call_single_data_t *csd;
> + int this_cpu = raw_smp_processor_id();
> + int cpu;
> +
> + for_each_online_cpu(cpu) {
> + /* No need to roundup ourselves */
> + if (cpu == this_cpu)
> + continue;
> +
> + csd = &per_cpu(kgdb_roundup_csd, cpu);
> + csd->func = kgdb_call_nmi_hook;
> + smp_call_function_single_async(cpu, csd);
> + }
> +}
> +
> +#endif
> +
> /*
> * Some architectures need cache flushes when we set/clear a
> * breakpoint:
> --
> 2.20.0.rc1.387.gf8505762e3-goog
>

2018-12-19 21:13:58

by Daniel Thompson

[permalink] [raw]
Subject: Re: [REPOST PATCH v6 3/4] kgdb: Don't round up a CPU that failed rounding up before

On Tue, Dec 04, 2018 at 07:38:27PM -0800, Douglas Anderson wrote:
> If we're using the default implementation of kgdb_roundup_cpus() that
> uses smp_call_function_single_async() we can end up hanging
> kgdb_roundup_cpus() if we try to round up a CPU that failed to round
> up before.
>
> Specifically smp_call_function_single_async() will try to wait on the
> csd lock for the CPU that we're trying to round up. If the previous
> round up never finished then that lock could still be held and we'll
> just sit there hanging.
>
> There's not a lot of use trying to round up a CPU that failed to round
> up before. Let's keep a flag that indicates whether the CPU started
> but didn't finish to round up before. If we see that flag set then
> we'll skip the next round up.
>
> In general we have a few goals here:
> - We never want to end up calling smp_call_function_single_async()
> when the csd is still locked. This is accomplished because
> flush_smp_call_function_queue() unlocks the csd _before_ invoking
> the callback. That means that when kgdb_nmicallback() runs we know
> for sure the the csd is no longer locked. Thus when we set
> "rounding_up = false" we know for sure that the csd is unlocked.
> - If there are no timeouts rounding up we should never skip a round
> up.
>
> NOTE #1: In general trying to continue running after failing to round
> up CPUs doesn't appear to be supported in the debugger. When I
> simulate this I find that kdb reports "Catastrophic error detected"
> when I try to continue. I can overrule and continue anyway, but it
> should be noted that we may be entering the land of dragons here.
> Possibly the "Catastrophic error detected" was added _because_ of the
> future failure to round up, but even so this is an area of the code
> that hasn't been strongly tested.
>
> NOTE #2: I did a bit of testing before and after this change. I
> introduced a 10 second hang in the kernel while holding a spinlock
> that I could invoke on a certain CPU with 'taskset -c 3 cat /sys/...".
>
> Before this change if I did:
> - Invoke hang
> - Enter debugger
> - g (which warns about Catastrophic error, g again to go anyway)
> - g
> - Enter debugger
>
> ...I'd hang the rest of the 10 seconds without getting a debugger
> prompt. After this change I end up in the debugger the 2nd time after
> only 1 second with the standard warning about 'Timed out waiting for
> secondary CPUs.'
>
> I'll also note that once the CPU finished waiting I could actually
> debug it (aka "btc" worked)
>
> I won't promise that everything works perfectly if the errant CPU
> comes back at just the wrong time (like as we're entering or exiting
> the debugger) but it certainly seems like an improvement.
>
> NOTE #3: setting 'kgdb_info[cpu].rounding_up = false' is in
> kgdb_nmicallback() instead of kgdb_call_nmi_hook() because some
> implementations override kgdb_call_nmi_hook(). It shouldn't hurt to
> have it in kgdb_nmicallback() in any case.
>
> NOTE #4: this logic is really only needed because there is no API call
> like "smp_try_call_function_single_async()" or "smp_csd_is_locked()".
> If such an API existed then we'd use it instead, but it seemed a bit
> much to add an API like this just for kgdb.
>
> Signed-off-by: Douglas Anderson <[email protected]>
> Acked-by: Daniel Thompson <[email protected]>

Applied! Thanks.


> ---
>
> Changes in v6:
> - Moved smp_call_function_single_async() error check to patch 3.
>
> Changes in v5: None
> Changes in v4:
> - Removed smp_mb() calls.
>
> Changes in v3:
> - Don't round up a CPU that failed rounding up before new for v3.
>
> Changes in v2: None
>
> kernel/debug/debug_core.c | 20 +++++++++++++++++++-
> kernel/debug/debug_core.h | 1 +
> 2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index 10db2833a423..1fb8b239e567 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -247,6 +247,7 @@ void __weak kgdb_roundup_cpus(void)
> call_single_data_t *csd;
> int this_cpu = raw_smp_processor_id();
> int cpu;
> + int ret;
>
> for_each_online_cpu(cpu) {
> /* No need to roundup ourselves */
> @@ -254,8 +255,23 @@ void __weak kgdb_roundup_cpus(void)
> continue;
>
> csd = &per_cpu(kgdb_roundup_csd, cpu);
> +
> + /*
> + * If it didn't round up last time, don't try again
> + * since smp_call_function_single_async() will block.
> + *
> + * If rounding_up is false then we know that the
> + * previous call must have at least started and that
> + * means smp_call_function_single_async() won't block.
> + */
> + if (kgdb_info[cpu].rounding_up)
> + continue;
> + kgdb_info[cpu].rounding_up = true;
> +
> csd->func = kgdb_call_nmi_hook;
> - smp_call_function_single_async(cpu, csd);
> + ret = smp_call_function_single_async(cpu, csd);
> + if (ret)
> + kgdb_info[cpu].rounding_up = false;
> }
> }
>
> @@ -788,6 +804,8 @@ int kgdb_nmicallback(int cpu, void *regs)
> struct kgdb_state kgdb_var;
> struct kgdb_state *ks = &kgdb_var;
>
> + kgdb_info[cpu].rounding_up = false;
> +
> memset(ks, 0, sizeof(struct kgdb_state));
> ks->cpu = cpu;
> ks->linux_regs = regs;
> diff --git a/kernel/debug/debug_core.h b/kernel/debug/debug_core.h
> index 127d9bc49fb4..b4a7c326d546 100644
> --- a/kernel/debug/debug_core.h
> +++ b/kernel/debug/debug_core.h
> @@ -42,6 +42,7 @@ struct debuggerinfo_struct {
> int ret_state;
> int irq_depth;
> int enter_kgdb;
> + bool rounding_up;
> };
>
> extern struct debuggerinfo_struct kgdb_info[];
> --
> 2.20.0.rc1.387.gf8505762e3-goog
>

2018-12-19 21:22:24

by Daniel Thompson

[permalink] [raw]
Subject: Re: [REPOST PATCH v6 4/4] kdb: Don't back trace on a cpu that didn't round up

On Tue, Dec 04, 2018 at 07:38:28PM -0800, Douglas Anderson wrote:
> If you have a CPU that fails to round up and then run 'btc' you'll end
> up crashing in kdb becaue we dereferenced NULL. Let's add a check.
> It's wise to also set the task to NULL when leaving the debugger so
> that if we fail to round up on a later entry into the debugger we
> won't backtrace a stale task.
>
> Signed-off-by: Douglas Anderson <[email protected]>
> Acked-by: Daniel Thompson <[email protected]>

Applied! Thanks.


> ---
>
> Changes in v6: None
> Changes in v5: None
> Changes in v4:
> - Also clear out .debuggerinfo.
> - Also clear out .debuggerinfo and .task for the master.
> - Remove clearing out in kdb_stub for offline CPUs; it's now redundant.
>
> Changes in v3:
> - Don't back trace on a cpu that didn't round up new for v3.
>
> Changes in v2: None
>
> kernel/debug/debug_core.c | 4 ++++
> kernel/debug/kdb/kdb_bt.c | 11 ++++++++++-
> kernel/debug/kdb/kdb_debugger.c | 7 -------
> 3 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index 1fb8b239e567..5cc608de6883 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -592,6 +592,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> arch_kgdb_ops.correct_hw_break();
> if (trace_on)
> tracing_on();
> + kgdb_info[cpu].debuggerinfo = NULL;
> + kgdb_info[cpu].task = NULL;
> kgdb_info[cpu].exception_state &=
> ~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
> kgdb_info[cpu].enter_kgdb--;
> @@ -724,6 +726,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> if (trace_on)
> tracing_on();
>
> + kgdb_info[cpu].debuggerinfo = NULL;
> + kgdb_info[cpu].task = NULL;
> kgdb_info[cpu].exception_state &=
> ~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
> kgdb_info[cpu].enter_kgdb--;
> diff --git a/kernel/debug/kdb/kdb_bt.c b/kernel/debug/kdb/kdb_bt.c
> index 7921ae4fca8d..7e2379aa0a1e 100644
> --- a/kernel/debug/kdb/kdb_bt.c
> +++ b/kernel/debug/kdb/kdb_bt.c
> @@ -186,7 +186,16 @@ kdb_bt(int argc, const char **argv)
> kdb_printf("btc: cpu status: ");
> kdb_parse("cpu\n");
> for_each_online_cpu(cpu) {
> - sprintf(buf, "btt 0x%px\n", KDB_TSK(cpu));
> + void *kdb_tsk = KDB_TSK(cpu);
> +
> + /* If a CPU failed to round up we could be here */
> + if (!kdb_tsk) {
> + kdb_printf("WARNING: no task for cpu %ld\n",
> + cpu);
> + continue;
> + }
> +
> + sprintf(buf, "btt 0x%px\n", kdb_tsk);
> kdb_parse(buf);
> touch_nmi_watchdog();
> }
> diff --git a/kernel/debug/kdb/kdb_debugger.c b/kernel/debug/kdb/kdb_debugger.c
> index 15e1a7af5dd0..53a0df6e4d92 100644
> --- a/kernel/debug/kdb/kdb_debugger.c
> +++ b/kernel/debug/kdb/kdb_debugger.c
> @@ -118,13 +118,6 @@ int kdb_stub(struct kgdb_state *ks)
> kdb_bp_remove();
> KDB_STATE_CLEAR(DOING_SS);
> KDB_STATE_SET(PAGER);
> - /* zero out any offline cpu data */
> - for_each_present_cpu(i) {
> - if (!cpu_online(i)) {
> - kgdb_info[i].debuggerinfo = NULL;
> - kgdb_info[i].task = NULL;
> - }
> - }
> if (ks->err_code == DIE_OOPS || reason == KDB_REASON_OOPS) {
> ks->pass_exception = 1;
> KDB_FLAG_SET(CATASTROPHIC);
> --
> 2.20.0.rc1.387.gf8505762e3-goog
>