2010-01-26 04:27:21

by Jason Wessel

[permalink] [raw]
Subject: [PATCH 0/4] kgdb regression fixes for 2.6.33

I would like to get acks from the respective parties to fix these
reported regressions against kgdb in 2.6.33.

I imagine the constraints for the hw breakpoint API are possibly still
a dicey issue, so it is split into its own patch. Even without the
constraints patch it is possible to use hw breakpoints in the kernel
debugger in the same manner that has existed since 2.6.26 (only kgdb
gets to use hw breakpoints).

The regressions are:
* hw breakpoints no longer work on x86 after the perf API merge
* clocksource watchdog can deadlock while in the kernel debugger
* softlockup watchdog can reboot the system while using the kernel debugger

I collected all the patches at the following location; they could go
into the tip branch or the kgdb for_linus branch depending on the
status of the discussion that ensues.

git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb.git for_ingo

Thanks,
Jason.


---
The following changes since commit 92dcffb916d309aa01778bf8963a6932e4014d07:
Linus Torvalds (1):
Linux 2.6.33-rc5

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb.git for_ingo

Jason Wessel (4):
x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API
perf,hw_breakpoint: add lockless reservation for hw_breaks
kgdb,clocksource: Prevent kernel hang in kernel debugger
softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume

arch/x86/kernel/hw_breakpoint.c | 5 +-
arch/x86/kernel/kgdb.c | 216 ++++++++++++++++++++++++++++-----------
include/linux/perf_event.h | 1 +
include/linux/sched.h | 4 +
kernel/hw_breakpoint.c | 16 +++
kernel/kgdb.c | 9 +-
kernel/softlockup.c | 15 +++
kernel/time/clocksource.c | 7 +-
8 files changed, 205 insertions(+), 68 deletions(-)


2010-01-26 04:27:56

by Jason Wessel

[permalink] [raw]
Subject: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4

Spin locks were added to clocksource_resume_watchdog(), which causes
the kernel debugger to frequently deadlock on an SMP system.

The kernel debugger can try for the lock, but if it fails it should
continue to touch the clocksource watchdog anyway; otherwise the
watchdog will trip if general kernel execution has been paused for too
long.

This introduces a possible race condition where the kernel debugger
might not process the list correctly if a clocksource is being added
or removed at the time of the call. This race is sufficiently rare to
be acceptable versus having the kernel debugger hang the kernel.

CC: Thomas Gleixner <[email protected]>
CC: Martin Schwidefsky <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Magnus Damm <[email protected]>
Signed-off-by: Jason Wessel <[email protected]>
---
kernel/time/clocksource.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index e85c234..74f9ba6 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -463,7 +463,12 @@ void clocksource_resume(void)
*/
void clocksource_touch_watchdog(void)
{
- clocksource_resume_watchdog();
+ unsigned long flags;
+
+ int got_lock = spin_trylock_irqsave(&watchdog_lock, flags);
+ clocksource_reset_watchdog();
+ if (got_lock)
+ spin_unlock_irqrestore(&watchdog_lock, flags);
}

/**
--
1.6.3.3

2010-01-26 04:27:32

by Jason Wessel

[permalink] [raw]
Subject: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

In the 2.6.33 kernel, the hw_breakpoint API is now used for the
performance event counters. The hw_breakpoint_handler() now consumes
the hw breakpoints that were previously set by kgdb arch specific
code. In order for kgdb to work in conjunction with this core API
change, kgdb must use some of the low level functions of the
hw_breakpoint API to install, uninstall, and receive callbacks for hw
breakpoints.

The kgdb core needs to call kgdb_disable_hw_debug anytime a slave cpu
enters kgdb_wait() in order to keep all the hw breakpoints in sync as
well as to prevent hitting a hw breakpoint while kgdb is active.

During the architecture specific initialization of kgdb, it will
pre-allocate 4 disabled (struct perf_event **) structures. Kgdb will
use these to manage the capabilities for the 4 hw breakpoint
registers. Right now the hw_breakpoint API does not have a way to ask
how many breakpoints are available on each CPU, so it is possible that
the install of a breakpoint might fail when kgdb restores the system
to the run state. The intent of this patch is to first get the basic
functionality of hw breakpoints working and leave it to the person
debugging the kernel to understand what hw breakpoints are in use and
what restrictions have been imposed as a result.

While atomic, the x86 specific kgdb code will call
arch_uninstall_hw_breakpoint() and arch_install_hw_breakpoint() to
manage the cpu specific hw breakpoints.

The arch specific hw_breakpoint_handler() was changed to restore the
cpu specific dr7 instead of the dr7 that was locally saved, because
the dr7 can be modified while in a callback to kgdb.

The net result of these changes allows kgdb to use the same pool of
hw_breakpoints that are used by the perf event API, but neither knows
about future reservations for the available hw breakpoint slots.

CC: Frederic Weisbecker <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: K.Prasad <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Alan Stern <[email protected]>
Signed-off-by: Jason Wessel <[email protected]>
---
arch/x86/kernel/hw_breakpoint.c | 5 +-
arch/x86/kernel/kgdb.c | 206 ++++++++++++++++++++++++++++-----------
kernel/kgdb.c | 3 +
3 files changed, 152 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 05d5fec..cbf19e0 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -466,7 +466,7 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
{
int i, cpu, rc = NOTIFY_STOP;
struct perf_event *bp;
- unsigned long dr7, dr6;
+ unsigned long dr6;
unsigned long *dr6_p;

/* The DR6 value is pointed by args->err */
@@ -477,7 +477,6 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
if ((dr6 & DR_TRAP_BITS) == 0)
return NOTIFY_DONE;

- get_debugreg(dr7, 7);
/* Disable breakpoints during exception handling */
set_debugreg(0UL, 7);
/*
@@ -525,7 +524,7 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
if (dr6 & (~DR_TRAP_BITS))
rc = NOTIFY_DONE;

- set_debugreg(dr7, 7);
+ set_debugreg(__get_cpu_var(cpu_dr7), 7);
put_cpu();

return rc;
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index dd74fe7..3cb2828 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -42,6 +42,7 @@
#include <linux/init.h>
#include <linux/smp.h>
#include <linux/nmi.h>
+#include <linux/hw_breakpoint.h>

#include <asm/debugreg.h>
#include <asm/apicdef.h>
@@ -204,40 +205,38 @@ void gdb_regs_to_pt_regs(unsigned long *gdb_regs, struct pt_regs *regs)

static struct hw_breakpoint {
unsigned enabled;
- unsigned type;
- unsigned len;
unsigned long addr;
+ int len;
+ int type;
+ struct perf_event **pev;
} breakinfo[4];

static void kgdb_correct_hw_break(void)
{
- unsigned long dr7;
- int correctit = 0;
- int breakbit;
int breakno;

- get_debugreg(dr7, 7);
for (breakno = 0; breakno < 4; breakno++) {
- breakbit = 2 << (breakno << 1);
- if (!(dr7 & breakbit) && breakinfo[breakno].enabled) {
- correctit = 1;
- dr7 |= breakbit;
- dr7 &= ~(0xf0000 << (breakno << 2));
- dr7 |= ((breakinfo[breakno].len << 2) |
- breakinfo[breakno].type) <<
- ((breakno << 2) + 16);
- set_debugreg(breakinfo[breakno].addr, breakno);
-
- } else {
- if ((dr7 & breakbit) && !breakinfo[breakno].enabled) {
- correctit = 1;
- dr7 &= ~breakbit;
- dr7 &= ~(0xf0000 << (breakno << 2));
- }
- }
+ struct perf_event *bp;
+ struct arch_hw_breakpoint *info;
+ int val;
+ int cpu = raw_smp_processor_id();
+ if (!breakinfo[breakno].enabled)
+ continue;
+ bp = *per_cpu_ptr(breakinfo[breakno].pev, cpu);
+ info = counter_arch_bp(bp);
+ if (bp->attr.disabled != 1)
+ continue;
+ bp->attr.bp_addr = breakinfo[breakno].addr;
+ bp->attr.bp_len = breakinfo[breakno].len;
+ bp->attr.bp_type = breakinfo[breakno].type;
+ info->address = breakinfo[breakno].addr;
+ info->len = breakinfo[breakno].len;
+ info->type = breakinfo[breakno].type;
+ val = arch_install_hw_breakpoint(bp);
+ if (!val)
+ bp->attr.disabled = 0;
}
- if (correctit)
- set_debugreg(dr7, 7);
+ hw_breakpoint_restore();
}

static int
@@ -259,15 +258,23 @@ kgdb_remove_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
static void kgdb_remove_all_hw_break(void)
{
int i;
+ int cpu = raw_smp_processor_id();
+ struct perf_event *bp;

- for (i = 0; i < 4; i++)
- memset(&breakinfo[i], 0, sizeof(struct hw_breakpoint));
+ for (i = 0; i < 4; i++) {
+ if (!breakinfo[i].enabled)
+ continue;
+ bp = *per_cpu_ptr(breakinfo[i].pev, cpu);
+ if (bp->attr.disabled == 1)
+ continue;
+ arch_uninstall_hw_breakpoint(bp);
+ bp->attr.disabled = 1;
+ }
}

static int
kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
{
- unsigned type;
int i;

for (i = 0; i < 4; i++)
@@ -278,27 +285,38 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)

switch (bptype) {
case BP_HARDWARE_BREAKPOINT:
- type = 0;
- len = 1;
+ len = 1;
+ breakinfo[i].type = X86_BREAKPOINT_EXECUTE;
break;
case BP_WRITE_WATCHPOINT:
- type = 1;
+ breakinfo[i].type = X86_BREAKPOINT_WRITE;
break;
case BP_ACCESS_WATCHPOINT:
- type = 3;
+ breakinfo[i].type = X86_BREAKPOINT_RW;
break;
default:
return -1;
}
-
- if (len == 1 || len == 2 || len == 4)
- breakinfo[i].len = len - 1;
- else
- return -1;
-
- breakinfo[i].enabled = 1;
+ switch (len) {
+ case 1:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_1;
+ break;
+ case 2:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_2;
+ break;
+ case 4:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_4;
+ break;
+#ifdef CONFIG_X86_64
+ case 8:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_8;
+ break;
+#endif
+ default:
+ return -1;
+ }
breakinfo[i].addr = addr;
- breakinfo[i].type = type;
+ breakinfo[i].enabled = 1;

return 0;
}
@@ -313,8 +331,21 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
*/
void kgdb_disable_hw_debug(struct pt_regs *regs)
{
+ int i;
+ int cpu = raw_smp_processor_id();
+ struct perf_event *bp;
+
/* Disable hardware debugging while we are in kgdb: */
set_debugreg(0UL, 7);
+ for (i = 0; i < 4; i++) {
+ if (!breakinfo[i].enabled)
+ continue;
+ bp = *per_cpu_ptr(breakinfo[i].pev, cpu);
+ if (bp->attr.disabled == 1)
+ continue;
+ arch_uninstall_hw_breakpoint(bp);
+ bp->attr.disabled = 1;
+ }
}

/**
@@ -378,7 +409,6 @@ int kgdb_arch_handle_exception(int e_vector, int signo, int err_code,
struct pt_regs *linux_regs)
{
unsigned long addr;
- unsigned long dr6;
char *ptr;
int newPC;

@@ -404,20 +434,6 @@ int kgdb_arch_handle_exception(int e_vector, int signo, int err_code,
raw_smp_processor_id());
}

- get_debugreg(dr6, 6);
- if (!(dr6 & 0x4000)) {
- int breakno;
-
- for (breakno = 0; breakno < 4; breakno++) {
- if (dr6 & (1 << breakno) &&
- breakinfo[breakno].type == 0) {
- /* Set restore flag: */
- linux_regs->flags |= X86_EFLAGS_RF;
- break;
- }
- }
- }
- set_debugreg(0UL, 6);
kgdb_correct_hw_break();

return 0;
@@ -448,10 +464,12 @@ single_step_cont(struct pt_regs *regs, struct die_args *args)
}

static int was_in_debug_nmi[NR_CPUS];
+static int received_hw_brk[NR_CPUS];

static int __kgdb_notify(struct die_args *args, unsigned long cmd)
{
struct pt_regs *regs = args->regs;
+ unsigned long *dr6_p;

switch (cmd) {
case DIE_NMI:
@@ -485,16 +503,24 @@ static int __kgdb_notify(struct die_args *args, unsigned long cmd)
break;

case DIE_DEBUG:
- if (atomic_read(&kgdb_cpu_doing_single_step) ==
- raw_smp_processor_id()) {
+ dr6_p = (unsigned long *)ERR_PTR(args->err);
+ if (atomic_read(&kgdb_cpu_doing_single_step) != -1) {
+ if (dr6_p && (*dr6_p & DR_STEP) == 0)
+ return NOTIFY_DONE;
if (user_mode(regs))
return single_step_cont(regs, args);
break;
- } else if (test_thread_flag(TIF_SINGLESTEP))
+ } else if (test_thread_flag(TIF_SINGLESTEP)) {
/* This means a user thread is single stepping
* a system call which should be ignored
*/
return NOTIFY_DONE;
+ } else if (received_hw_brk[raw_smp_processor_id()] == 1) {
+ received_hw_brk[raw_smp_processor_id()] = 0;
+ return NOTIFY_STOP;
+ } else if (dr6_p && (*dr6_p & DR_TRAP_BITS) == 0) {
+ return NOTIFY_DONE;
+ }
/* fall through */
default:
if (user_mode(regs))
@@ -531,6 +557,25 @@ static struct notifier_block kgdb_notifier = {
.priority = -INT_MAX,
};

+static void kgdb_hw_bp(struct perf_event *bp, int nmi,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct die_args args;
+ int cpu = raw_smp_processor_id();
+
+ args.trapnr = 0;
+ args.signr = 5;
+ args.err = 0;
+ args.regs = regs;
+ args.str = "debug";
+ received_hw_brk[cpu] = 0;
+ if (__kgdb_notify(&args, DIE_DEBUG) == NOTIFY_STOP)
+ received_hw_brk[cpu] = 1;
+ else
+ received_hw_brk[cpu] = 0;
+}
+
/**
* kgdb_arch_init - Perform any architecture specific initialization.
*
@@ -539,7 +584,43 @@ static struct notifier_block kgdb_notifier = {
*/
int kgdb_arch_init(void)
{
- return register_die_notifier(&kgdb_notifier);
+ int i, cpu;
+ int ret;
+ struct perf_event_attr attr;
+ struct perf_event **pevent;
+
+ ret = register_die_notifier(&kgdb_notifier);
+ if (ret != 0)
+ return ret;
+ /*
+ * Pre-allocate the hw breakpoint structures in the non-atomic
+ * portion of kgdb because this operation requires mutexes to
+ * complete.
+ */
+ attr.bp_addr = (unsigned long)kgdb_arch_init;
+ attr.type = PERF_TYPE_BREAKPOINT;
+ attr.bp_len = HW_BREAKPOINT_LEN_1;
+ attr.bp_type = HW_BREAKPOINT_X;
+ attr.disabled = 1;
+ for (i = 0; i < 4; i++) {
+ breakinfo[i].pev = register_wide_hw_breakpoint(&attr,
+ kgdb_hw_bp);
+ if (IS_ERR(breakinfo[i].pev)) {
+ printk(KERN_ERR "kgdb: Could not allocate hw breakpoints\n");
+ breakinfo[i].pev = NULL;
+ kgdb_arch_exit();
+ return -1;
+ }
+ for_each_online_cpu(cpu) {
+ pevent = per_cpu_ptr(breakinfo[i].pev, cpu);
+ pevent[0]->hw.sample_period = 1;
+ if (pevent[0]->destroy != NULL) {
+ pevent[0]->destroy = NULL;
+ release_bp_slot(*pevent);
+ }
+ }
+ }
+ return ret;
}

/**
@@ -550,6 +631,13 @@ int kgdb_arch_init(void)
*/
void kgdb_arch_exit(void)
{
+ int i;
+ for (i = 0; i < 4; i++) {
+ if (breakinfo[i].pev) {
+ unregister_wide_hw_breakpoint(breakinfo[i].pev);
+ breakinfo[i].pev = NULL;
+ }
+ }
unregister_die_notifier(&kgdb_notifier);
}

diff --git a/kernel/kgdb.c b/kernel/kgdb.c
index 2eb517e..c7ade62 100644
--- a/kernel/kgdb.c
+++ b/kernel/kgdb.c
@@ -583,6 +583,9 @@ static void kgdb_wait(struct pt_regs *regs)
smp_wmb();
atomic_set(&cpu_in_kgdb[cpu], 1);

+ /* Disable any cpu specific hw breakpoints */
+ kgdb_disable_hw_debug(regs);
+
/* Wait till primary CPU is done with debugging */
while (atomic_read(&passive_cpu_wait[cpu]))
cpu_relax();
--
1.6.3.3

2010-01-26 04:28:48

by Jason Wessel

[permalink] [raw]
Subject: [PATCH 2/4] perf,hw_breakpoint: add lockless reservation for hw_breaks

The kernel debugger cannot take any locks at the risk of deadlocking
the system. This patch implements a simple reservation system using
an atomic variable initialized to the maximum number of system wide
breakpoints. Any time the variable is negative, there are no
remaining unreserved hw breakpoint slots.

The perf hw breakpoint API needs to keep the accounting correct for the
number of system wide breakpoints available at any given time. The
kernel debugger will use the same reservation semantics, but use the
low level API calls to install and remove breakpoints while general
kernel execution is paused.
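
As a minimal illustration of the reservation semantics (this fragment
is not part of the patch; reserve_slot()/release_slot() are made-up
names): atomic_add_negative() subtracts first and reports whether the
result went negative, so a failed reservation has to be undone with
atomic_inc().

    static atomic_t slots = ATOMIC_INIT(4);   /* HBP_NUM slots on x86 */

    static int reserve_slot(void)
    {
            /* subtract first; atomic_add_negative() returns true
             * when the result went below zero */
            if (atomic_add_negative(-1, &slots)) {
                    atomic_inc(&slots);       /* undo: no slot free */
                    return -ENOSPC;
            }
            return 0;            /* slot held until release_slot() */
    }

    static void release_slot(void)
    {
            atomic_inc(&slots);
    }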

CC: Frederic Weisbecker <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: K.Prasad <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Alan Stern <[email protected]>
Signed-off-by: Jason Wessel <[email protected]>
---
arch/x86/kernel/kgdb.c | 12 +++++++++---
include/linux/perf_event.h | 1 +
kernel/hw_breakpoint.c | 16 ++++++++++++++++
3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 3cb2828..2a31f35 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -251,6 +251,7 @@ kgdb_remove_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
return -1;

breakinfo[i].enabled = 0;
+ atomic_inc(&dbg_slots_pinned);

return 0;
}
@@ -277,11 +278,13 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
{
int i;

+ if (atomic_add_negative(-1, &dbg_slots_pinned))
+ goto err_out;
for (i = 0; i < 4; i++)
if (!breakinfo[i].enabled)
break;
if (i == 4)
- return -1;
+ goto err_out;

switch (bptype) {
case BP_HARDWARE_BREAKPOINT:
@@ -295,7 +298,7 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
breakinfo[i].type = X86_BREAKPOINT_RW;
break;
default:
- return -1;
+ goto err_out;
}
switch (len) {
case 1:
@@ -313,12 +316,15 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
break;
#endif
default:
- return -1;
+ goto err_out;
}
breakinfo[i].addr = addr;
breakinfo[i].enabled = 1;

return 0;
+err_out:
+ atomic_inc(&dbg_slots_pinned);
+ return -1;
}

/**
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8fa7187..71f3f05 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -825,6 +825,7 @@ static inline int is_software_event(struct perf_event *event)
}

extern atomic_t perf_swevent_enabled[PERF_COUNT_SW_MAX];
+extern atomic_t dbg_slots_pinned;

extern void __perf_sw_event(u32, u64, int, struct pt_regs *, u64);

diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
index 50dbd59..ddf7951 100644
--- a/kernel/hw_breakpoint.c
+++ b/kernel/hw_breakpoint.c
@@ -55,6 +55,9 @@ static DEFINE_PER_CPU(unsigned int, nr_cpu_bp_pinned);
/* Number of pinned task breakpoints in a cpu */
static DEFINE_PER_CPU(unsigned int, nr_task_bp_pinned[HBP_NUM]);

+/* Slots pinned atomically by the debugger */
+atomic_t dbg_slots_pinned = ATOMIC_INIT(HBP_NUM);
+
/* Number of non-pinned cpu/task breakpoints in a cpu */
static DEFINE_PER_CPU(unsigned int, nr_bp_flexible);

@@ -249,12 +252,24 @@ int reserve_bp_slot(struct perf_event *bp)
int ret = 0;

mutex_lock(&nr_bp_mutex);
+ /*
+ * Grab a dbg_slots_pinned allocation. This atomic variable
+ * allows lockless sharing between the kernel debugger and the
+ * perf hw breakpoints. It represents the total number of
+ * available system wide breakpoints.
+ */
+ if (atomic_add_negative(-1, &dbg_slots_pinned)) {
+ atomic_inc(&dbg_slots_pinned);
+ ret = -ENOSPC;
+ goto end;
+ }

fetch_bp_busy_slots(&slots, bp);

/* Flexible counters need to keep at least one slot */
if (slots.pinned + (!!slots.flexible) == HBP_NUM) {
ret = -ENOSPC;
+ atomic_inc(&dbg_slots_pinned);
goto end;
}

@@ -271,6 +286,7 @@ void release_bp_slot(struct perf_event *bp)
mutex_lock(&nr_bp_mutex);

toggle_bp_slot(bp, false);
+ atomic_inc(&dbg_slots_pinned);

mutex_unlock(&nr_bp_mutex);
}
--
1.6.3.3

2010-01-26 04:28:28

by Jason Wessel

[permalink] [raw]
Subject: [PATCH 4/4] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume

When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets the
time from hardware such as the TSC on x86. In this configuration kgdb
will report a softlockup warning message on resuming or detaching from a
debug session.

Sequence of events in the problem case:

1) "cpu sched clock" and "hardware time" are at 100 sec prior
to a call to kgdb_handle_exception()

2) Debugger waits in kgdb_handle_exception() for 80 sec and on exit
the following is called ... touch_softlockup_watchdog() -->
__raw_get_cpu_var(touch_timestamp) = 0;

3) "cpu sched clock" = 100s (it was not updated, because the interrupt
was disabled in kgdb) but the "hardware time" = 180 sec

4) The first timer interrupt after resuming from kgdb_handle_exception
updates the watchdog from the "cpu sched clock"

update_process_times() {
    ...
    run_local_timers() --> softlockup_tick()
        --> check (touch_timestamp == 0) (it is "YES" here, we have
            set "touch_timestamp = 0" at kgdb)
        --> __touch_softlockup_watchdog()
            ***(A)--> reset "touch_timestamp" to "get_timestamp()"
            (here "touch_timestamp" will still be set to 100s)
    ...
    scheduler_tick()
        ***(B)--> sched_clock_tick() (update "cpu sched clock" to
            "hardware time" = 180s)
    ...
}

5) The second timer interrupt handler appears to see a large jump and
trips the softlockup warning.

update_process_times() {
    ...
    run_local_timers() --> softlockup_tick()
        --> "cpu sched clock" - "touch_timestamp" = 180s - 100s > 60s
        --> printk "soft lockup error messages"
    ...
}

note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)"

Why is "touch_timestamp" 100 sec, instead of 180 sec?

When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of
get_timestamp() is:

get_timestamp(this_cpu)
    --> cpu_clock(this_cpu)
        --> sched_clock_cpu(this_cpu)
            --> __update_sched_clock(sched_clock_data, now)

The __update_sched_clock() function uses the GTOD tick value to create
a window to normalize the "now" values. So if the "now" value is too
big for the sched_clock_data window, it will be ignored.
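
For reference, the window logic in question looks roughly like this (a
simplified sketch of the 2.6.33-era kernel/sched_clock.c, not an exact
copy; scd is the per-cpu sched_clock_data and delta is the raw clock
delta since the last tick):

    /*
     * The candidate clock value is clamped into a window of at most
     * one tick beyond the last tick's GTOD time, so a "now" far
     * ahead of the stale per-cpu data is effectively dropped.
     */
    u64 clock = scd->tick_gtod + delta;             /* candidate */
    u64 min_clock = scd->tick_gtod;                 /* window start */
    u64 max_clock = scd->tick_gtod + TICK_NSEC;     /* window end */

    clock = clamp(clock, min_clock, max_clock);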

The fix is to invoke sched_clock_tick() to update "cpu sched clock" in
order to recover from this state. This is done by introducing the
function touch_softlockup_watchdog_sync(). This allows kgdb to request
that the sched clock be updated when the softlockup watchdog runs for
the first time after a resume from kgdb.

[[email protected]: Use per cpu instead of an array]
Signed-off-by: Jason Wessel <[email protected]>
Signed-off-by: Dongdong Deng <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: [email protected]
---
include/linux/sched.h | 4 ++++
kernel/kgdb.c | 6 +++---
kernel/softlockup.c | 15 +++++++++++++++
3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6f7bba9..8923215 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -310,6 +310,7 @@ extern void sched_show_task(struct task_struct *p);
#ifdef CONFIG_DETECT_SOFTLOCKUP
extern void softlockup_tick(void);
extern void touch_softlockup_watchdog(void);
+extern void touch_softlockup_watchdog_sync(void);
extern void touch_all_softlockup_watchdogs(void);
extern int proc_dosoftlockup_thresh(struct ctl_table *table, int write,
void __user *buffer,
@@ -323,6 +324,9 @@ static inline void softlockup_tick(void)
static inline void touch_softlockup_watchdog(void)
{
}
+static inline void touch_softlockup_watchdog_sync(void)
+{
+}
static inline void touch_all_softlockup_watchdogs(void)
{
}
diff --git a/kernel/kgdb.c b/kernel/kgdb.c
index c7ade62..761fdd2 100644
--- a/kernel/kgdb.c
+++ b/kernel/kgdb.c
@@ -599,7 +599,7 @@ static void kgdb_wait(struct pt_regs *regs)

/* Signal the primary CPU that we are done: */
atomic_set(&cpu_in_kgdb[cpu], 0);
- touch_softlockup_watchdog();
+ touch_softlockup_watchdog_sync();
clocksource_touch_watchdog();
local_irq_restore(flags);
}
@@ -1453,7 +1453,7 @@ acquirelock:
(kgdb_info[cpu].task &&
kgdb_info[cpu].task->pid != kgdb_sstep_pid) && --sstep_tries) {
atomic_set(&kgdb_active, -1);
- touch_softlockup_watchdog();
+ touch_softlockup_watchdog_sync();
clocksource_touch_watchdog();
local_irq_restore(flags);

@@ -1553,7 +1553,7 @@ kgdb_restore:
}
/* Free kgdb_active */
atomic_set(&kgdb_active, -1);
- touch_softlockup_watchdog();
+ touch_softlockup_watchdog_sync();
clocksource_touch_watchdog();
local_irq_restore(flags);

diff --git a/kernel/softlockup.c b/kernel/softlockup.c
index d225790..0d4c789 100644
--- a/kernel/softlockup.c
+++ b/kernel/softlockup.c
@@ -25,6 +25,7 @@ static DEFINE_SPINLOCK(print_lock);
static DEFINE_PER_CPU(unsigned long, softlockup_touch_ts); /* touch timestamp */
static DEFINE_PER_CPU(unsigned long, softlockup_print_ts); /* print timestamp */
static DEFINE_PER_CPU(struct task_struct *, softlockup_watchdog);
+static DEFINE_PER_CPU(bool, softlock_touch_sync);

static int __read_mostly did_panic;
int __read_mostly softlockup_thresh = 60;
@@ -79,6 +80,12 @@ void touch_softlockup_watchdog(void)
}
EXPORT_SYMBOL(touch_softlockup_watchdog);

+void touch_softlockup_watchdog_sync(void)
+{
+ __raw_get_cpu_var(softlock_touch_sync) = true;
+ __raw_get_cpu_var(softlockup_touch_ts) = 0;
+}
+
void touch_all_softlockup_watchdogs(void)
{
int cpu;
@@ -118,6 +125,14 @@ void softlockup_tick(void)
}

if (touch_ts == 0) {
+ if (unlikely(per_cpu(softlock_touch_sync, this_cpu))) {
+ /*
+ * If the time stamp was touched atomically
+ * make sure the scheduler tick is up to date.
+ */
+ per_cpu(softlock_touch_sync, this_cpu) = false;
+ sched_clock_tick();
+ }
__touch_softlockup_watchdog();
return;
}
--
1.6.3.3

2010-01-26 04:37:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Mon, 25 Jan 2010 22:26:39 -0600 Jason Wessel <[email protected]> wrote:

> This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4

It's conventional to quote the patch title as well as the hash, i.e.:

0f8e8ef7c204988246da5a42d576b7fa5277a8e4 ("clocksource: Simplify
clocksource watchdog resume logic")

> Spin locks were added to the clocksource_resume_watchdog() which cause
> the kernel debugger to deadlock on an SMP system frequently.

Please fully describe the deadlock. Without that analysis, the only
way we can work it out is by guessing. This makes it hard for others to
suggest alternative fixes.

> The kernel debugger can try for the lock, but if it fails it should
> continue to touch the clocksource watchdog anyway, else it will trip
> if the general kernel execution has been paused for too long.
>
> This introduces an possible race condition where the kernel debugger
> might not process the list correctly if a clocksource is being added
> or removed at the time of this call. This race is sufficiently rare vs
> having the kernel debugger hang the kernel

A trylock is a pretty ugly "solution" to a locking bug.

> CC: Thomas Gleixner <[email protected]>
> CC: Martin Schwidefsky <[email protected]>
> CC: John Stultz <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Magnus Damm <[email protected]>
> Signed-off-by: Jason Wessel <[email protected]>
> ---
> kernel/time/clocksource.c | 7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index e85c234..74f9ba6 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -463,7 +463,12 @@ void clocksource_resume(void)
> */
> void clocksource_touch_watchdog(void)
> {
> - clocksource_resume_watchdog();
> + unsigned long flags;
> +
> + int got_lock = spin_trylock_irqsave(&watchdog_lock, flags);
> + clocksource_reset_watchdog();
> + if (got_lock)
> + spin_unlock_irqrestore(&watchdog_lock, flags);
> }

If we're going to do this then clocksource_reset_watchdog() should be
uninlined. It shouldn't have been inlined in the first place.

This trylock should be accompanied by an explanation which fully
describes the reasons for its presence. Without that, how can the
code reader work this out?

2010-01-26 08:22:41

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Mon, 25 Jan 2010 22:26:39 -0600
Jason Wessel <[email protected]> wrote:

> This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4
>
> Spin locks were added to the clocksource_resume_watchdog() which cause
> the kernel debugger to deadlock on an SMP system frequently.
>
> The kernel debugger can try for the lock, but if it fails it should
> continue to touch the clocksource watchdog anyway, else it will trip
> if the general kernel execution has been paused for too long.
>
> This introduces an possible race condition where the kernel debugger
> might not process the list correctly if a clocksource is being added
> or removed at the time of this call. This race is sufficiently rare vs
> having the kernel debugger hang the kernel
>
> CC: Thomas Gleixner <[email protected]>
> CC: Martin Schwidefsky <[email protected]>
> CC: John Stultz <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Magnus Damm <[email protected]>
> Signed-off-by: Jason Wessel <[email protected]>

The first question I would ask is why does the kernel deadlock? Can we
have a backchain of a deadlock please?

Hmm, there are all kinds of races if the watchdog code gets interrupted
by the kernel debugger. Wouldn't it be better to just disable the
watchdog while the kernel debugger is active?

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2010-01-26 08:46:04

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Mon, 25 Jan 2010, Jason Wessel wrote:
> This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4
>
> Spin locks were added to the clocksource_resume_watchdog() which cause
> the kernel debugger to deadlock on an SMP system frequently.
>
> The kernel debugger can try for the lock, but if it fails it should
> continue to touch the clocksource watchdog anyway, else it will trip
> if the general kernel execution has been paused for too long.
>
> This introduces an possible race condition where the kernel debugger
> might not process the list correctly if a clocksource is being added
> or removed at the time of this call. This race is sufficiently rare vs
> having the kernel debugger hang the kernel

I'm not really excited about adding a race condition :)

If you stop the kernel in the middle of the watchdog code
(i.e. watchdog_lock is held) then clocksource_reset_watchdog() is not
really a guarantee to keep the TSC alive.

> void clocksource_touch_watchdog(void)
> {
> - clocksource_resume_watchdog();
> + unsigned long flags;
> +
> + int got_lock = spin_trylock_irqsave(&watchdog_lock, flags);

So I prefer

if (!spin_trylock_irqsave(&watchdog_lock, flags))
return;

If that results in TSC being marked unstable then that is way better
than having a race which might even crash or lock the machine when the
stop happened in the middle of a list_add().

Thanks,

tglx

2010-01-26 08:51:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Tue, 26 Jan 2010, Martin Schwidefsky wrote:
> On Mon, 25 Jan 2010 22:26:39 -0600
> Jason Wessel <[email protected]> wrote:
>
> > This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4
> >
> > Spin locks were added to the clocksource_resume_watchdog() which cause
> > the kernel debugger to deadlock on an SMP system frequently.
> >
> > The kernel debugger can try for the lock, but if it fails it should
> > continue to touch the clocksource watchdog anyway, else it will trip
> > if the general kernel execution has been paused for too long.
> >
> > This introduces an possible race condition where the kernel debugger
> > might not process the list correctly if a clocksource is being added
> > or removed at the time of this call. This race is sufficiently rare vs
> > having the kernel debugger hang the kernel
> >
> > CC: Thomas Gleixner <[email protected]>
> > CC: Martin Schwidefsky <[email protected]>
> > CC: John Stultz <[email protected]>
> > CC: Andrew Morton <[email protected]>
> > CC: Magnus Damm <[email protected]>
> > Signed-off-by: Jason Wessel <[email protected]>
>
> The first question I would ask is why does the kernel deadlock? Can we
> have a backchain of a deadlock please?

The problem arises when the kernel is stopped inside the watchdog code
with watchdog_lock held. When kgdb restarts execution then it touches
the watchdog to avoid that TSC gets marked unstable.

> Hmm, there are all kinds of races if the watchdog code gets interrupted
> by the kernel debugger. Wouldn't it be better to just disable the
> watchdog while the kernel debugger is active?

No, we can keep it and in most cases clocksource_touch_watchdog()
helps to keep TSC alive. A simple "if (!trylock) return;" should solve
the deadlock problem for kgdb without opening a can of worms.

Thanks,

tglx

2010-01-26 10:09:38

by Dongdong Deng

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Tue, Jan 26, 2010 at 4:50 PM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 26 Jan 2010, Martin Schwidefsky wrote:
>> On Mon, 25 Jan 2010 22:26:39 -0600
>> Jason Wessel <[email protected]> wrote:
>>
>> > This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4
>> >
>> > Spin locks were added to the clocksource_resume_watchdog() which cause
>> > the kernel debugger to deadlock on an SMP system frequently.
>> >
>> > The kernel debugger can try for the lock, but if it fails it should
>> > continue to touch the clocksource watchdog anyway, else it will trip
>> > if the general kernel execution has been paused for too long.
>> >
>> > This introduces an possible race condition where the kernel debugger
>> > might not process the list correctly if a clocksource is being added
>> > or removed at the time of this call.  This race is sufficiently rare vs
>> > having the kernel debugger hang the kernel
>> >
>> > CC: Thomas Gleixner <[email protected]>
>> > CC: Martin Schwidefsky <[email protected]>
>> > CC: John Stultz <[email protected]>
>> > CC: Andrew Morton <[email protected]>
>> > CC: Magnus Damm <[email protected]>
>> > Signed-off-by: Jason Wessel <[email protected]>
>>
>> The first question I would ask is why does the kernel deadlock? Can we
>> have a backchain of a deadlock please?
>
> The problem arises when the kernel is stopped inside the watchdog code
> with watchdog_lock held. When kgdb restarts execution then it touches
> the watchdog to avoid that TSC gets marked unstable.
>
>> Hmm, there are all kinds of races if the watchdog code gets interrupted
>> by the kernel debugger. Wouldn't it be better to just disable the
>> watchdog while the kernel debugger is active?
>
> No, we can keep it and in most cases it clocksource_touch_watchdog()
> helps to keep TSC alive. A simple "if (!trylock) return;" should solve
> the deadlock problem for kgdb without opening a can of worms.

Is it possible that we reset the clocksource watchdog inside
clocksource_watchdog() ?

From the code view, the action of resetting the clocksource watchdog is
just to set the CLOCK_SOURCE_WATCHDOG flag, thus if we reset it before
using it, I think the logic will be right.


---

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index e85c234..0590983 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -187,6 +187,8 @@ static DEFINE_SPINLOCK(watchdog_lock);
static cycle_t watchdog_last;
static int watchdog_running;

+static int volatile clocksource_resume_watchdog;
+
static int clocksource_watchdog_kthread(void *data);
static void __clocksource_change_rating(struct clocksource *cs, int rating);

@@ -253,6 +255,11 @@ static void clocksource_watchdog(unsigned long data)
if (!watchdog_running)
goto out;

+ if (unlikely(clocksource_resume_watchdog)) {
+ clocksource_reset_watchdog();
+ clocksource_resume_watchdog = 0;
+ }
+
wdnow = watchdog->read(watchdog);
wd_nsec = clocksource_cyc2ns((wdnow - watchdog_last) & watchdog->mask,
watchdog->mult, watchdog->shift);
@@ -341,11 +348,7 @@ static inline void clocksource_reset_watchdog(void)

static void clocksource_resume_watchdog(void)
{
- unsigned long flags;
-
- spin_lock_irqsave(&watchdog_lock, flags);
- clocksource_reset_watchdog();
- spin_unlock_irqrestore(&watchdog_lock, flags);
+ clocksource_resume_watchdog = 1;
}

static void clocksource_enqueue_watchdog(struct clocksource *cs)



>
> Thanks,
>
>        tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


Attachments:
clocksource_resume_watchdog.patch (1.21 kB)

2010-01-26 10:19:38

by Xiaotian Feng

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Tue, Jan 26, 2010 at 6:01 PM, Dongdong Deng <[email protected]> wrote:
> On Tue, Jan 26, 2010 at 4:50 PM, Thomas Gleixner <[email protected]> wrote:
>> On Tue, 26 Jan 2010, Martin Schwidefsky wrote:
>>> On Mon, 25 Jan 2010 22:26:39 -0600
>>> Jason Wessel <[email protected]> wrote:
>>>
>>> > This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4
>>> >
>>> > Spin locks were added to the clocksource_resume_watchdog() which cause
>>> > the kernel debugger to deadlock on an SMP system frequently.
>>> >
>>> > The kernel debugger can try for the lock, but if it fails it should
>>> > continue to touch the clocksource watchdog anyway, else it will trip
>>> > if the general kernel execution has been paused for too long.
>>> >
>>> > This introduces an possible race condition where the kernel debugger
>>> > might not process the list correctly if a clocksource is being added
>>> > or removed at the time of this call.  This race is sufficiently rare vs
>>> > having the kernel debugger hang the kernel
>>> >
>>> > CC: Thomas Gleixner <[email protected]>
>>> > CC: Martin Schwidefsky <[email protected]>
>>> > CC: John Stultz <[email protected]>
>>> > CC: Andrew Morton <[email protected]>
>>> > CC: Magnus Damm <[email protected]>
>>> > Signed-off-by: Jason Wessel <[email protected]>
>>>
>>> The first question I would ask is why does the kernel deadlock? Can we
>>> have a backchain of a deadlock please?
>>
>> The problem arises when the kernel is stopped inside the watchdog code
>> with watchdog_lock held. When kgdb restarts execution then it touches
>> the watchdog to avoid that TSC gets marked unstable.
>>
>>> Hmm, there are all kinds of races if the watchdog code gets interrupted
>>> by the kernel debugger. Wouldn't it be better to just disable the
>>> watchdog while the kernel debugger is active?
>>
>> No, we can keep it and in most cases it clocksource_touch_watchdog()
>> helps to keep TSC alive. A simple "if (!trylock) return;" should solve
>> the deadlock problem for kgdb without opening a can of worms.
>
> Is it possible that we reset the clocksource watchdog during in
> clocksource_watchdog() ?
>
> From the code view, The action of reset clocksource watchdog is just
> set the CLOCK_SOURCE_WATCHDOG flag.
> thus if we reset it before using, I think the logic will be right.

Is this logic right for the resume case?

>
>
> ---
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index e85c234..0590983 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -187,6 +187,8 @@ static DEFINE_SPINLOCK(watchdog_lock);
>  static cycle_t watchdog_last;
>  static int watchdog_running;
>
> +static int volatile clocksource_resume_watchdog;
> +
>  static int clocksource_watchdog_kthread(void *data);
>  static void __clocksource_change_rating(struct clocksource *cs, int rating);
>
> @@ -253,6 +255,11 @@ static void clocksource_watchdog(unsigned long data)
>        if (!watchdog_running)
>                goto out;
>
> +       if (unlikely(clocksource_resume_watchdog)) {
> +               clocksource_reset_watchdog();
> +               clocksource_resume_watchdog = 0;
> +       }
> +
>        wdnow = watchdog->read(watchdog);
>        wd_nsec = clocksource_cyc2ns((wdnow - watchdog_last) & watchdog->mask,
>                                     watchdog->mult, watchdog->shift);
> @@ -341,11 +348,7 @@ static inline void clocksource_reset_watchdog(void)
>
>  static void clocksource_resume_watchdog(void)
>  {
> -       unsigned long flags;
> -
> -       spin_lock_irqsave(&watchdog_lock, flags);
> -       clocksource_reset_watchdog();
> -       spin_unlock_irqrestore(&watchdog_lock, flags);
> +       clocksource_resume_watchdog = 1;
>  }
>
>  static void clocksource_enqueue_watchdog(struct clocksource *cs)
>
>
>
>>
>> Thanks,
>>
>>        tglx
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>

2010-01-26 10:37:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Tue, 26 Jan 2010, Dongdong Deng wrote:
> On Tue, Jan 26, 2010 at 4:50 PM, Thomas Gleixner <[email protected]> wrote:
> > On Tue, 26 Jan 2010, Martin Schwidefsky wrote:
> >> On Mon, 25 Jan 2010 22:26:39 -0600
> >> Jason Wessel <[email protected]> wrote:
> >>
> >> > This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4
> >> >
> >> > Spin locks were added to the clocksource_resume_watchdog() which cause
> >> > the kernel debugger to deadlock on an SMP system frequently.
> >> >
> >> > The kernel debugger can try for the lock, but if it fails it should
> >> > continue to touch the clocksource watchdog anyway, else it will trip
> >> > if the general kernel execution has been paused for too long.
> >> >
> >> > This introduces an possible race condition where the kernel debugger
> >> > might not process the list correctly if a clocksource is being added
> >> > or removed at the time of this call.  This race is sufficiently rare vs
> >> > having the kernel debugger hang the kernel
> >> >
> >> > CC: Thomas Gleixner <[email protected]>
> >> > CC: Martin Schwidefsky <[email protected]>
> >> > CC: John Stultz <[email protected]>
> >> > CC: Andrew Morton <[email protected]>
> >> > CC: Magnus Damm <[email protected]>
> >> > Signed-off-by: Jason Wessel <[email protected]>
> >>
> >> The first question I would ask is why does the kernel deadlock? Can we
> >> have a backchain of a deadlock please?
> >
> > The problem arises when the kernel is stopped inside the watchdog code
> > with watchdog_lock held. When kgdb restarts execution then it touches
> > the watchdog to avoid that TSC gets marked unstable.
> >
> >> Hmm, there are all kinds of races if the watchdog code gets interrupted
> >> by the kernel debugger. Wouldn't it be better to just disable the
> >> watchdog while the kernel debugger is active?
> >
> > No, we can keep it and in most cases it clocksource_touch_watchdog()
> > helps to keep TSC alive. A simple "if (!trylock) return;" should solve
> > the deadlock problem for kgdb without opening a can of worms.
>
> Is it possible that we reset the clocksource watchdog during in
> clocksource_watchdog() ?
>
> From the code view, The action of reset clocksource watchdog is just
> set the CLOCK_SOURCE_WATCHDOG flag.
> thus if we reset it before using, I think the logic will be right.

No, it's not. It just brings back the old flag based logic which we
removed.

The correct way to solve this is a documented

if (!trylock())
return;

in clocksource_touch_watchdog(). And that's what I'm going to push
linuswards.

There is no sane way to reliably prevent TSC from becoming unstable
when kgdb stops the kernel inside the watchdog code. And I do not care
about that at all.

I'm not going to clutter code with crazy workarounds just because some
people believe that using a kernel debugger is a good idea. If people
insist on using kgdb then the possible "TSC becomes unstable" side
effect is the least of their problems.

Thanks,

tglx

2010-01-26 10:43:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Mon, 25 Jan 2010, Jason Wessel wrote:
> This is a regression fix against: 0f8e8ef7c204988246da5a42d576b7fa5277a8e4
>
> Spin locks were added to the clocksource_resume_watchdog() which cause
> the kernel debugger to deadlock on an SMP system frequently.
>
> The kernel debugger can try for the lock, but if it fails it should
> continue to touch the clocksource watchdog anyway, else it will trip
> if the general kernel execution has been paused for too long.
>
> This introduces an possible race condition where the kernel debugger
> might not process the list correctly if a clocksource is being added
> or removed at the time of this call. This race is sufficiently rare vs
> having the kernel debugger hang the kernel
>
> CC: Thomas Gleixner <[email protected]>
> CC: Martin Schwidefsky <[email protected]>
> CC: John Stultz <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Magnus Damm <[email protected]>
> Signed-off-by: Jason Wessel <[email protected]>
> ---
> kernel/time/clocksource.c | 7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index e85c234..74f9ba6 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -463,7 +463,12 @@ void clocksource_resume(void)
> */
> void clocksource_touch_watchdog(void)
> {
> - clocksource_resume_watchdog();
> + unsigned long flags;
> +
> + int got_lock = spin_trylock_irqsave(&watchdog_lock, flags);
> + clocksource_reset_watchdog();
> + if (got_lock)
> + spin_unlock_irqrestore(&watchdog_lock, flags);

Just for the record. This patch would not compile on any platform
which has CONFIG_CLOCKSOURCE_WATCHDOG=n

Thanks,

tglx

2010-01-26 11:17:12

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] kgdb,clocksource: Prevent kernel hang in kernel debugger

On Tue, 26 Jan 2010, Thomas Gleixner wrote:
> There is no sane way to reliably prevent TSC from becoming unstable
> when kgdb stops the kernel inside the watchdog code. And I do not care
> about that at all.
>
> I'm not going to clutter code with crazy workarounds just because some
> people believe that using a kernel debugger is a good idea. If people
> insist on using kgdb then the possible "TSC becomes unstable" side
> effect is the least of their problems.

Btw, if the kernel uses tick based timekeeping or a clock source which
wraps in rather short intervals (e.g. pm-timer wraps after ~4.6
seconds), stopping the kernel with kgdb will inevitably screw up
timekeeping anyway.
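
(The ~4.6s figure follows from the ACPI pm-timer being a 24-bit counter
clocked at 3.579545 MHz: 2^24 / 3579545 Hz ~= 4.69 seconds.)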

So there is really no reason to worry about TSC becoming unstable.

There is only one real sensible solution for this:

Do _not_ use kgdb - which is the modus operandi of every sane kernel
developer on the planet.

Thanks,

tglx

2010-01-26 14:10:39

by Thomas Gleixner

[permalink] [raw]
Subject: [tip:timers/urgent] clocksource: Prevent potential kgdb dead lock

Commit-ID: 7b7422a566aa0dc1e582ce263d4c7ff4a772700a
Gitweb: http://git.kernel.org/tip/7b7422a566aa0dc1e582ce263d4c7ff4a772700a
Author: Thomas Gleixner <[email protected]>
AuthorDate: Tue, 26 Jan 2010 12:51:10 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Tue, 26 Jan 2010 14:53:16 +0100

clocksource: Prevent potential kgdb dead lock

commit 0f8e8ef7 (clocksource: Simplify clocksource watchdog resume
logic) introduced a potential kgdb dead lock. When the kernel is
stopped by kgdb inside code which holds watchdog_lock then kgdb dead
locks in clocksource_resume_watchdog().

clocksource_resume_watchdog() is called from kgdb via
clocksource_touch_watchdog() to avoid that the clock source watchdog
marks TSC unstable after the kernel has been stopped.

Solve this by replacing the spin_lock with a spin_trylock and just
returning in case the lock is held. Not resetting the watchdog might result in
TSC becoming marked unstable, but that's an acceptable penalty for
using kgdb.

The timekeeping is anyway easily screwed up by kgdb when the system
uses either jiffies or a clock source which wraps in short intervals
(e.g. pm_timer wraps about every 4.6s), so we really do not have to
worry about that occasional TSC marked unstable side effect.

The second caller of clocksource_resume_watchdog() is
clocksource_resume(). The trylock is safe here as well because the
system is UP at this point, interrupts are disabled and nothing else
can hold watchdog_lock.

Reported-by: Jason Wessel <[email protected]>
LKML-Reference: <[email protected]>
Cc: [email protected]
Cc: Martin Schwidefsky <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/clocksource.c | 18 +++++++++++++++---
1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index e85c234..1370083 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -343,7 +343,19 @@ static void clocksource_resume_watchdog(void)
{
unsigned long flags;

- spin_lock_irqsave(&watchdog_lock, flags);
+ /*
+ * We use trylock here to avoid a potential dead lock when
+ * kgdb calls this code after the kernel has been stopped with
+ * watchdog_lock held. When watchdog_lock is held we just
+ * return and accept, that the watchdog might trigger and mark
+ * the monitored clock source (usually TSC) unstable.
+ *
+ * This does not affect the other caller clocksource_resume()
+ * because at this point the kernel is UP, interrupts are
+ * disabled and nothing can hold watchdog_lock.
+ */
+ if (!spin_trylock_irqsave(&watchdog_lock, flags))
+ return;
clocksource_reset_watchdog();
spin_unlock_irqrestore(&watchdog_lock, flags);
}
@@ -458,8 +470,8 @@ void clocksource_resume(void)
* clocksource_touch_watchdog - Update watchdog
*
* Update the watchdog after exception contexts such as kgdb so as not
- * to incorrectly trip the watchdog.
- *
+ * to incorrectly trip the watchdog. This might fail when the kernel
+ * was stopped in code which holds watchdog_lock.
*/
void clocksource_touch_watchdog(void)
{

2010-01-26 19:25:50

by Jason Wessel

[permalink] [raw]
Subject: Re: [PATCH 2/4] perf,hw_breakpoint: add lockless reservation for hw_breaks

Jason Wessel wrote:
> The kernel debugger cannot take any locks at the risk of deadlocking

[clip]

The previous version of this patch was wrong for 2 reasons:

1) The reserve_bp_slot() can operate on a single cpu or all the cpus.
The previous version of this patch only worked properly if
reserve_bp_slot() was operating on all the cpus.

2) If you do not have kgdb in the kernel there is no reason to incur
any penalty. If you have CONFIG_KGDB turned off now there is zero
impact or risk to the current hw_breakpoint code.

Below is the new patch, which now passes all the various regression
combinations.

Thanks,
Jason.

---

From: Jason Wessel <[email protected]>
Subject: [PATCH] perf,hw_breakpoint: add lockless reservation for hw_breaks

The kernel debugger cannot take any locks at the risk of deadlocking
the system. This patch implements a simple reservation system using
a per cpu atomic variable initialized to the maximum number of system
wide breakpoints. Any time the variable is negative, there are no
remaining unreserved hw breakpoint slots for that cpu.

The availability of a system wide breakpoint that kgdb can make use of
will be determined by calling _dbg_hw_breakpoint_all_alloc(). All the
slot reservation code lives in kernel/hw_breakpoint.c.

The kernel debugger uses common reservation logic, but for activating
and deactivating breakpoints, it will directly use the low level
API calls provided in the hw_breakpoint API. This is required in
order to perform breakpoint activities without scheduling prior to
resuming general kernel execution.

CC: Frederic Weisbecker <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: K.Prasad <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Alan Stern <[email protected]>
Signed-off-by: Jason Wessel <[email protected]>

---
arch/x86/kernel/kgdb.c | 3 +
include/linux/hw_breakpoint.h | 2 +
kernel/hw_breakpoint.c | 82 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 87 insertions(+)

--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -251,6 +251,7 @@ kgdb_remove_hw_break(unsigned long addr,
return -1;

breakinfo[i].enabled = 0;
+ _dbg_hw_breakpoint_all_free();

return 0;
}
@@ -277,6 +278,8 @@ kgdb_set_hw_break(unsigned long addr, in
{
int i;

+ if (_dbg_hw_breakpoint_all_alloc())
+ return -1;
for (i = 0; i < 4; i++)
if (!breakinfo[i].enabled)
break;
--- a/kernel/hw_breakpoint.c
+++ b/kernel/hw_breakpoint.c
@@ -202,6 +202,82 @@ static void toggle_bp_slot(struct perf_e
per_cpu(nr_cpu_bp_pinned, bp->cpu)--;
}

+#ifdef CONFIG_KGDB
+/* Slots pinned atomically by the debugger */
+static DEFINE_PER_CPU(atomic_t, dbg_slots_pinned) = ATOMIC_INIT(HBP_NUM);
+
+void _dbg_hw_breakpoint_all_free(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ atomic_inc(&per_cpu(dbg_slots_pinned, cpu));
+}
+
+int _dbg_hw_breakpoint_all_alloc(void)
+{
+ int cnt = 0;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cnt++;
+ if (atomic_add_negative(-1, &per_cpu(dbg_slots_pinned, cpu))) {
+ for_each_online_cpu(cpu) {
+ cnt--;
+ atomic_inc(&per_cpu(dbg_slots_pinned, cpu));
+ if (!cnt)
+ break;
+ }
+ return -ENOSPC;
+ }
+ }
+
+ return 0;
+}
+
+static int dbg_hw_breakpoint_alloc(int cpu)
+{
+ int ret = 0;
+ /*
+ * Grab a dbg_slots_pinned allocation. This atomic variable
+ * allows lockless sharing between the kernel debugger and the
+ * perf hw breakpoints. It represents the total number of
+ * available system wide breakpoints.
+ */
+ if (cpu >= 0) {
+ if (atomic_add_negative(-1, &per_cpu(dbg_slots_pinned, cpu))) {
+ atomic_inc(&per_cpu(dbg_slots_pinned, cpu));
+ ret = -ENOSPC;
+ }
+ } else {
+ get_online_cpus();
+ ret = _dbg_hw_breakpoint_all_alloc();
+ put_online_cpus();
+ }
+
+ return ret;
+}
+
+static void dbg_hw_breakpoint_free(int cpu)
+{
+ if (cpu >= 0) {
+ atomic_inc(&per_cpu(dbg_slots_pinned, cpu));
+ } else {
+ get_online_cpus();
+ _dbg_hw_breakpoint_all_free();
+ put_online_cpus();
+ }
+}
+#else /* !CONFIG_KGDB */
+static inline int dbg_hw_breakpoint_alloc(int cpu)
+{
+ return 0;
+}
+static inline void dbg_hw_breakpoint_free(int cpu)
+{
+}
+#endif /* !CONFIG_KGDB */
+
/*
* Constraints to check before allowing this new breakpoint counter:
*
@@ -250,11 +326,16 @@ int reserve_bp_slot(struct perf_event *b

mutex_lock(&nr_bp_mutex);

+ ret = dbg_hw_breakpoint_alloc(bp->cpu);
+ if (ret)
+ goto end;
+
fetch_bp_busy_slots(&slots, bp);

/* Flexible counters need to keep at least one slot */
if (slots.pinned + (!!slots.flexible) == HBP_NUM) {
ret = -ENOSPC;
+ dbg_hw_breakpoint_free(bp->cpu);
goto end;
}

@@ -271,6 +352,7 @@ void release_bp_slot(struct perf_event *
mutex_lock(&nr_bp_mutex);

toggle_bp_slot(bp, false);
+ dbg_hw_breakpoint_free(bp->cpu);

mutex_unlock(&nr_bp_mutex);
}
--- a/include/linux/hw_breakpoint.h
+++ b/include/linux/hw_breakpoint.h
@@ -77,6 +77,8 @@ extern void unregister_wide_hw_breakpoin

extern int reserve_bp_slot(struct perf_event *bp);
extern void release_bp_slot(struct perf_event *bp);
+extern int _dbg_hw_breakpoint_all_alloc(void);
+extern void _dbg_hw_breakpoint_all_free(void);

extern void flush_ptrace_hw_breakpoint(struct task_struct *tsk);

2010-01-26 20:15:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [tip:timers/urgent] clocksource: Prevent potential kgdb dead lock

On Tue, 26 Jan 2010 14:09:45 GMT tip-bot for Thomas Gleixner <[email protected]> wrote:

> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -343,7 +343,19 @@ static void clocksource_resume_watchdog(void)
> {
> unsigned long flags;
>
> - spin_lock_irqsave(&watchdog_lock, flags);
> + /*
> + * We use trylock here to avoid a potential dead lock when
> + * kgdb calls this code after the kernel has been stopped with
> + * watchdog_lock held. When watchdog_lock is held we just
> + * return and accept, that the watchdog might trigger and mark
> + * the monitored clock source (usually TSC) unstable.
> + *
> + * This does not affect the other caller clocksource_resume()
> + * because at this point the kernel is UP, interrupts are
> + * disabled and nothing can hold watchdog_lock.
> + */
> + if (!spin_trylock_irqsave(&watchdog_lock, flags))
> + return;
> clocksource_reset_watchdog();
> spin_unlock_irqrestore(&watchdog_lock, flags);
> }
>
> @@ -458,8 +470,8 @@ void clocksource_resume(void)
> * clocksource_touch_watchdog - Update watchdog
> *
> * Update the watchdog after exception contexts such as kgdb so as not
> - * to incorrectly trip the watchdog.
> - *
> + * to incorrectly trip the watchdog. This might fail when the kernel
> + * was stopped in code which holds watchdog_lock.
> */
> void clocksource_touch_watchdog(void)
> {

It would be neater and a shade more reliable to add a separate
clocksource_try_touch_watchdog() for kgdb's use. Which could be inside
#ifdef CONFIG_KGDB.
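
A minimal sketch of that suggestion (the function name and its
placement are hypothetical, following the proposal above rather than
any merged code) might look like:

    #ifdef CONFIG_KGDB
    /*
     * Best-effort variant for kgdb only: never wait for watchdog_lock,
     * since the debugger may have stopped a CPU while the lock was
     * held.
     */
    void clocksource_try_touch_watchdog(void)
    {
            unsigned long flags;

            if (!spin_trylock_irqsave(&watchdog_lock, flags))
                    return;
            clocksource_reset_watchdog();
            spin_unlock_irqrestore(&watchdog_lock, flags);
    }
    #endif /* CONFIG_KGDB */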

2010-01-26 20:47:13

by Jason Wessel

[permalink] [raw]
Subject: Re: [tip:timers/urgent] clocksource: Prevent potential kgdb dead lock

Andrew Morton wrote:
> On Tue, 26 Jan 2010 14:09:45 GMT tip-bot for Thomas Gleixner <[email protected]> wrote:
>
>
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -343,7 +343,19 @@ static void clocksource_resume_watchdog(void)
>> {
>> unsigned long flags;
>>
>> - spin_lock_irqsave(&watchdog_lock, flags);
>> + /*
>> + * We use trylock here to avoid a potential dead lock when
>> + * kgdb calls this code after the kernel has been stopped with
>> + * watchdog_lock held. When watchdog_lock is held we just
>> + * return and accept, that the watchdog might trigger and mark
>> + * the monitored clock source (usually TSC) unstable.
>> + *
>> + * This does not affect the other caller clocksource_resume()
>> + * because at this point the kernel is UP, interrupts are
>> + * disabled and nothing can hold watchdog_lock.
>> + */
>> + if (!spin_trylock_irqsave(&watchdog_lock, flags))
>> + return;
>> clocksource_reset_watchdog();
>> spin_unlock_irqrestore(&watchdog_lock, flags);
>> }
>>
>> @@ -458,8 +470,8 @@ void clocksource_resume(void)
>> * clocksource_touch_watchdog - Update watchdog
>> *
>> * Update the watchdog after exception contexts such as kgdb so as not
>> - * to incorrectly trip the watchdog.
>> - *
>> + * to incorrectly trip the watchdog. This might fail when the kernel
>> + * was stopped in code which holds watchdog_lock.
>> */
>> void clocksource_touch_watchdog(void)
>> {
>>
>
> It would be neater and a shade more reliable to add a separate
> clocksource_try_touch_watchdog() for kgdb's use, which could live
> inside #ifdef CONFIG_KGDB.
>
>

The kernel debugger is the only user of this API function, so
realistically you do not need a new function; you could simply rename
this one if the name does not fit.

-- more notes on the problem --

I diagnosed the problem further; it is actually reasonably rare in
normal use. It only appeared to happen quite frequently because I was
running an instrumented version of the kernel debugger while trying to
fix hw breakpoints.

The problem is not actually an interprocessor deadlock, but one of
re-entering the spin_lock() when it is already held by the interrupted
task on the same cpu. There is no obvious way that I could see to check
for this specific condition and abort.
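
In sketch form, the failure mode is same-cpu lock recursion rather than
a cross-cpu deadlock (an illustration of the sequence just described,
not code from the kernel):

/*
 * CPU0: clocksource code takes watchdog_lock
 * CPU0: kgdb stops the kernel, interrupting the lock holder
 * CPU0: the kgdb resume path calls clocksource_touch_watchdog()
 * CPU0: spin_lock_irqsave(&watchdog_lock, ...) spins forever, because
 *       the owner is the very task kgdb interrupted, and it cannot run
 *       again until the debugger returns.
 */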

For now the patch Thomas created is sufficient; given some time we'll
decide how best to live with the side effects, or whether to consider
further changes.

Jason.

Tested-by: Jason Wessel <[email protected]>

2010-01-27 17:57:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/4] perf,hw_breakpoint: add lockless reservation for hw_breaks

On Tue, Jan 26, 2010 at 01:25:19PM -0600, Jason Wessel wrote:
> @@ -250,11 +326,16 @@ int reserve_bp_slot(struct perf_event *b
>
> mutex_lock(&nr_bp_mutex);
>
> + ret = dbg_hw_breakpoint_alloc(bp->cpu);
> + if (ret)
> + goto end;
> +



This is totally breaking all the constraints that try to
make the reservation cpu-wide/task-wide aware.

Basically, you just reduced the reservation to 4 breakpoints
per cpu.

The current constraints are able to host thousands of task-wide
breakpoints, given that none of these tasks has more than 4
breakpoints. What you've just added here breaks all this flexibility
and reduces every breakpoint to a per-cpu (or system-wide) breakpoint,
ignoring the per-task contexts and non-pinned events.

Now I still don't understand why you refuse to use
a best-effort approach to locking.

A simple mutex_is_locked() would tell you whether someone
is trying to reserve a breakpoint. And this is
safe, since the whole system is stopped at this time,
right? So once you ensure nobody is fighting with
you for the reservation, you can be sure you are alone
until the end of your reservation.

Or, if it is not guaranteed that the system is stopped when
you reserve a breakpoint for kgdb, you can use
mutex_trylock(). Basically this is the same approach.

If you are fighting against another breakpoint reservation,
it means you are really unlucky; that only happens when
such an event is created through a perf syscall, ptrace or ftrace.

Yes, a user can create a perf/ftrace/ptrace breakpoint while
another user creates one in kgdb, and if the reservations happen
at the same time, either both will succeed or kgdb will fail.
This *might* happen once in the lifetime of the universe; should
we really care about that?

I can write a patch for that if you want.
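
A minimal sketch of that best-effort reservation, reusing the names
from kernel/hw_breakpoint.c (hypothetical code, not a posted patch;
the function name and the exact constraint handling are assumptions):

/* Hypothetical: a reservation path for the debugger that never blocks. */
static int dbg_reserve_bp_slot(struct perf_event *bp)
{
	struct bp_busy_slots slots;

	/* Someone is mid-reservation: back off rather than deadlock. */
	if (!mutex_trylock(&nr_bp_mutex))
		return -EBUSY;

	fetch_bp_busy_slots(&slots, bp);

	/* Flexible counters need to keep at least one slot */
	if (slots.pinned + (!!slots.flexible) == HBP_NUM) {
		mutex_unlock(&nr_bp_mutex);
		return -ENOSPC;
	}

	toggle_bp_slot(bp, true);
	mutex_unlock(&nr_bp_mutex);
	return 0;
}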

2010-01-27 22:30:17

by Jason Wessel

[permalink] [raw]
Subject: Re: [PATCH 2/4] perf,hw_breakpoint: add lockless reservation for hw_breaks

Frederic Weisbecker wrote:
> On Tue, Jan 26, 2010 at 01:25:19PM -0600, Jason Wessel wrote:
>
>> @@ -250,11 +326,16 @@ int reserve_bp_slot(struct perf_event *b
>>
>> mutex_lock(&nr_bp_mutex);
>>
>> + ret = dbg_hw_breakpoint_alloc(bp->cpu);
>> + if (ret)
>> + goto end;
>> +
>>
>
>
>
> This is totally breaking all the constraints that try to
> make the reservation cpu-wide/task-wide aware.
>

Agreed. I was unaware of all the various pieces that the toggle_bp
handled.

A revised series is on the way based on the off-list discussion.

Thanks,
Jason.

2010-01-28 17:11:04

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

On Mon, Jan 25, 2010 at 10:26:37PM -0600, Jason Wessel wrote:
> @@ -466,7 +466,7 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
> {
> int i, cpu, rc = NOTIFY_STOP;
> struct perf_event *bp;
> - unsigned long dr7, dr6;
> + unsigned long dr6;
> unsigned long *dr6_p;
>
> /* The DR6 value is pointed by args->err */
> @@ -477,7 +477,6 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
> if ((dr6 & DR_TRAP_BITS) == 0)
> return NOTIFY_DONE;
>
> - get_debugreg(dr7, 7);
> /* Disable breakpoints during exception handling */
> set_debugreg(0UL, 7);
> /*
> @@ -525,7 +524,7 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
> if (dr6 & (~DR_TRAP_BITS))
> rc = NOTIFY_DONE;
>
> - set_debugreg(dr7, 7);
> + set_debugreg(__get_cpu_var(cpu_dr7), 7);
> put_cpu();



Good simplification, but that doesn't appear related to kgdb,
this should be done in a separate patch, for the perf/core tree.



> static void kgdb_correct_hw_break(void)
> {
> - unsigned long dr7;
> - int correctit = 0;
> - int breakbit;
> int breakno;
>
> - get_debugreg(dr7, 7);
> for (breakno = 0; breakno < 4; breakno++) {
> - breakbit = 2 << (breakno << 1);
> - if (!(dr7 & breakbit) && breakinfo[breakno].enabled) {
> - correctit = 1;
> - dr7 |= breakbit;
> - dr7 &= ~(0xf0000 << (breakno << 2));
> - dr7 |= ((breakinfo[breakno].len << 2) |
> - breakinfo[breakno].type) <<
> - ((breakno << 2) + 16);
> - set_debugreg(breakinfo[breakno].addr, breakno);
> -
> - } else {
> - if ((dr7 & breakbit) && !breakinfo[breakno].enabled) {
> - correctit = 1;
> - dr7 &= ~breakbit;
> - dr7 &= ~(0xf0000 << (breakno << 2));
> - }
> - }
> + struct perf_event *bp;
> + struct arch_hw_breakpoint *info;
> + int val;
> + int cpu = raw_smp_processor_id();
> + if (!breakinfo[breakno].enabled)
> + continue;
> + bp = *per_cpu_ptr(breakinfo[breakno].pev, cpu);
> + info = counter_arch_bp(bp);
> + if (bp->attr.disabled != 1)
> + continue;
> + bp->attr.bp_addr = breakinfo[breakno].addr;
> + bp->attr.bp_len = breakinfo[breakno].len;
> + bp->attr.bp_type = breakinfo[breakno].type;
> + info->address = breakinfo[breakno].addr;
> + info->len = breakinfo[breakno].len;
> + info->type = breakinfo[breakno].type;
> + val = arch_install_hw_breakpoint(bp);
> + if (!val)
> + bp->attr.disabled = 0;
> }
> - if (correctit)
> - set_debugreg(dr7, 7);
> + hw_breakpoint_restore();



hw_breakpoint_restore() is used by KVM only, for now.
The cpu var cpu_debugreg[] contains values that
are only saved when KVM switches to a guest, then
this function is called when KVM switches back to the
host. I bet this is not the function you need.
In fact, I don't know what you intended to do there.



> @@ -278,27 +285,38 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
>
> switch (bptype) {
> case BP_HARDWARE_BREAKPOINT:
> - type = 0;
> - len = 1;
> + len = 1;
> + breakinfo[i].type = X86_BREAKPOINT_EXECUTE;
> break;
> case BP_WRITE_WATCHPOINT:
> - type = 1;
> + breakinfo[i].type = X86_BREAKPOINT_WRITE;
> break;
> case BP_ACCESS_WATCHPOINT:
> - type = 3;
> + breakinfo[i].type = X86_BREAKPOINT_RW;
> break;



Would be nice to have bptype set to the generic flags
we have already in linux/hw_breakpoint.h:

enum {
HW_BREAKPOINT_R = 1,
HW_BREAKPOINT_W = 2,
HW_BREAKPOINT_X = 4,
};


We have functions in x86 to do the conversion to
x86 values in arch/x86/kernel/hw_breakpoint.c

Nothing urgent though, as this patch is a regression fix,
this can be done later.
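
For illustration, the translation could be a small helper along these
lines (hypothetical; only the HW_BREAKPOINT_* flags and the kgdb_bptype
values quoted above are real):

/* Hypothetical helper: map kgdb breakpoint types to the generic flags. */
static int kgdb_bptype_to_generic(enum kgdb_bptype bptype)
{
	switch (bptype) {
	case BP_HARDWARE_BREAKPOINT:
		return HW_BREAKPOINT_X;
	case BP_WRITE_WATCHPOINT:
		return HW_BREAKPOINT_W;
	case BP_ACCESS_WATCHPOINT:
		return HW_BREAKPOINT_R | HW_BREAKPOINT_W;
	default:
		return -1;
	}
}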



> + switch (len) {
> + case 1:
> + breakinfo[i].len = X86_BREAKPOINT_LEN_1;
> + break;
> + case 2:
> + breakinfo[i].len = X86_BREAKPOINT_LEN_2;
> + break;
> + case 4:
> + breakinfo[i].len = X86_BREAKPOINT_LEN_4;
> + break;
> +#ifdef CONFIG_X86_64
> + case 8:
> + breakinfo[i].len = X86_BREAKPOINT_LEN_8;
> + break;
> +#endif



Same here, see arch_build_bp_info().
Actually, arch_validate_hwbkpt_settings() would do all
that for you. It may require a few changes to adapt, though.
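
In sketch form, the adaptation might reduce to something like this (an
assumption about how it could look, untested; it relies on
arch_validate_hwbkpt_settings() taking the event plus a task pointer):

	/* Fill the generic attr and let the arch code validate and
	 * translate it, instead of open-coding the len/type switches.
	 */
	bp->attr.bp_addr = addr;
	bp->attr.bp_len = len;			/* plain byte length: 1/2/4/8 */
	bp->attr.bp_type = HW_BREAKPOINT_W;	/* or _X, _R|_W, from the gdb type */
	if (arch_validate_hwbkpt_settings(bp, NULL) < 0)
		return -1;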

Actually, I don't understand why you encumber yourself with this
breakinfo thing. Why not just keep a per-cpu array
of perf events? You have everything you need inside:
the generic breakpoint attributes in the attrs, and
the arch info in the hw_perf_event struct.

Hence you would be able to use the x86 breakpoint API
we already have; arch_validate_hwbkpt_settings() does
everything for you. This is going to shrink your code
and make it a stronger argument for a pull request,
as a not-that-one-liner regression fix late in the
process (which I must confess is my bad: firstly, I
introduced the regression, and secondly, I should have
reviewed these fixes sooner).



> @@ -313,8 +331,21 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
> */
> void kgdb_disable_hw_debug(struct pt_regs *regs)
> {
> + int i;
> + int cpu = raw_smp_processor_id();
> + struct perf_event *bp;
> +
> /* Disable hardware debugging while we are in kgdb: */
> set_debugreg(0UL, 7);
> + for (i = 0; i < 4; i++) {
> + if (!breakinfo[i].enabled)



See? Here you could use simply bp->attr.disabled instead
of playing with this breakinfo.


> @@ -378,7 +409,6 @@ int kgdb_arch_handle_exception(int e_vector, int signo, int err_code,
> struct pt_regs *linux_regs)
> {
> unsigned long addr;
> - unsigned long dr6;
> char *ptr;
> int newPC;
>
> @@ -448,10 +464,12 @@ single_step_cont(struct pt_regs *regs, struct die_args *args)
> }
>
> static int was_in_debug_nmi[NR_CPUS];
> +static int recieved_hw_brk[NR_CPUS];
>
> static int __kgdb_notify(struct die_args *args, unsigned long cmd)
> {
> struct pt_regs *regs = args->regs;
> + unsigned long *dr6_p;
>
> switch (cmd) {
> case DIE_NMI:
> @@ -485,16 +503,24 @@ static int __kgdb_notify(struct die_args *args, unsigned long cmd)
> break;
>
> case DIE_DEBUG:
> - if (atomic_read(&kgdb_cpu_doing_single_step) ==
> - raw_smp_processor_id()) {
> + dr6_p = (unsigned long *)ERR_PTR(args->err);
> + if (atomic_read(&kgdb_cpu_doing_single_step) != -1) {
> + if (dr6_p && (*dr6_p & DR_STEP) == 0)
> + return NOTIFY_DONE;
> if (user_mode(regs))
> return single_step_cont(regs, args);
> break;
> - } else if (test_thread_flag(TIF_SINGLESTEP))
> + } else if (test_thread_flag(TIF_SINGLESTEP)) {
> /* This means a user thread is single stepping
> * a system call which should be ignored
> */
> return NOTIFY_DONE;
> + } else if (recieved_hw_brk[raw_smp_processor_id()] == 1) {
> + recieved_hw_brk[raw_smp_processor_id()] = 0;
> + return NOTIFY_STOP;
> + } else if (dr6_p && (*dr6_p & DR_TRAP_BITS) == 0) {
> + return NOTIFY_DONE;
> + }



So this is the debug handler, right?


>
> +static void kgdb_hw_bp(struct perf_event *bp, int nmi,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + struct die_args args;
> + int cpu = raw_smp_processor_id();
> +
> + args.trapnr = 0;
> + args.signr = 5;
> + args.err = 0;
> + args.regs = regs;
> + args.str = "debug";
> + recieved_hw_brk[cpu] = 0;
> + if (__kgdb_notify(&args, DIE_DEBUG) == NOTIFY_STOP)
> + recieved_hw_brk[cpu] = 1;
> + else
> + recieved_hw_brk[cpu] = 0;
> +}



And this looks like the perf event handler.

I'm confused by the logic here. We have the x86 breakpoint
handler which calls perf_bp_event which in turn will call
the above. The above calls __kgdb_notify(), but it will
also be called later as it is a debug notifier.



> +
> /**
> * kgdb_arch_init - Perform any architecture specific initialization.
> *
> @@ -539,7 +584,43 @@ static struct notifier_block kgdb_notifier = {
> */
> int kgdb_arch_init(void)
> {
> - return register_die_notifier(&kgdb_notifier);
> + int i, cpu;
> + int ret;
> + struct perf_event_attr attr;
> + struct perf_event **pevent;
> +
> + ret = register_die_notifier(&kgdb_notifier);
> + if (ret != 0)
> + return ret;
> + /*
> + * Pre-allocate the hw breakpoint structures in the non-atomic
> + * portion of kgdb because this operation requires mutexes to
> + * complete.
> + */
> + attr.bp_addr = (unsigned long)kgdb_arch_init;
> + attr.type = PERF_TYPE_BREAKPOINT;
> + attr.bp_len = HW_BREAKPOINT_LEN_1;
> + attr.bp_type = HW_BREAKPOINT_X;
> + attr.disabled = 1;
> + for (i = 0; i < 4; i++) {
> + breakinfo[i].pev = register_wide_hw_breakpoint(&attr,
> + kgdb_hw_bp);



By calling this, you are reserving all the breakpoint slots.



> + if (IS_ERR(breakinfo[i].pev)) {
> + printk(KERN_ERR "kgdb: Could not allocate hw breakpoints\n");
> + breakinfo[i].pev = NULL;
> + kgdb_arch_exit();
> + return -1;
> + }
> + for_each_online_cpu(cpu) {
> + pevent = per_cpu_ptr(breakinfo[i].pev, cpu);
> + pevent[0]->hw.sample_period = 1;
> + if (pevent[0]->destroy != NULL) {
> + pevent[0]->destroy = NULL;
> + release_bp_slot(*pevent);



And then you release these, ok. We should find a proper
way for that later, but for now it should work.

2010-01-28 17:45:44

by Jason Wessel

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

Frederic Weisbecker wrote:
> On Mon, Jan 25, 2010 at 10:26:37PM -0600, Jason Wessel wrote:
>
>> @@ -466,7 +466,7 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
>> {
>> int i, cpu, rc = NOTIFY_STOP;
>> struct perf_event *bp;
>> - unsigned long dr7, dr6;
>> + unsigned long dr6;
>> unsigned long *dr6_p;
>>
>> /* The DR6 value is pointed by args->err */
>> @@ -477,7 +477,6 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
>> if ((dr6 & DR_TRAP_BITS) == 0)
>> return NOTIFY_DONE;
>>
>> - get_debugreg(dr7, 7);
>> /* Disable breakpoints during exception handling */
>> set_debugreg(0UL, 7);
>> /*
>> @@ -525,7 +524,7 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
>> if (dr6 & (~DR_TRAP_BITS))
>> rc = NOTIFY_DONE;
>>
>> - set_debugreg(dr7, 7);
>> + set_debugreg(__get_cpu_var(cpu_dr7), 7);
>> put_cpu();
>>
>
>
>
> Good simplification, but that doesn't appear related to kgdb,
> this should be done in a separate patch, for the perf/core tree.
>
>
>
>

Specifically this is required so that kgdb can modify the state of dr7
by installing and removing breakpoints. Without this change, on return
from the callback the dr7 was not correct.

As far as I know, only kgdb was altering the dr registers during a
callback.

>
>> }
>> - if (correctit)
>> - set_debugreg(dr7, 7);
>> + hw_breakpoint_restore();
>>
>
>
>
> hw_breakpoint_restore() is used by KVM only, for now.
> The cpu var cpu_debugreg[] contains values that
> are only saved when KVM switches to a guest, then
> this function is called when KVM switches back to the
> host. I bet this is not the function you need.
> In fact, I don't know what you intended to do there.
>
>

I was looking to restore the proper contents of the debug registers when
resuming the general kernel execution.

As far as I could tell, it looked like the right function, because
arch_install_hw_breakpoint() uses the per_cpu vars. If there is a save
function that I need to call first, that is a different issue.

IE:
__get_cpu_var(cpu_debugreg[i]) = info->address;


I admit I did not test running a kvm instance, so I don't know what kind
of conflict there would be here. I went and further looked at the kvm
code, and they call the function for the same reason kgdb does. They
want the original system values back on resuming normal kernel
execution. KVM can modify dr7 or other regs directly on entry for its
guest execution. Kgdb does the same sort of thing so as to prevent the
debugger from interrupting itself.
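
For reference, hw_breakpoint_restore() boils down to rewriting the
debug registers from the per-cpu shadow copies kept by
arch/x86/kernel/hw_breakpoint.c (paraphrased here, so treat the exact
body as approximate):

void hw_breakpoint_restore(void)
{
	set_debugreg(__get_cpu_var(cpu_debugreg[0]), 0);
	set_debugreg(__get_cpu_var(cpu_debugreg[1]), 1);
	set_debugreg(__get_cpu_var(cpu_debugreg[2]), 2);
	set_debugreg(__get_cpu_var(cpu_debugreg[3]), 3);
	/* Clear pending trap bits, then restore the dr7 shadow. */
	set_debugreg(current->thread.debugreg6, 6);
	set_debugreg(__get_cpu_var(cpu_dr7), 7);
}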


>
>
>> @@ -278,27 +285,38 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
>>
>> switch (bptype) {
>> case BP_HARDWARE_BREAKPOINT:
>> - type = 0;
>> - len = 1;
>> + len = 1;
>> + breakinfo[i].type = X86_BREAKPOINT_EXECUTE;
>> break;
>> case BP_WRITE_WATCHPOINT:
>> - type = 1;
>> + breakinfo[i].type = X86_BREAKPOINT_WRITE;
>> break;
>> case BP_ACCESS_WATCHPOINT:
>> - type = 3;
>> + breakinfo[i].type = X86_BREAKPOINT_RW;
>> break;
>>
>
>
>
> Would be nice to have bptype set to the generic flags
> we have already in linux/hw_breakpoint.h:
>
> enum {
> HW_BREAKPOINT_R = 1,
> HW_BREAKPOINT_W = 2,
> HW_BREAKPOINT_X = 4,
> };
>
>
>

These numbers have to get translated somewhere from the GDB version
which it handed off via the gdb serial protocol. They could be
translated in the gdb stub, but for now they are in the arch specific
stub. Or you can choose to use the same numbering scheme as gdb for the
breakpoint types and the values could be used directly.

> We have functions in x86 to do the conversion to
> x86 values in arch/x86/kernel/hw_breakpoint.c
>
> Nothing urgent though, as this patch is a regression fix,
> this can be done later.
>
>
>
>
>> + switch (len) {
>> + case 1:
>> + breakinfo[i].len = X86_BREAKPOINT_LEN_1;
>> + break;
>> + case 2:
>> + breakinfo[i].len = X86_BREAKPOINT_LEN_2;
>> + break;
>> + case 4:
>> + breakinfo[i].len = X86_BREAKPOINT_LEN_4;
>> + break;
>> +#ifdef CONFIG_X86_64
>> + case 8:
>> + breakinfo[i].len = X86_BREAKPOINT_LEN_8;
>> + break;
>> +#endif
>>
>
>
>
> Same here, see arch_build_bp_info().
> Actually, arch_validate_hwbkpt_settings() would do all
> that for you. It may require a few changes to adapt, though.
>
> Actually, I don't understand why you encumber yourself with this
> breakinfo thing. Why not just keep a per-cpu array
> of perf events? You have everything you need inside:
> the generic breakpoint attributes in the attrs, and
> the arch info in the hw_perf_event struct.
>

I think the break info thing will go away via a refactor. For now I was
really looking to make it work. There was no way to tell at the time
what values were safe to use in attr struct provided by perf. I would
have further preferred to be able to use the simple -1 cpu in the bp
type and let perf do all the work, but there is no way to allocate a
perf hw wide break like this at the moment.

Realize that what is here is well tested, and aimed to first correct the
regression.

> Hence you would be able to use the x86 breakpoint API
> we already have; arch_validate_hwbkpt_settings() does
> everything for you. This is going to shrink your code
> and make it a stronger argument for a pull request,
> as a not-that-one-liner regression fix late in the
> process (which I must confess is my bad: firstly, I
> introduced the regression, and secondly, I should have
> reviewed these fixes sooner).
>
>
>
>
>> @@ -313,8 +331,21 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
>> */
>> void kgdb_disable_hw_debug(struct pt_regs *regs)
>> {
>> + int i;
>> + int cpu = raw_smp_processor_id();
>> + struct perf_event *bp;
>> +
>> /* Disable hardware debugging while we are in kgdb: */
>> set_debugreg(0UL, 7);
>> + for (i = 0; i < 4; i++) {
>> + if (!breakinfo[i].enabled)
>>
>
>
>
> See? Here you could use simply bp->attr.disabled instead
> of playing with this breakinfo.
>
>
>

Right now, no, because the disabled attribute separately tracks whether
the breakpoint is installed or not. The breakinfo thing crudely tracks
the kgdb breakpoint list while in kgdb, and is then synced with the
real install state.

Yes, I'd like to get rid of it, but it is more changes than I want to
make at the moment.
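
The two states being tracked, in sketch form (an illustration of the
above, not code from the patch):

/*
 * breakinfo[i].enabled   what the user asked for via the gdb protocol
 * bp->attr.disabled      whether the event is installed on this cpu
 *
 * kgdb_correct_hw_break() syncs the former into the latter on resume.
 */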

>> @@ -378,7 +409,6 @@ int kgdb_arch_handle_exception(int e_vector, int signo, int err_code,
>> struct pt_regs *linux_regs)
>> {
>> unsigned long addr;
>> - unsigned long dr6;
>> char *ptr;
>> int newPC;
>>
>> @@ -448,10 +464,12 @@ single_step_cont(struct pt_regs *regs, struct die_args *args)
>> }
>>
>> static int was_in_debug_nmi[NR_CPUS];
>> +static int recieved_hw_brk[NR_CPUS];
>>
>> static int __kgdb_notify(struct die_args *args, unsigned long cmd)
>> {
>> struct pt_regs *regs = args->regs;
>> + unsigned long *dr6_p;
>>
>> switch (cmd) {
>> case DIE_NMI:
>> @@ -485,16 +503,24 @@ static int __kgdb_notify(struct die_args *args, unsigned long cmd)
>> break;
>>
>> case DIE_DEBUG:
>> - if (atomic_read(&kgdb_cpu_doing_single_step) ==
>> - raw_smp_processor_id()) {
>> + dr6_p = (unsigned long *)ERR_PTR(args->err);
>> + if (atomic_read(&kgdb_cpu_doing_single_step) != -1) {
>> + if (dr6_p && (*dr6_p & DR_STEP) == 0)
>> + return NOTIFY_DONE;
>> if (user_mode(regs))
>> return single_step_cont(regs, args);
>> break;
>> - } else if (test_thread_flag(TIF_SINGLESTEP))
>> + } else if (test_thread_flag(TIF_SINGLESTEP)) {
>> /* This means a user thread is single stepping
>> * a system call which should be ignored
>> */
>> return NOTIFY_DONE;
>> + } else if (recieved_hw_brk[raw_smp_processor_id()] == 1) {
>> + recieved_hw_brk[raw_smp_processor_id()] = 0;
>> + return NOTIFY_STOP;
>> + } else if (dr6_p && (*dr6_p & DR_TRAP_BITS) == 0) {
>> + return NOTIFY_DONE;
>> + }
>>
>
>
>
> So this is the debug handler, right?
>
>
>

This ugly bit is all because the patch I had for returning something
from the event callback was tossed.

Because perf does not honor the return code from a callback, there is
no way to dismiss an event in the die notifier. That forces the debug
handler to do very nasty tricks so as not to handle the same event
twice.
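
The double delivery, as a call-flow sketch (reconstructed from the
discussion above):

/*
 * do_debug()
 *   -> notify_die(DIE_DEBUG, ...)
 *        -> hw_breakpoint_handler()     x86 breakpoint code runs first
 *             -> perf_bp_event(bp, ...)
 *                  -> kgdb_hw_bp()      perf callback, calls __kgdb_notify()
 *        -> kgdb_notify()               same die chain, same event
 *             -> __kgdb_notify()        second entry, must be swallowed
 *
 * recieved_hw_brk[] is the per-cpu flag used to drop the second entry.
 */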

>>
>> +static void kgdb_hw_bp(struct perf_event *bp, int nmi,
>> + struct perf_sample_data *data,
>> + struct pt_regs *regs)
>> +{
>> + struct die_args args;
>> + int cpu = raw_smp_processor_id();
>> +
>> + args.trapnr = 0;
>> + args.signr = 5;
>> + args.err = 0;
>> + args.regs = regs;
>> + args.str = "debug";
>> + recieved_hw_brk[cpu] = 0;
>> + if (__kgdb_notify(&args, DIE_DEBUG) == NOTIFY_STOP)
>> + recieved_hw_brk[cpu] = 1;
>> + else
>> + recieved_hw_brk[cpu] = 0;
>> +}
>>
>
>
>
> And this looks like the perf event handler.
>
> I'm confused by the logic here. We have the x86 breakpoint
> handler which calls perf_bp_event which in turn will call
> the above. The above calls __kgdb_notify(), but it will
> also be called later as it is a debug notifier.
>
>
>

Perf was eating my events if I had no callback. If you know a way
around this, let me know.


>
>> +
>> /**
>> * kgdb_arch_init - Perform any architecture specific initialization.
>> *
>> @@ -539,7 +584,43 @@ static struct notifier_block kgdb_notifier = {
>> */
>> int kgdb_arch_init(void)
>> {
>> - return register_die_notifier(&kgdb_notifier);
>> + int i, cpu;
>> + int ret;
>> + struct perf_event_attr attr;
>> + struct perf_event **pevent;
>> +
>> + ret = register_die_notifier(&kgdb_notifier);
>> + if (ret != 0)
>> + return ret;
>> + /*
>> + * Pre-allocate the hw breakpoint structures in the non-atomic
>> + * portion of kgdb because this operation requires mutexes to
>> + * complete.
>> + */
>> + attr.bp_addr = (unsigned long)kgdb_arch_init;
>> + attr.type = PERF_TYPE_BREAKPOINT;
>> + attr.bp_len = HW_BREAKPOINT_LEN_1;
>> + attr.bp_type = HW_BREAKPOINT_X;
>> + attr.disabled = 1;
>> + for (i = 0; i < 4; i++) {
>> + breakinfo[i].pev = register_wide_hw_breakpoint(&attr,
>> + kgdb_hw_bp);
>>
>
>
>
> By calling this, you are reserving all the breakpoint slots.
>
>

Looking further down, you will see that only 1 slot is used, because
it is immediately released.

>
>
>> + if (IS_ERR(breakinfo[i].pev)) {
>> + printk(KERN_ERR "kgdb: Could not allocate hw breakpoints\n");
>> + breakinfo[i].pev = NULL;
>> + kgdb_arch_exit();
>> + return -1;
>> + }
>> + for_each_online_cpu(cpu) {
>> + pevent = per_cpu_ptr(breakinfo[i].pev, cpu);
>> + pevent[0]->hw.sample_period = 1;
>> + if (pevent[0]->destroy != NULL) {
>> + pevent[0]->destroy = NULL;
>> + release_bp_slot(*pevent);
>>
>
>
>
> And then you release these, ok. We should find a proper
> way for that later, but for now it should work.
>
>

Previously, I was unable to convince anyone that the kernel debugger
needed to be able to do this. I had an API change to perf for it, but
it was dismissed along with the notify return value on the callback.
This is merely a workaround to correct the regression.

Jason.

2010-01-28 20:00:59

by Jason Wessel

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

New version, which loses the callback registration: if we register as
a different breakpoint type, perf won't reject it. I didn't see that
loophole in the API the first time around.

Also the dr7 fixup in hw_breakpoint.c is not needed.

The trickery for entering __kgdb_notify() twice is gone too.

All the regression tests are still passing.

---

 arch/x86/kernel/kgdb.c |  173 ++++++++++++++++++++++++++++++++-----------------
 kernel/kgdb.c          |    3 +
 2 files changed, 118 insertions(+), 58 deletions(-)

--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -42,6 +42,7 @@
#include <linux/init.h>
#include <linux/smp.h>
#include <linux/nmi.h>
+#include <linux/hw_breakpoint.h>

#include <asm/debugreg.h>
#include <asm/apicdef.h>
@@ -204,40 +205,38 @@ void gdb_regs_to_pt_regs(unsigned long *

static struct hw_breakpoint {
unsigned enabled;
- unsigned type;
- unsigned len;
unsigned long addr;
+ int len;
+ int type;
+ struct perf_event **pev;
} breakinfo[4];

static void kgdb_correct_hw_break(void)
{
- unsigned long dr7;
- int correctit = 0;
- int breakbit;
int breakno;

- get_debugreg(dr7, 7);
for (breakno = 0; breakno < 4; breakno++) {
- breakbit = 2 << (breakno << 1);
- if (!(dr7 & breakbit) && breakinfo[breakno].enabled) {
- correctit = 1;
- dr7 |= breakbit;
- dr7 &= ~(0xf0000 << (breakno << 2));
- dr7 |= ((breakinfo[breakno].len << 2) |
- breakinfo[breakno].type) <<
- ((breakno << 2) + 16);
- set_debugreg(breakinfo[breakno].addr, breakno);
-
- } else {
- if ((dr7 & breakbit) && !breakinfo[breakno].enabled) {
- correctit = 1;
- dr7 &= ~breakbit;
- dr7 &= ~(0xf0000 << (breakno << 2));
- }
- }
+ struct perf_event *bp;
+ struct arch_hw_breakpoint *info;
+ int val;
+ int cpu = raw_smp_processor_id();
+ if (!breakinfo[breakno].enabled)
+ continue;
+ bp = *per_cpu_ptr(breakinfo[breakno].pev, cpu);
+ info = counter_arch_bp(bp);
+ if (bp->attr.disabled != 1)
+ continue;
+ bp->attr.bp_addr = breakinfo[breakno].addr;
+ bp->attr.bp_len = breakinfo[breakno].len;
+ bp->attr.bp_type = breakinfo[breakno].type;
+ info->address = breakinfo[breakno].addr;
+ info->len = breakinfo[breakno].len;
+ info->type = breakinfo[breakno].type;
+ val = arch_install_hw_breakpoint(bp);
+ if (!val)
+ bp->attr.disabled = 0;
}
- if (correctit)
- set_debugreg(dr7, 7);
+ hw_breakpoint_restore();
}

static int
@@ -259,15 +258,23 @@ kgdb_remove_hw_break(unsigned long addr,
static void kgdb_remove_all_hw_break(void)
{
int i;
+ int cpu = raw_smp_processor_id();
+ struct perf_event *bp;

- for (i = 0; i < 4; i++)
- memset(&breakinfo[i], 0, sizeof(struct hw_breakpoint));
+ for (i = 0; i < 4; i++) {
+ if (!breakinfo[i].enabled)
+ continue;
+ bp = *per_cpu_ptr(breakinfo[i].pev, cpu);
+ if (bp->attr.disabled == 1)
+ continue;
+ arch_uninstall_hw_breakpoint(bp);
+ bp->attr.disabled = 1;
+ }
}

static int
kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
{
- unsigned type;
int i;

for (i = 0; i < 4; i++)
@@ -278,27 +285,38 @@ kgdb_set_hw_break(unsigned long addr, in

switch (bptype) {
case BP_HARDWARE_BREAKPOINT:
- type = 0;
- len = 1;
+ len = 1;
+ breakinfo[i].type = X86_BREAKPOINT_EXECUTE;
break;
case BP_WRITE_WATCHPOINT:
- type = 1;
+ breakinfo[i].type = X86_BREAKPOINT_WRITE;
break;
case BP_ACCESS_WATCHPOINT:
- type = 3;
+ breakinfo[i].type = X86_BREAKPOINT_RW;
break;
default:
return -1;
}
-
- if (len == 1 || len == 2 || len == 4)
- breakinfo[i].len = len - 1;
- else
- return -1;
-
- breakinfo[i].enabled = 1;
+ switch (len) {
+ case 1:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_1;
+ break;
+ case 2:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_2;
+ break;
+ case 4:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_4;
+ break;
+#ifdef CONFIG_X86_64
+ case 8:
+ breakinfo[i].len = X86_BREAKPOINT_LEN_8;
+ break;
+#endif
+ default:
+ return -1;
+ }
breakinfo[i].addr = addr;
- breakinfo[i].type = type;
+ breakinfo[i].enabled = 1;

return 0;
}
@@ -313,8 +331,21 @@ kgdb_set_hw_break(unsigned long addr, in
*/
void kgdb_disable_hw_debug(struct pt_regs *regs)
{
+ int i;
+ int cpu = raw_smp_processor_id();
+ struct perf_event *bp;
+
/* Disable hardware debugging while we are in kgdb: */
set_debugreg(0UL, 7);
+ for (i = 0; i < 4; i++) {
+ if (!breakinfo[i].enabled)
+ continue;
+ bp = *per_cpu_ptr(breakinfo[i].pev, cpu);
+ if (bp->attr.disabled == 1)
+ continue;
+ arch_uninstall_hw_breakpoint(bp);
+ bp->attr.disabled = 1;
+ }
}

/**
@@ -378,7 +409,6 @@ int kgdb_arch_handle_exception(int e_vec
struct pt_regs *linux_regs)
{
unsigned long addr;
- unsigned long dr6;
char *ptr;
int newPC;

@@ -404,20 +434,6 @@ int kgdb_arch_handle_exception(int e_vec
raw_smp_processor_id());
}

- get_debugreg(dr6, 6);
- if (!(dr6 & 0x4000)) {
- int breakno;
-
- for (breakno = 0; breakno < 4; breakno++) {
- if (dr6 & (1 << breakno) &&
- breakinfo[breakno].type == 0) {
- /* Set restore flag: */
- linux_regs->flags |= X86_EFLAGS_RF;
- break;
- }
- }
- }
- set_debugreg(0UL, 6);
kgdb_correct_hw_break();

return 0;
@@ -485,8 +501,7 @@ static int __kgdb_notify(struct die_args
break;

case DIE_DEBUG:
- if (atomic_read(&kgdb_cpu_doing_single_step) ==
- raw_smp_processor_id()) {
+ if (atomic_read(&kgdb_cpu_doing_single_step) != -1) {
if (user_mode(regs))
return single_step_cont(regs, args);
break;
@@ -539,7 +554,42 @@ static struct notifier_block kgdb_notifi
*/
int kgdb_arch_init(void)
{
- return register_die_notifier(&kgdb_notifier);
+ int i, cpu;
+ int ret;
+ struct perf_event_attr attr;
+ struct perf_event **pevent;
+
+ ret = register_die_notifier(&kgdb_notifier);
+ if (ret != 0)
+ return ret;
+ /*
+ * Pre-allocate the hw breakpoint structures in the non-atomic
+ * portion of kgdb because this operation requires mutexes to
+ * complete.
+ */
+ attr.bp_addr = (unsigned long)kgdb_arch_init;
+ attr.type = PERF_TYPE_BREAKPOINT;
+ attr.bp_len = HW_BREAKPOINT_LEN_1;
+ attr.bp_type = HW_BREAKPOINT_W;
+ attr.disabled = 1;
+ for (i = 0; i < 4; i++) {
+ breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL);
+ if (IS_ERR(breakinfo[i].pev)) {
+ printk(KERN_ERR "kgdb: Could not allocate hw breakpoints\n");
+ breakinfo[i].pev = NULL;
+ kgdb_arch_exit();
+ return -1;
+ }
+ for_each_online_cpu(cpu) {
+ pevent = per_cpu_ptr(breakinfo[i].pev, cpu);
+ pevent[0]->hw.sample_period = 1;
+ if (pevent[0]->destroy != NULL) {
+ pevent[0]->destroy = NULL;
+ release_bp_slot(*pevent);
+ }
+ }
+ }
+ return ret;
}

/**
@@ -550,6 +600,13 @@ int kgdb_arch_init(void)
*/
void kgdb_arch_exit(void)
{
+ int i;
+ for (i = 0; i < 4; i++) {
+ if (breakinfo[i].pev) {
+ unregister_wide_hw_breakpoint(breakinfo[i].pev);
+ breakinfo[i].pev = NULL;
+ }
+ }
unregister_die_notifier(&kgdb_notifier);
}

--- a/kernel/kgdb.c
+++ b/kernel/kgdb.c
@@ -578,6 +578,9 @@ static void kgdb_wait(struct pt_regs *re
smp_wmb();
atomic_set(&cpu_in_kgdb[cpu], 1);

+ /* Disable any cpu specific hw breakpoints */
+ kgdb_disable_hw_debug(regs);
+
/* Wait till primary CPU is done with debugging */
while (atomic_read(&passive_cpu_wait[cpu]))
cpu_relax();

2010-01-28 20:04:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

On Thu, Jan 28, 2010 at 11:44:52AM -0600, Jason Wessel wrote:
> Frederic Weisbecker wrote:
> > Good simplification, but that doesn't appear related to kgdb,
> > this should be done in a separate patch, for the perf/core tree.
> >
> >
>
> Specifically this is required so that kgdb can modify the state of dr7
> by installing and removing breakpoints. Without this change, on return
> from the callback the dr7 was not correct.
>
> As far as I know, only kgdb was altering the dr registers during a
> callback.



Ok. Well, I'm not sure how/where it needs to modify dr7 directly.



> > hw_breakpoint_restore() is used by KVM only, for now.
> > The cpu var cpu_debugreg[] contains values that
> > are only saved when KVM switches to a guest, then
> > this function is called when KVM switches back to the
> > host. I bet this is not the function you need.
> > In fact, I don't know what you intended to do there.
> >
> >
>
> I was looking to restore the proper contents of the debug registers when
> resuming the general kernel execution.
>
> As far as I could tell, it looked like the right function, because
> arch_install_hw_breakpoint() uses the per_cpu vars. If there is a save
> function that I need to call first, that is a different issue.
>
> IE:
> __get_cpu_var(cpu_debugreg[i]) = info->address;
>
>
> I admit I did not test running a kvm instance, so I don't know what kind
> of conflict there would be here. I went and further looked at the kvm
> code, and they call the function for the same reason kgdb does. They
> want the original system values back on resuming normal kernel
> execution. KVM can modify dr7 or other regs directly on entry for its
> guest execution. Kgdb does the same sort of thing so as to prevent the
> debugger from interrupting itself.



You mean kgdb needs to disable dr7 while handling a breakpoint to
avoid recursion? In this case this is something already done
from the x86 breakpoint handler.



> > Would be nice to have bptype set to the generic flags
> > we have already in linux/hw_breakpoint.h:
> >
> > enum {
> > HW_BREAKPOINT_R = 1,
> > HW_BREAKPOINT_W = 2,
> > HW_BREAKPOINT_X = 4,
> > };
> >
> >
> >
>
> These numbers have to get translated somewhere from the GDB version
> which it handed off via the gdb serial protocol. They could be
> translated in the gdb stub, but for now they are in the arch specific
> stub. Or you can choose to use the same numbering scheme as gdb for the
> breakpoint types and the values could be used directly.



Ah ok. Well, translating gdb <-> generic values will make you
move this code from x86 to core at least.


> > Same here, see arch_build_bp_info().
> > Actually, arch_validate_hwbkpt_settings() would do all
> > that for you. It may require a few changes to adapt, though.
> >
> > Actually, I don't understand why you encumber yourself with this
> > breakinfo thing. Why not just keep a per-cpu array
> > of perf events? You have everything you need inside:
> > the generic breakpoint attributes in the attrs, and
> > the arch info in the hw_perf_event struct.
> >
>
> I think the break info thing will go away via a refactor. For now I was
> really looking to make it work. There was no way to tell at the time
> what values were safe to use in attr struct provided by perf. I would
> have further preferred to be able to use the simple -1 cpu in the bp
> type and let perf do all the work, but there is no way to allocate a
> perf hw wide break like this at the moment.



Ok.



> Realize that what is here is well tested, and aimed to first correct the
> regression.


Sure.


> >> + } else if (recieved_hw_brk[raw_smp_processor_id()] == 1) {
> >> + recieved_hw_brk[raw_smp_processor_id()] = 0;
> >> + return NOTIFY_STOP;
> >> + } else if (dr6_p && (*dr6_p & DR_TRAP_BITS) == 0) {
> >> + return NOTIFY_DONE;
> >> + }
> >>
> >
> >
> >
> > So this is the debug handler, right?
> >
> >
> >
>
> This ugly bit is all because the patch I had for returning something
> from the event callback was tossed.
>
> Because perf does not honor the return code from a callback, there is
> no way to dismiss an event in the die notifier. That forces the debug
> handler to do very nasty tricks so as not to handle the same event twice.



Right, we'll need to fix that later from perf, I think.



> >>
> >> +static void kgdb_hw_bp(struct perf_event *bp, int nmi,
> >> + struct perf_sample_data *data,
> >> + struct pt_regs *regs)
> >> +{
> >> + struct die_args args;
> >> + int cpu = raw_smp_processor_id();
> >> +
> >> + args.trapnr = 0;
> >> + args.signr = 5;
> >> + args.err = 0;
> >> + args.regs = regs;
> >> + args.str = "debug";
> >> + recieved_hw_brk[cpu] = 0;
> >> + if (__kgdb_notify(&args, DIE_DEBUG) == NOTIFY_STOP)
> >> + recieved_hw_brk[cpu] = 1;
> >> + else
> >> + recieved_hw_brk[cpu] = 0;
> >> +}
> >>
> >
> >
> >
> > And this looks like the perf event handler.
> >
> > I'm confused by the logic here. We have the x86 breakpoint
> > handler which calls perf_bp_event which in turn will call
> > the above. The above calls __kgdb_notify(), but it will
> > also be called later as it is a debug notifier.
> >
> >
> >
>
> Perf was eating my events if I had no callback. If you know a way
> around this, let me know.


The problem is not the perf callback. What I did not understand
was the use of __kgdb_notify() from the callback, while it is
still called after as a debug notifier.



> > And then you release these, ok. We should find a proper
> > way for that later, but for now it should work.
> >
> >
>
> Previously, I was unable to convince anyone that the kernel debugger
> needed to be able to do this. I had an API change to perf for it, but
> it was dismissed along with the notify return value on the callback.
> This is merely a workaround to correct the regression.


We'll need to sanitize that after the regression fix.

2010-01-28 20:17:51

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

On Thu, Jan 28, 2010 at 01:58:26PM -0600, Jason Wessel wrote:
> New version, which loses the callback registration: if we register as
> a different breakpoint type, perf won't reject it.



I don't understand what you mean here.



> I didn't
> see that loophole in the API the first time around.
>
> Also the dr7 fixup in hw_breakpoint.c is not needed.
>
> The trickery for entering __kgdb_notify() twice is gone too.
>
> All the regression tests are still passing.



Ok. Well, let's go for it, as it's a regression that
needs to be fixed anyway.

But we'll need to sanitize various things after that :)

2010-01-28 20:30:37

by Jason Wessel

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

Frederic Weisbecker wrote:
> On Thu, Jan 28, 2010 at 11:44:52AM -0600, Jason Wessel wrote:
>
>> Frederic Weisbecker wrote:
>>
>>> Good simplification, but that doesn't appear related to kgdb,
>>> this should be done in a separate patch, for the perf/core tree.
>>>
>>>
>>>
>> Specifically this is required so that kgdb can modify the state of dr7
>> by installing and removing breakpoints. Without this change, on return
>> from the callback the dr7 was not correct.
>>
>> As far as I know, only kgdb was altering the dr registers during a
>> callback.
>>
>
>
>
> Ok. Well, I'm not sure how/where it needs to modify dr7 directly.
>
>

This is not needed any more with the patch I already followed up with.


To provide an explanation though: dr7 got updated as a result of
installing some breakpoints from within kgdb, while in the callback.
And that value got squashed by whatever had been saved on the stack as
a local in the perf breakpoint handler.

>
>
>> I admit I did not test running a kvm instance, so I don't know what kind
>> of conflict there would be here. I went and further looked at the kvm
>> code, and they call the function for the same reason kgdb does. They
>> want the original system values back on resuming normal kernel
>> execution. KVM can modify dr7 or other regs directly on entry for its
>> guest execution. Kgdb does the same sort of thing so as to prevent the
>> debugger from interrupting itself.
>>
>
>
>
> You mean kgdb needs to disable dr7 while handling a breakpoint to
> avoid recursion? In this case this is something already done
> from the x86 breakpoint handler.
>
>
>

True, that is done for the master core, but not the slave cores, which
we stop dead in their tracks with an NMI, so they have not had dr7
zeroed out.

Obviously we restore it on resume.

>
>
>>> Would be nice to have bptype set to the generic flags
>>> we have already in linux/hw_breakpoint.h:
>>>
>>> enum {
>>> HW_BREAKPOINT_R = 1,
>>> HW_BREAKPOINT_W = 2,
>>> HW_BREAKPOINT_X = 4,
>>> };
>>>
>>>
>>>
>>>
>> These numbers have to get translated somewhere from the GDB version
>> which it handed off via the gdb serial protocol. They could be
>> translated in the gdb stub, but for now they are in the arch specific
>> stub. Or you can choose to use the same numbering scheme as gdb for the
>> breakpoint types and the values could be used directly.
>>
>
>
>
> Ah ok. Well, translating gdb <-> generic values will make you
> move this code from x86 to core at least.
>
>


I'll move this to the gdbstub in the next merge window, because it looks
like there might also be other archs that gain the
arch/*/kernel/hw_breakpoint.c.

I can also move the logic for the install / remove to be owned by
either hw_breakpoint.c or the debug core, because that will become
arch-independent as more archs pick up the hw_breakpoint.c methodology.


Thanks,
Jason.

2010-01-28 21:25:25

by Jason Wessel

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

Frederic Weisbecker wrote:
> On Thu, Jan 28, 2010 at 01:58:26PM -0600, Jason Wessel wrote:
>
>> New version, which loses the callback registration: if we register as
>> a different breakpoint type, perf won't reject it.
>>
>
>
>
> I don't understand what you mean here.
>
>
>
>

This was the change:

- attr.bp_type = HW_BREAKPOINT_X;
+ attr.bp_type = HW_BREAKPOINT_W;

The perf API had extra constraint checks when you did not have a
callback function and the type was HW_BREAKPOINT_X. Using the
HW_BREAKPOINT_W type avoids the extra checks that would not allow me to
register without providing a callback.

This issue will go away entirely if we allow the allocation of an event
that is disabled, for the purpose of the kernel debugger. No matter how
you look at it, it is still a workaround at the current time.
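
In context, the workaround amounts to nothing more than the type used
when the placeholder events are registered (the attr setup is from the
follow-up patch; the comments are added here to capture the reasoning):

	attr.bp_addr = (unsigned long)kgdb_arch_init;
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_len = HW_BREAKPOINT_LEN_1;
	/*
	 * HW_BREAKPOINT_W rather than HW_BREAKPOINT_X: execute
	 * breakpoints registered without a callback hit extra constraint
	 * checks, so register as a write breakpoint and fix up the real
	 * type in kgdb_correct_hw_break() before the event is installed.
	 */
	attr.bp_type = HW_BREAKPOINT_W;
	attr.disabled = 1;
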
>> I didn't
>> see that loophole in the API the first time around.
>>
>> Also the dr7 fixup in hw_breakpoint.c is not needed.
>>
>> The trickery for entering __kgdb_notify() twice is gone too.
>>
>> All the regression tests are still passing.
>>
>
>
>
> Ok. Well, let's go for it, as it's a regression that
> needs to be fixed anyway.
>
> But we'll need to sanitize various things after that :)
>


Agreed.

Can I take that to mean you have acked this version of the patch?

Thanks,
Jason.

2010-01-28 21:50:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

On Thu, Jan 28, 2010 at 02:27:54PM -0600, Jason Wessel wrote:
> I'll move this to the gdbstub in the next merge window, because it looks
> like there might also be other archs that gain the
> arch/*/kernel/hw_breakpoint.c.


Yeah, PowerPc, S390 and ARM are in progress.


> I can also move the logic for the install / remove to be owned by
> either hw_breakpoint.c or the debug core, because that will become
> arch-independent as more archs pick up the hw_breakpoint.c methodology.



Indeed.

2010-01-28 21:54:36

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86,hw_breakpoint,kgdb: kgdb to use hw_breakpoint API

On Thu, Jan 28, 2010 at 02:23:13PM -0600, Jason Wessel wrote:
> Frederic Weisbecker wrote:
> > On Thu, Jan 28, 2010 at 01:58:26PM -0600, Jason Wessel wrote:
> >
> >> New version, which loses the callback registration: if we register as
> >> a different breakpoint type, perf won't reject it.
> >>
> >
> >
> >
> > I don't understand what you mean here.
> >
> >
> >
> >
>
> This was the change:
>
> - attr.bp_type = HW_BREAKPOINT_X;
> + attr.bp_type = HW_BREAKPOINT_W;
>
> The perf API had extra constraint checks when you did not have a
> callback function and the type was HW_BREAKPOINT_X. Using the
> HW_BREAKPOINT_W type avoids the extra checks that would not allow me to
> register without providing a callback.


Oh, where have you seen this?



> > Ok. Well, let's go for it, as it's a regression that
> > needs to be fixed anyway.
> >
> > But we'll need to sanitize various things after that :)
> >
>
>
> Agreed.
>
> Can I take that to mean you have acked this version of the patch?
>
> Thanks,
> Jason.


Yep, please add my ack.

Thanks!