2009-07-20 17:08:16

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 0/5] hw-breakpoints: Make the API generic + support for perfcounters

Hi,

This patchset aims to make the hardware breakpoint API generic and
also to provide the support for hardware breakpoints from perfcounters.

The hardware breakpoint API changes are a first shot of idea.

The support for perf counter has strange effects. It looks like only
a part of the breakpoint events are actually recorded. The triggered callback
receives well the events but if there are only few of them, none is passed
in userspace (or userspace ignores them, dunno). Once I have some time, I'll
take a deeper look inside the perf mmap syscall or whatever could be
the origin.

Thanks.

Example which traces the BKL accesses by stressing reiserfs:

bkl=0x$(cat ../../System.map | grep kernel_flag | cut -d" " -f 1)
./perf record -f -g -e 5:$bkl -- dbench -t 20 20
./perf report -g -s s
# Samples: 36
#
# Overhead Symbol
# ........ ......
#
83.33% [k] lock_kernel
|
|--51.72%-- __pwrite64
|
|--27.59%-- 0x7f9d83a17b15
|
|--13.79%-- __open
|
--10.34%-- unlink

16.67% [k] unlock_kernel
|
|--80.00%-- 0x7f9d83a17b15
|
|--20.00%-- unlink
|
--20.00%-- __pwrite64

Frederic Weisbecker (5):
hw-breakpoints: Make kernel breakpoints API truly generic
hw-breakpoints: Pull up the target symbol in a generic field
hw-breakpoints: Make user breakpoints API truly generic
perfcounter: Grow the event number to 64 bits
perfcounter: Add support for kernel hardware breakpoints

arch/x86/include/asm/hw_breakpoint.h | 9 ++-
arch/x86/kernel/hw_breakpoint.c | 91 +++++++++++-----
arch/x86/kernel/ptrace.c | 21 ++--
include/asm-generic/hw_breakpoint.h | 116 ++++++++-------------
include/linux/perf_counter.h | 1 +
kernel/hw_breakpoint.c | 173 ++++++++++++++++++++++++++++---
kernel/perf_counter.c | 41 +++++++-
kernel/trace/trace.h | 5 +-
kernel/trace/trace_ksym.c | 67 ++++++------
kernel/trace/trace_selftest.c | 2 +-
samples/hw_breakpoint/data_breakpoint.c | 10 +--
11 files changed, 368 insertions(+), 168 deletions(-)


2009-07-20 17:08:25

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 1/5] hw-breakpoints: Make kernel breakpoints API truly generic

To define a kernel hardware breakpoint, one need to define the
address, type and length of the breakpoint using arch specific
operations and then register it using a core helper.

The first stage is truly not scalable with respect to the number of
archictures, because for each of them that support hardware
breakpoints, we would need a seperate specific field definition for
the breakpoint.

However, the supported breakpoint functionalities may be very different
between architectures.
Then this new API tries to compose with the following constraints:

- a given architecture may perhaps not support the triggering on one
of the usual memory access (read-write/read/write/execute)

- a given architecture may perhaps not support the ability to trigger
a breakpoint only on specific memory access size lower than the word
size for this arch.

- a given architecture may perhaps not support breakpoints on addresses
range.

The new API changes the following prototype for a kernel breakpoint
registration:

int register_kernel_hw_breakpoint(struct hw_breakpoint *bp)

into:

int register_kernel_hw_breakpoint(struct hw_breakpoint *bp,
unsigned long addr,
int len, enum breakpoint_type type)

The choice of passing the breakpoint settings as parameters of the
registration helper and not by adding generic fields into the breakpoint
structure is motivated by the need of a very specific per arch
representation of the breakpoint:

- the arch may only need an address, but could also need a couple for
breakpoints in ranges.
- the type is subject to arch interpretation (values of debug registers)
- the length too.

Then, to get back these values from a generic breakpoint structure that have
specific encodings into the arch fields, this API comes along with abstract
accessors which implementation is arch specific:

- type hw_breakpoint_type(bp)
- addr hw_breakpoint_addr(bp)

However, open debates come along this RFC patch:

- the address could be a generic field in struct hw_breakpoint. If we
are dealing with a range breakpoint, then we would just need to
compute addr + length to get the end of the range.

- the length and type could also be generic fields of
struct hw_breakpoint. It would then be up to the arch to get a
translation between such generic values and per arch needs.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: K.Prasad <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
arch/x86/include/asm/hw_breakpoint.h | 8 +++
arch/x86/kernel/hw_breakpoint.c | 62 +++++++++++++++++++
include/asm-generic/hw_breakpoint.h | 99 ++++++++++---------------------
kernel/hw_breakpoint.c | 12 +++-
kernel/trace/trace.h | 5 +-
kernel/trace/trace_ksym.c | 53 ++++++++--------
kernel/trace/trace_selftest.c | 2 +-
samples/hw_breakpoint/data_breakpoint.c | 6 +--
8 files changed, 142 insertions(+), 105 deletions(-)

diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index 1acb4d4..5f5f67d 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -43,6 +43,8 @@ extern unsigned int hbp_user_refcount[HBP_NUM];
extern void arch_install_thread_hw_breakpoint(struct task_struct *tsk);
extern void arch_uninstall_thread_hw_breakpoint(void);
extern int arch_check_va_in_userspace(unsigned long va, u8 hbp_len);
+extern int arch_fill_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
+ int len, enum breakpoint_type type);
extern int arch_validate_hwbkpt_settings(struct hw_breakpoint *bp,
struct task_struct *tsk);
extern void arch_update_user_hw_breakpoint(int pos, struct task_struct *tsk);
@@ -50,6 +52,12 @@ extern void arch_flush_thread_hw_breakpoint(struct task_struct *tsk);
extern void arch_update_kernel_hw_breakpoint(void *);
extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
unsigned long val, void *data);
+/*
+ * Typical accessors
+ */
+extern enum breakpoint_type hw_breakpoint_type(struct hw_breakpoint *bp);
+extern unsigned long hw_breakpoint_addr(struct hw_breakpoint *bp);
+
#endif /* __KERNEL__ */
#endif /* _I386_HW_BREAKPOINT_H */

diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 9316a9d..04c0661 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -68,6 +68,25 @@ static unsigned long encode_dr7(int drnum, unsigned int len, unsigned int type)
return bp_info;
}

+enum breakpoint_type hw_breakpoint_type(struct hw_breakpoint *bp)
+{
+ switch (bp->info.type) {
+ case HW_BREAKPOINT_EXECUTE:
+ return BREAK_X;
+ case HW_BREAKPOINT_WRITE:
+ return BREAK_W;
+ case HW_BREAKPOINT_RW:
+ return BREAK_RW;
+ default:
+ return -EINVAL;
+ }
+}
+
+unsigned long hw_breakpoint_addr(struct hw_breakpoint *bp)
+{
+ return bp->info.address;
+}
+
void arch_update_kernel_hw_breakpoint(void *unused)
{
struct hw_breakpoint *bp;
@@ -198,6 +217,49 @@ static int arch_store_info(struct hw_breakpoint *bp, struct task_struct *tsk)
return -EINVAL;
}

+int arch_fill_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
+ int len, enum breakpoint_type type)
+{
+ bp->info.address = addr;
+ switch (len) {
+ case 1:
+ bp->info.len = HW_BREAKPOINT_LEN_1;
+ break;
+ case 2:
+ bp->info.len = HW_BREAKPOINT_LEN_2;
+ break;
+ case 4:
+ bp->info.len = HW_BREAKPOINT_LEN_4;
+ break;
+#ifdef CONFIG_X86_64
+ case 8:
+ bp->info.len = HW_BREAKPOINT_LEN_8;
+ break;
+#else
+ return -ENOSYS;
+#endif
+ default:
+ return -EINVAL;
+ }
+
+ switch (type) {
+ case BREAK_W:
+ bp->info.type = HW_BREAKPOINT_WRITE;
+ break;
+ case BREAK_X:
+ bp->info.type = HW_BREAKPOINT_EXECUTE;
+ break;
+ case BREAK_RW:
+ bp->info.type = HW_BREAKPOINT_RW;
+ break;
+ case BREAK_R:
+ return -ENOSYS;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
/*
* Validate the arch-specific HW Breakpoint register settings
*/
diff --git a/include/asm-generic/hw_breakpoint.h b/include/asm-generic/hw_breakpoint.h
index 9bf2d12..ad860d7 100644
--- a/include/asm-generic/hw_breakpoint.h
+++ b/include/asm-generic/hw_breakpoint.h
@@ -10,6 +10,13 @@
#include <linux/types.h>
#include <linux/kallsyms.h>

+enum breakpoint_type {
+ BREAK_X,
+ BREAK_W,
+ BREAK_R,
+ BREAK_RW,
+};
+
/**
* struct hw_breakpoint - unified kernel/user-space hardware breakpoint
* @triggered: callback invoked after target address access
@@ -21,13 +28,11 @@
* target address can be located in either kernel space or user space.
*
* The breakpoint's address, length, and type are highly
- * architecture-specific. The values are encoded in the @info field; you
- * specify them when registering the breakpoint. To examine the encoded
- * values use hw_breakpoint_get_{kaddress,uaddress,len,type}(), declared
- * below.
+ * architecture-specific. The arch specific values are encoded in the @info
+ * field but are abstracted by the register helpers. To examine the encoded
+ * values, use hw_breakpoint_{type,address}(), declared in
+ * arch/ * /include/asm/hw_breakpoint.h
*
- * The address is specified as a regular kernel pointer (for kernel-space
- * breakponts) or as an %__user pointer (for user-space breakpoints).
* With register_user_hw_breakpoint(), the address must refer to a
* location in user space. The breakpoint will be active only while the
* requested task is running. Conversely with
@@ -35,23 +40,23 @@
* in kernel space, and the breakpoint will be active on all CPUs
* regardless of the current task.
*
+ * While calling a breakpoint register helper, you have to provide the type
+ * and the length of your breakpoint as parameters.
+ *
* The length is the breakpoint's extent in bytes, which is subject to
- * certain limitations. include/asm/hw_breakpoint.h contains macros
- * defining the available lengths for a specific architecture. Note that
- * the address's alignment must match the length. The breakpoint will
- * catch accesses to any byte in the range from address to address +
- * (length - 1).
+ * certain limitations. Note that the address's alignment must match the length.
+ * The breakpoint will catch accesses to any byte in the range from address to
+ * address + (length - 1).
*
* The breakpoint's type indicates the sort of access that will cause it
* to trigger. Possible values may include:
*
- * %HW_BREAKPOINT_RW (triggered on read or write access),
- * %HW_BREAKPOINT_WRITE (triggered on write access), and
- * %HW_BREAKPOINT_READ (triggered on read access).
+ * %BREAK_RW (triggered on read or write access).
+ * %BREAK_W (triggered on write access).
+ * %BREAK_R (triggered on read access).
+ * %BREAK_X (triggered on execute access).
*
- * Appropriate macros are defined in include/asm/hw_breakpoint.h; not all
- * possibilities are available on all architectures. Execute breakpoints
- * must have length equal to the special value %HW_BREAKPOINT_LEN_EXECUTE.
+ * Not all possibilities are available on all architectures.
*
* When a breakpoint gets hit, the @triggered callback is
* invoked in_interrupt with a pointer to the %hw_breakpoint structure and the
@@ -60,45 +65,6 @@
* Breakpoints are disabled during execution @triggered, to avoid
* recursive traps and allow unhindered access to breakpointed memory.
*
- * This sample code sets a breakpoint on pid_max and registers a callback
- * function for writes to that variable. Note that it is not portable
- * as written, because not all architectures support HW_BREAKPOINT_LEN_4.
- *
- * ----------------------------------------------------------------------
- *
- * #include <asm/hw_breakpoint.h>
- *
- * struct hw_breakpoint my_bp;
- *
- * static void my_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
- * {
- * printk(KERN_DEBUG "Inside triggered routine of breakpoint exception\n");
- * dump_stack();
- * .......<more debugging output>........
- * }
- *
- * static struct hw_breakpoint my_bp;
- *
- * static int init_module(void)
- * {
- * ..........<do anything>............
- * my_bp.info.type = HW_BREAKPOINT_WRITE;
- * my_bp.info.len = HW_BREAKPOINT_LEN_4;
- *
- * my_bp.installed = (void *)my_bp_installed;
- *
- * rc = register_kernel_hw_breakpoint(&my_bp);
- * ..........<do anything>............
- * }
- *
- * static void cleanup_module(void)
- * {
- * ..........<do anything>............
- * unregister_kernel_hw_breakpoint(&my_bp);
- * ..........<do anything>............
- * }
- *
- * ----------------------------------------------------------------------
*/
struct hw_breakpoint {
void (*triggered)(struct hw_breakpoint *, struct pt_regs *);
@@ -106,19 +72,14 @@ struct hw_breakpoint {
};

/*
- * len and type values are defined in include/asm/hw_breakpoint.h.
- * Available values vary according to the architecture. On i386 the
- * possibilities are:
+ * len and type supported values vary according to the architecture.
+ * On x86 the possibilities are:
*
- * HW_BREAKPOINT_LEN_1
- * HW_BREAKPOINT_LEN_2
- * HW_BREAKPOINT_LEN_4
- * HW_BREAKPOINT_RW
- * HW_BREAKPOINT_READ
+ * Len: 1, 2, 4
+ * Type: BREAK_RW, BREAK_W, BREAK_X
*
- * On other architectures HW_BREAKPOINT_LEN_8 may be available, and the
- * 1-, 2-, and 4-byte lengths may be unavailable. There also may be
- * HW_BREAKPOINT_WRITE. You can use #ifdef to check at compile time.
+ * On other architectures, an 8 length may be available, and the
+ * 1-, 2-, and 4-byte lengths may be unavailable.
*/

extern int register_user_hw_breakpoint(struct task_struct *tsk,
@@ -130,7 +91,9 @@ extern void unregister_user_hw_breakpoint(struct task_struct *tsk,
/*
* Kernel breakpoints are not associated with any particular thread.
*/
-extern int register_kernel_hw_breakpoint(struct hw_breakpoint *bp);
+extern int
+register_kernel_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
+ int len, enum breakpoint_type type);
extern void unregister_kernel_hw_breakpoint(struct hw_breakpoint *bp);

extern unsigned int hbp_kernel_pos;
diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
index c1f64e6..015fec6 100644
--- a/kernel/hw_breakpoint.c
+++ b/kernel/hw_breakpoint.c
@@ -297,15 +297,21 @@ EXPORT_SYMBOL_GPL(unregister_user_hw_breakpoint);
/**
* register_kernel_hw_breakpoint - register a hardware breakpoint for kernel space
* @bp: the breakpoint structure to register
- *
- * @bp.info->name or @bp.info->address, @bp.info->len, @bp.info->type and
+ * @addr: the address where we want to set the breakpoint
+ * @len: length of the value in memory to break in
+ * @type: the type of the breakpoint (read/write/execute)
* @bp->triggered must be set properly before invocation
*
*/
-int register_kernel_hw_breakpoint(struct hw_breakpoint *bp)
+int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
+ int len, enum breakpoint_type type)
{
int rc;

+ rc = arch_fill_hw_breakpoint(bp, addr, len, type);
+ if (rc)
+ return rc;
+
rc = arch_validate_hwbkpt_settings(bp, NULL);
if (rc)
return rc;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 06886a0..0253691 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -213,12 +213,13 @@ struct syscall_trace_exit {
};

#define KSYM_SELFTEST_ENTRY "ksym_selftest_dummy"
-extern int process_new_ksym_entry(char *ksymname, int op, unsigned long addr);
+extern int process_new_ksym_entry(char *ksymname, enum breakpoint_type type,
+ unsigned long addr);

struct ksym_trace_entry {
struct trace_entry ent;
unsigned long ip;
- unsigned char type;
+ enum breakpoint_type type;
char ksym_name[KSYM_NAME_LEN];
char cmd[TASK_COMM_LEN];
};
diff --git a/kernel/trace/trace_ksym.c b/kernel/trace/trace_ksym.c
index 1256a6e..7829a9b 100644
--- a/kernel/trace/trace_ksym.c
+++ b/kernel/trace/trace_ksym.c
@@ -95,12 +95,12 @@ void ksym_hbp_handler(struct hw_breakpoint *hbp, struct pt_regs *regs)

entry = ring_buffer_event_data(event);
entry->ip = instruction_pointer(regs);
- entry->type = hbp->info.type;
+ entry->type = hw_breakpoint_type(hbp);
strlcpy(entry->ksym_name, hbp->info.name, KSYM_SYMBOL_LEN);
strlcpy(entry->cmd, current->comm, TASK_COMM_LEN);

#ifdef CONFIG_PROFILE_KSYM_TRACER
- ksym_collect_stats(hbp->info.address);
+ ksym_collect_stats(hw_breakpoint_addr(hbp));
#endif /* CONFIG_PROFILE_KSYM_TRACER */

trace_buffer_unlock_commit(tr, event, 0, pc);
@@ -117,6 +117,7 @@ void ksym_hbp_handler(struct hw_breakpoint *hbp, struct pt_regs *regs)
static int ksym_trace_get_access_type(char *str)
{
int access = 0;
+ enum breakpoint_type type;

if (str[0] == 'r')
access += 4;
@@ -133,14 +134,16 @@ static int ksym_trace_get_access_type(char *str)

switch (access) {
case 6:
- access = HW_BREAKPOINT_RW;
+ type = BREAK_RW;
break;
case 2:
- access = HW_BREAKPOINT_WRITE;
+ type = BREAK_W;
break;
+ default:
+ type = -EINVAL;
}

- return access;
+ return type;
}

/*
@@ -176,7 +179,8 @@ static int parse_ksym_trace_str(char *input_string, char **ksymname,
return ret;
}

-int process_new_ksym_entry(char *ksymname, int op, unsigned long addr)
+int process_new_ksym_entry(char *ksymname, enum breakpoint_type type,
+ unsigned long addr)
{
struct trace_ksym *entry;
int ret = -ENOMEM;
@@ -200,14 +204,10 @@ int process_new_ksym_entry(char *ksymname, int op, unsigned long addr)
if (!entry->ksym_hbp->info.name)
goto err;

- entry->ksym_hbp->info.type = op;
- entry->ksym_addr = entry->ksym_hbp->info.address = addr;
-#ifdef CONFIG_X86
- entry->ksym_hbp->info.len = HW_BREAKPOINT_LEN_4;
-#endif
+ entry->ksym_addr = addr;
entry->ksym_hbp->triggered = (void *)ksym_hbp_handler;

- ret = register_kernel_hw_breakpoint(entry->ksym_hbp);
+ ret = register_kernel_hw_breakpoint(entry->ksym_hbp, addr, 4, type);
if (ret < 0) {
printk(KERN_INFO "ksym_tracer request failed. Try again"
" later!!\n");
@@ -243,9 +243,9 @@ static ssize_t ksym_trace_filter_read(struct file *filp, char __user *ubuf,

hlist_for_each_entry(entry, node, &ksym_filter_head, ksym_hlist) {
ret = trace_seq_printf(s, "%s:", entry->ksym_hbp->info.name);
- if (entry->ksym_hbp->info.type == HW_BREAKPOINT_WRITE)
+ if (hw_breakpoint_type(entry->ksym_hbp) == BREAK_W)
ret = trace_seq_puts(s, "-w-\n");
- else if (entry->ksym_hbp->info.type == HW_BREAKPOINT_RW)
+ else if (hw_breakpoint_type(entry->ksym_hbp) == BREAK_RW)
ret = trace_seq_puts(s, "rw-\n");
WARN_ON_ONCE(!ret);
}
@@ -267,7 +267,8 @@ static ssize_t ksym_trace_filter_write(struct file *file,
struct hlist_node *node;
char *input_string, *ksymname = NULL;
unsigned long ksym_addr = 0;
- int ret, op, changed = 0;
+ int ret, changed = 0;
+ enum breakpoint_type type;

input_string = kzalloc(count + 1, GFP_KERNEL);
if (!input_string)
@@ -279,7 +280,7 @@ static ssize_t ksym_trace_filter_write(struct file *file,
}
input_string[count] = '\0';

- ret = op = parse_ksym_trace_str(input_string, &ksymname, &ksym_addr);
+ ret = type = parse_ksym_trace_str(input_string, &ksymname, &ksym_addr);
if (ret < 0) {
kfree(input_string);
return ret;
@@ -291,7 +292,7 @@ static ssize_t ksym_trace_filter_write(struct file *file,
hlist_for_each_entry(entry, node, &ksym_filter_head, ksym_hlist) {
if (entry->ksym_addr == ksym_addr) {
/* Check for malformed request: (6) */
- if (entry->ksym_hbp->info.type != op)
+ if (hw_breakpoint_type(entry->ksym_hbp) != type)
changed = 1;
else
goto out;
@@ -300,9 +301,9 @@ static ssize_t ksym_trace_filter_write(struct file *file,
}
if (changed) {
unregister_kernel_hw_breakpoint(entry->ksym_hbp);
- entry->ksym_hbp->info.type = op;
- if (op > 0) {
- ret = register_kernel_hw_breakpoint(entry->ksym_hbp);
+ if (type >= 0) {
+ ret = register_kernel_hw_breakpoint(entry->ksym_hbp,
+ ksym_addr, 4, type);
if (ret == 0)
goto out;
}
@@ -315,9 +316,9 @@ static ssize_t ksym_trace_filter_write(struct file *file,
goto out;
} else {
/* Check for malformed request: (4) */
- if (op == 0)
+ if (type < 0)
goto out;
- ret = process_new_ksym_entry(ksymname, op, ksym_addr);
+ ret = process_new_ksym_entry(ksymname, type, ksym_addr);
}
out:
mutex_unlock(&ksym_tracer_mutex);
@@ -398,10 +399,10 @@ static enum print_line_t ksym_trace_output(struct trace_iterator *iter)
return TRACE_TYPE_PARTIAL_LINE;

switch (field->type) {
- case HW_BREAKPOINT_WRITE:
+ case BREAK_W:
ret = trace_seq_printf(s, " W ");
break;
- case HW_BREAKPOINT_RW:
+ case BREAK_RW:
ret = trace_seq_printf(s, " RW ");
break;
default:
@@ -464,13 +465,13 @@ static int ksym_tracer_stat_show(struct seq_file *m, void *v)
{
struct hlist_node *stat = v;
struct trace_ksym *entry;
- int access_type = 0;
+ enum breakpoint_type access_type = -1;
char fn_name[KSYM_NAME_LEN];

entry = hlist_entry(stat, struct trace_ksym, ksym_hlist);

if (entry->ksym_hbp)
- access_type = entry->ksym_hbp->info.type;
+ access_type = hw_breakpoint_type(entry->ksym_hbp);

switch (access_type) {
case HW_BREAKPOINT_WRITE:
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 71f2edb..01cfdd9 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -827,7 +827,7 @@ trace_selftest_startup_ksym(struct tracer *trace, struct trace_array *tr)

ksym_selftest_dummy = 0;
/* Register the read-write tracing request */
- ret = process_new_ksym_entry(KSYM_SELFTEST_ENTRY, HW_BREAKPOINT_RW,
+ ret = process_new_ksym_entry(KSYM_SELFTEST_ENTRY, BREAK_RW,
(unsigned long)(&ksym_selftest_dummy));

if (ret < 0) {
diff --git a/samples/hw_breakpoint/data_breakpoint.c b/samples/hw_breakpoint/data_breakpoint.c
index 9cbdbb8..9f0e210 100644
--- a/samples/hw_breakpoint/data_breakpoint.c
+++ b/samples/hw_breakpoint/data_breakpoint.c
@@ -51,14 +51,10 @@ static int __init hw_break_module_init(void)

#ifdef CONFIG_X86
sample_hbp.info.name = ksym_name;
- sample_hbp.info.type = HW_BREAKPOINT_WRITE;
- sample_hbp.info.len = HW_BREAKPOINT_LEN_4;
#endif /* CONFIG_X86 */
-
sample_hbp.triggered = (void *)sample_hbp_handler;

- ret = register_kernel_hw_breakpoint(&sample_hbp);
-
+ ret = register_kernel_hw_breakpoint(&sample_hbp, 0, 4, BREAK_W);
if (ret < 0) {
printk(KERN_INFO "Breakpoint registration failed\n");
return ret;
--
1.6.2.3

2009-07-20 17:08:29

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 3/5] hw-breakpoints: Make user breakpoints API truly generic

Following the API changes on the kernel breakpoints API,
the user breakpoints registration now follows the same pattern.

The target, access length and type and now given as parameters
in the registration helpers to avoid per arch code in generic code.

However, to keep registering or modifying easily a user breakpoint
from arch code with a prefilled arch breakpoint structure, we provide
two new helpers:

- register_user_hw_breakpoint_filled(tsk, bp)
- modify_user_hw_breakpoint_filled(tsk, bp)

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: K.Prasad <[email protected]>
Cc: Alan Stern <[email protected]>
---
arch/x86/kernel/ptrace.c | 21 ++++++-----
include/asm-generic/hw_breakpoint.h | 12 +++++-
kernel/hw_breakpoint.c | 65 +++++++++++++++++++++++++++++-----
3 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index cabdabc..ef0eb10 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -525,17 +525,18 @@ restore:
if (!bp) {
rc = -ENOMEM;
bp = kzalloc(sizeof(struct hw_breakpoint), GFP_KERNEL);
- if (bp) {
- bp->info.address = thread->debugreg[i];
- bp->triggered = ptrace_triggered;
- bp->info.len = len;
- bp->info.type = type;
- rc = register_user_hw_breakpoint(tsk, bp);
- if (rc)
- kfree(bp);
- }
+ if (!bp)
+ break;
+
+ bp->info.address = thread->debugreg[i];
+ bp->triggered = ptrace_triggered;
+ bp->info.len = len;
+ bp->info.type = type;
+ rc = register_user_hw_breakpoint_filled(tsk, bp);
+ if (rc)
+ kfree(bp);
} else
- rc = modify_user_hw_breakpoint(tsk, bp);
+ rc = modify_user_hw_breakpoint_filled(tsk, bp);
if (rc)
break;
}
diff --git a/include/asm-generic/hw_breakpoint.h b/include/asm-generic/hw_breakpoint.h
index 300fe4c..598e3c4 100644
--- a/include/asm-generic/hw_breakpoint.h
+++ b/include/asm-generic/hw_breakpoint.h
@@ -83,10 +83,18 @@ struct hw_breakpoint {
* 1-, 2-, and 4-byte lengths may be unavailable.
*/

+extern int register_user_hw_breakpoint_filled(struct task_struct *tsk,
+ struct hw_breakpoint *bp);
extern int register_user_hw_breakpoint(struct task_struct *tsk,
- struct hw_breakpoint *bp);
+ struct hw_breakpoint *bp,
+ unsigned long addr, int len,
+ enum breakpoint_type type);
+extern int modify_user_hw_breakpoint_filled(struct task_struct *tsk,
+ struct hw_breakpoint *bp);
extern int modify_user_hw_breakpoint(struct task_struct *tsk,
- struct hw_breakpoint *bp);
+ struct hw_breakpoint *bp,
+ unsigned long addr,
+ int len, enum breakpoint_type type);
extern void unregister_user_hw_breakpoint(struct task_struct *tsk,
struct hw_breakpoint *bp);
/*
diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
index 0301245..f9e62e7 100644
--- a/kernel/hw_breakpoint.c
+++ b/kernel/hw_breakpoint.c
@@ -211,16 +211,14 @@ static void __unregister_user_hw_breakpoint(int pos, struct task_struct *tsk)
}

/**
- * register_user_hw_breakpoint - register a hardware breakpoint for user space
+ * register_user_hw_breakpoint_filled - register a filled hardware breakpoint
+ * for user space.
* @tsk: pointer to 'task_struct' of the process to which the address belongs
* @bp: the breakpoint structure to register
- *
- * @bp.info->name or @bp.info->address, @bp.info->len, @bp.info->type and
* @bp->triggered must be set properly before invocation
- *
*/
-int register_user_hw_breakpoint(struct task_struct *tsk,
- struct hw_breakpoint *bp)
+int register_user_hw_breakpoint_filled(struct task_struct *tsk,
+ struct hw_breakpoint *bp)
{
struct thread_struct *thread = &(tsk->thread);
int i, rc = -ENOSPC;
@@ -246,15 +244,40 @@ int register_user_hw_breakpoint(struct task_struct *tsk,
spin_unlock_bh(&hw_breakpoint_lock);
return rc;
}
+EXPORT_SYMBOL_GPL(register_user_hw_breakpoint_filled);
+
+/**
+ * register_user_hw_breakpoint - register a filled hardware breakpoint
+ * for user space.
+ * @tsk: pointer to 'task_struct' of the process to which the address belongs
+ * @bp: the breakpoint structure to register
+ * @addr: target of the breakpoint
+ * @len: length of the memory target access
+ * @type: type of the breakpoint (read-write/read/write/execute)
+ * @bp->triggered must be set properly before invocation
+ */
+int register_user_hw_breakpoint(struct task_struct *tsk,
+ struct hw_breakpoint *bp,
+ unsigned long addr,
+ int len, enum breakpoint_type type)
+{
+ int ret;
+
+ ret = arch_fill_hw_breakpoint(bp, addr, len, type);
+ if (ret)
+ return ret;
+
+ return register_user_hw_breakpoint_filled(tsk, bp);
+}
EXPORT_SYMBOL_GPL(register_user_hw_breakpoint);

/**
- * modify_user_hw_breakpoint - modify a user-space hardware breakpoint
+ * modify_user_hw_breakpoint_filled - modify a filled user-space hardware breakpoint
* @tsk: pointer to 'task_struct' of the process to which the address belongs
- * @bp: the breakpoint structure to unregister
- *
+ * @bp: the breakpoint structure to update
*/
-int modify_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp)
+int modify_user_hw_breakpoint_filled(struct task_struct *tsk,
+ struct hw_breakpoint *bp)
{
struct thread_struct *thread = &(tsk->thread);
int i, ret = -ENOENT;
@@ -269,6 +292,28 @@ int modify_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp)
spin_unlock_bh(&hw_breakpoint_lock);
return ret;
}
+EXPORT_SYMBOL_GPL(modify_user_hw_breakpoint_filled);
+
+/**
+ * modify_user_hw_breakpoint - modify a user-space hardware breakpoint
+ * @tsk: pointer to 'task_struct' of the process to which the address belongs
+ * @bp: the breakpoint structure to update
+ * @addr: target of the breakpoint
+ * @len: length of the memory target access
+ * @type: type of the breakpoint (read-write/read/write/execute)
+ */
+int modify_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp,
+ unsigned long addr, int len,
+ enum breakpoint_type type)
+{
+ int ret;
+
+ ret = arch_fill_hw_breakpoint(bp, addr, len, type);
+ if (ret)
+ return ret;
+
+ return register_user_hw_breakpoint_filled(tsk, bp);
+}
EXPORT_SYMBOL_GPL(modify_user_hw_breakpoint);

/**
--
1.6.2.3

2009-07-20 17:08:40

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 4/5] perfcounter: Grow the event number to 64 bits

The event number are defined by the config attr of a counter. When it
is passed while adding a hit, it is compared against the initial
config attr, a 64 bits number, to retrieve the targetted counter.

But when we want to add a hit, this event is passed through 32 bits
parameters, which makes loosing the real event number if its storage
needs more (case of an address).

This patch changes the event parameter in some of the helpers which
add the event hits.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Anton Blanchard <[email protected]>
---
kernel/perf_counter.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 5498890..33ffb5a 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -3435,7 +3435,7 @@ static int perf_swcounter_is_counting(struct perf_counter *counter)

static int perf_swcounter_match(struct perf_counter *counter,
enum perf_type_id type,
- u32 event, struct pt_regs *regs)
+ u64 event, struct pt_regs *regs)
{
if (!perf_swcounter_is_counting(counter))
return 0;
@@ -3467,7 +3467,7 @@ static void perf_swcounter_add(struct perf_counter *counter, u64 nr,

static void perf_swcounter_ctx_event(struct perf_counter_context *ctx,
enum perf_type_id type,
- u32 event, u64 nr, int nmi,
+ u64 event, u64 nr, int nmi,
struct perf_sample_data *data)
{
struct perf_counter *counter;
@@ -3497,7 +3497,7 @@ static int *perf_swcounter_recursion_context(struct perf_cpu_context *cpuctx)
return &cpuctx->recursion[0];
}

-static void do_perf_swcounter_event(enum perf_type_id type, u32 event,
+static void do_perf_swcounter_event(enum perf_type_id type, u64 event,
u64 nr, int nmi,
struct perf_sample_data *data)
{
--
1.6.2.3

2009-07-20 17:14:01

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 2/5] hw-breakpoints: Pull up the target symbol in a generic field

The target symbol of a breakpoint doesn't need to be arch specific.
It can be resolved back from the core, then this patch pulls
up this job to the core, avoiding the need to implement that from
arch.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: K.Prasad <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
arch/x86/include/asm/hw_breakpoint.h | 1 -
arch/x86/kernel/hw_breakpoint.c | 35 ++++--------------------------
include/asm-generic/hw_breakpoint.h | 1 +
kernel/hw_breakpoint.c | 17 +++++++++++++++
kernel/trace/trace_ksym.c | 14 ++++++------
samples/hw_breakpoint/data_breakpoint.c | 4 +--
6 files changed, 31 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index 5f5f67d..a51c5ff 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -5,7 +5,6 @@
#define __ARCH_HW_BREAKPOINT_H

struct arch_hw_breakpoint {
- char *name; /* Contains name of the symbol to set bkpt */
unsigned long address;
u8 len;
u8 type;
diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 04c0661..65754c9 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -195,28 +195,9 @@ static int arch_check_va_in_kernelspace(unsigned long va, u8 hbp_len)
}

/*
- * Store a breakpoint's encoded address, length, and type.
+ * Pick breakpoint's generic address, length, and type and convert them
+ * into x86 specific debug register values.
*/
-static int arch_store_info(struct hw_breakpoint *bp, struct task_struct *tsk)
-{
- /*
- * User-space requests will always have the address field populated
- * Symbol names from user-space are rejected
- */
- if (tsk && bp->info.name)
- return -EINVAL;
- /*
- * For kernel-addresses, either the address or symbol name can be
- * specified.
- */
- if (bp->info.name)
- bp->info.address = (unsigned long)
- kallsyms_lookup_name(bp->info.name);
- if (bp->info.address)
- return 0;
- return -EINVAL;
-}
-
int arch_fill_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
int len, enum breakpoint_type type)
{
@@ -267,7 +248,6 @@ int arch_validate_hwbkpt_settings(struct hw_breakpoint *bp,
struct task_struct *tsk)
{
unsigned int align;
- int ret = -EINVAL;

switch (bp->info.type) {
/*
@@ -279,14 +259,14 @@ int arch_validate_hwbkpt_settings(struct hw_breakpoint *bp,
if ((!arch_check_va_in_userspace(bp->info.address,
bp->info.len)) &&
bp->info.len != HW_BREAKPOINT_LEN_EXECUTE)
- return ret;
+ return -EINVAL;
break;
case HW_BREAKPOINT_WRITE:
break;
case HW_BREAKPOINT_RW:
break;
default:
- return ret;
+ return -EINVAL;
}

switch (bp->info.len) {
@@ -305,14 +285,9 @@ int arch_validate_hwbkpt_settings(struct hw_breakpoint *bp,
break;
#endif
default:
- return ret;
+ return -EINVAL;
}

- if (bp->triggered)
- ret = arch_store_info(bp, tsk);
-
- if (ret < 0)
- return ret;
/*
* Check that the low-order bits of the address are appropriate
* for the alignment implied by len.
diff --git a/include/asm-generic/hw_breakpoint.h b/include/asm-generic/hw_breakpoint.h
index ad860d7..300fe4c 100644
--- a/include/asm-generic/hw_breakpoint.h
+++ b/include/asm-generic/hw_breakpoint.h
@@ -67,6 +67,7 @@ enum breakpoint_type {
*
*/
struct hw_breakpoint {
+ char *name; /* Contains name of the symbol to set breakpoint */
void (*triggered)(struct hw_breakpoint *, struct pt_regs *);
struct arch_hw_breakpoint info;
};
diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
index 015fec6..0301245 100644
--- a/kernel/hw_breakpoint.c
+++ b/kernel/hw_breakpoint.c
@@ -225,6 +225,13 @@ int register_user_hw_breakpoint(struct task_struct *tsk,
struct thread_struct *thread = &(tsk->thread);
int i, rc = -ENOSPC;

+ /*
+ * User-space requests will always have the address field populated
+ * Symbol names from user-space are rejected
+ */
+ if (bp->name || !bp->triggered)
+ return -EINVAL;
+
spin_lock_bh(&hw_breakpoint_lock);

for (i = 0; i < hbp_kernel_pos; i++) {
@@ -308,6 +315,16 @@ int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
{
int rc;

+ /*
+ * For kernel-addresses, either the address or symbol name can be
+ * specified.
+ */
+ if (bp->name)
+ addr = (unsigned long) kallsyms_lookup_name(bp->name);
+
+ if (!addr || !bp->triggered)
+ return -EINVAL;
+
rc = arch_fill_hw_breakpoint(bp, addr, len, type);
if (rc)
return rc;
diff --git a/kernel/trace/trace_ksym.c b/kernel/trace/trace_ksym.c
index 7829a9b..9bd2845 100644
--- a/kernel/trace/trace_ksym.c
+++ b/kernel/trace/trace_ksym.c
@@ -96,7 +96,7 @@ void ksym_hbp_handler(struct hw_breakpoint *hbp, struct pt_regs *regs)
entry = ring_buffer_event_data(event);
entry->ip = instruction_pointer(regs);
entry->type = hw_breakpoint_type(hbp);
- strlcpy(entry->ksym_name, hbp->info.name, KSYM_SYMBOL_LEN);
+ strlcpy(entry->ksym_name, hbp->name, KSYM_SYMBOL_LEN);
strlcpy(entry->cmd, current->comm, TASK_COMM_LEN);

#ifdef CONFIG_PROFILE_KSYM_TRACER
@@ -200,8 +200,8 @@ int process_new_ksym_entry(char *ksymname, enum breakpoint_type type,
if (!entry->ksym_hbp)
goto err;

- entry->ksym_hbp->info.name = kstrdup(ksymname, GFP_KERNEL);
- if (!entry->ksym_hbp->info.name)
+ entry->ksym_hbp->name = kstrdup(ksymname, GFP_KERNEL);
+ if (!entry->ksym_hbp->name)
goto err;

entry->ksym_addr = addr;
@@ -219,7 +219,7 @@ int process_new_ksym_entry(char *ksymname, enum breakpoint_type type,
return 0;
err:
if (entry->ksym_hbp)
- kfree(entry->ksym_hbp->info.name);
+ kfree(entry->ksym_hbp->name);
kfree(entry->ksym_hbp);
kfree(entry);
return ret;
@@ -242,7 +242,7 @@ static ssize_t ksym_trace_filter_read(struct file *filp, char __user *ubuf,
mutex_lock(&ksym_tracer_mutex);

hlist_for_each_entry(entry, node, &ksym_filter_head, ksym_hlist) {
- ret = trace_seq_printf(s, "%s:", entry->ksym_hbp->info.name);
+ ret = trace_seq_printf(s, "%s:", entry->ksym_hbp->name);
if (hw_breakpoint_type(entry->ksym_hbp) == BREAK_W)
ret = trace_seq_puts(s, "-w-\n");
else if (hw_breakpoint_type(entry->ksym_hbp) == BREAK_RW)
@@ -310,7 +310,7 @@ static ssize_t ksym_trace_filter_write(struct file *file,
ksym_filter_entry_count--;
hlist_del_rcu(&(entry->ksym_hlist));
synchronize_rcu();
- kfree(entry->ksym_hbp->info.name);
+ kfree(entry->ksym_hbp->name);
kfree(entry->ksym_hbp);
kfree(entry);
goto out;
@@ -350,7 +350,7 @@ static void ksym_trace_reset(struct trace_array *tr)
ksym_filter_entry_count--;
hlist_del_rcu(&(entry->ksym_hlist));
synchronize_rcu();
- kfree(entry->ksym_hbp->info.name);
+ kfree(entry->ksym_hbp->name);
kfree(entry->ksym_hbp);
kfree(entry);
}
diff --git a/samples/hw_breakpoint/data_breakpoint.c b/samples/hw_breakpoint/data_breakpoint.c
index 9f0e210..3901e98 100644
--- a/samples/hw_breakpoint/data_breakpoint.c
+++ b/samples/hw_breakpoint/data_breakpoint.c
@@ -49,9 +49,7 @@ static int __init hw_break_module_init(void)
{
int ret;

-#ifdef CONFIG_X86
- sample_hbp.info.name = ksym_name;
-#endif /* CONFIG_X86 */
+ sample_hbp.name = ksym_name;
sample_hbp.triggered = (void *)sample_hbp_handler;

ret = register_kernel_hw_breakpoint(&sample_hbp, 0, 4, BREAK_W);
--
1.6.2.3

2009-07-20 17:14:46

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

This adds the support for kernel hardware breakpoints in perfcounter.
It is added as a new type of software counter and can be defined by
using the counter number 5 and by passsing the address of the
breakpoint to set through the config attribute.

Example which traces the BKL accesses by stressing reiserfs:

bkl=0x$(cat ../../System.map | grep kernel_flag | cut -d" " -f 1)
./perf record -f -g -e 5:$bkl -- dbench -t 20 20
./perf report -g -s s
# Samples: 36
#
# Overhead Symbol
# ........ ......
#
83.33% [k] lock_kernel
|
|--51.72%-- __pwrite64
|
|--27.59%-- 0x7f9d83a17b15
|
|--13.79%-- __open
|
--10.34%-- unlink

16.67% [k] unlock_kernel
|
|--80.00%-- 0x7f9d83a17b15
|
|--20.00%-- unlink
|
--20.00%-- __pwrite64

For now you can only pass raw kernel addresses.
What is planned next:

- profile by symbol names
- profile with user breakpoints

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: K.Prasad <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
include/asm-generic/hw_breakpoint.h | 4 ++
include/linux/perf_counter.h | 1 +
kernel/hw_breakpoint.c | 79 +++++++++++++++++++++++++++++++++++
kernel/perf_counter.c | 35 +++++++++++++++
4 files changed, 119 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/hw_breakpoint.h b/include/asm-generic/hw_breakpoint.h
index 598e3c4..82657bb 100644
--- a/include/asm-generic/hw_breakpoint.h
+++ b/include/asm-generic/hw_breakpoint.h
@@ -105,6 +105,10 @@ register_kernel_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
int len, enum breakpoint_type type);
extern void unregister_kernel_hw_breakpoint(struct hw_breakpoint *bp);

+#ifdef CONFIG_PERF_COUNTERS
+extern void bp_perf_triggered(struct hw_breakpoint *bp, struct pt_regs *regs);
+#endif
+
extern unsigned int hbp_kernel_pos;

#endif /* __KERNEL__ */
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 5e970c7..c97df54 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -31,6 +31,7 @@ enum perf_type_id {
PERF_TYPE_TRACEPOINT = 2,
PERF_TYPE_HW_CACHE = 3,
PERF_TYPE_RAW = 4,
+ PERF_TYPE_BREAKPOINT = 5,

PERF_TYPE_MAX, /* non-ABI */
};
diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
index f9e62e7..c22c29d 100644
--- a/kernel/hw_breakpoint.c
+++ b/kernel/hw_breakpoint.c
@@ -72,6 +72,85 @@ unsigned int hbp_kernel_pos = HBP_NUM;
*/
unsigned int hbp_user_refcount[HBP_NUM];

+struct bp_perf_event {
+ struct hw_breakpoint *bp;
+ int count;
+ struct list_head list;
+};
+
+#ifdef CONFIG_PERF_COUNTERS
+
+static LIST_HEAD(breakpoint_perf_events);
+static DEFINE_MUTEX(breakpoint_perf_lock);
+
+
+int hw_breakpoint_perf_init(unsigned long addr)
+{
+ struct bp_perf_event *event;
+ int ret;
+
+ mutex_lock(&breakpoint_perf_lock);
+
+ list_for_each_entry(event, &breakpoint_perf_events, list) {
+ if (hw_breakpoint_addr(event->bp) == addr) {
+ event->count++;
+ goto found;
+ }
+ }
+
+ mutex_unlock(&breakpoint_perf_lock);
+
+ event = kzalloc(sizeof(*event), GFP_KERNEL);
+ if (!event)
+ return -ENOMEM;
+
+ event->bp = kzalloc(sizeof(*event->bp), GFP_KERNEL);
+ if (!event->bp) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ event->bp->triggered = bp_perf_triggered;
+ ret = register_kernel_hw_breakpoint(event->bp, addr, 1,
+ BREAK_RW);
+ if (ret)
+ goto fail;
+
+ mutex_lock(&breakpoint_perf_lock);
+ list_add_tail(&event->list, &breakpoint_perf_events);
+
+found:
+ mutex_unlock(&breakpoint_perf_lock);
+ return 0;
+
+fail:
+ kfree(event->bp);
+ kfree(event);
+ return ret;
+}
+
+void hw_breakpoint_perf_exit(unsigned long addr)
+{
+ struct bp_perf_event *event;
+
+ mutex_lock(&breakpoint_perf_lock);
+
+ list_for_each_entry(event, &breakpoint_perf_events, list) {
+ if (hw_breakpoint_addr(event->bp) == addr) {
+ if (--event->count)
+ break;
+
+ list_del(&event->list);
+ unregister_kernel_hw_breakpoint(event->bp);
+ kfree(event->bp);
+ kfree(event);
+ break;
+ }
+ }
+ mutex_unlock(&breakpoint_perf_lock);
+}
+#endif
+
/*
* Load the debug registers during startup of a CPU.
*/
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 33ffb5a..8c70723 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -29,6 +29,7 @@
#include <linux/perf_counter.h>

#include <asm/irq_regs.h>
+#include <asm/hw_breakpoint.h>

/*
* Each CPU has a list of per CPU counters:
@@ -3718,6 +3719,36 @@ static const struct pmu *tp_perf_counter_init(struct perf_counter *counter)
}
#endif

+extern int hw_breakpoint_perf_init(unsigned long addr);
+extern void hw_breakpoint_perf_exit(unsigned long addr);
+
+void bp_perf_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
+{
+ unsigned long target = hw_breakpoint_addr(bp);
+
+ struct perf_sample_data data = {
+ .regs = regs,
+ .addr = instruction_pointer(regs),
+ };
+
+ do_perf_swcounter_event(PERF_TYPE_BREAKPOINT, target, 1, 0, &data);
+}
+
+static void bp_perf_counter_destroy(struct perf_counter *counter)
+{
+ hw_breakpoint_perf_exit(counter->attr.config);
+}
+
+static const struct pmu *bp_perf_counter_init(struct perf_counter *counter)
+{
+ if (hw_breakpoint_perf_init((unsigned long)counter->attr.config))
+ return NULL;
+
+ counter->destroy = bp_perf_counter_destroy;
+
+ return &perf_ops_generic;
+}
+
atomic_t perf_swcounter_enabled[PERF_COUNT_SW_MAX];

static void sw_perf_counter_destroy(struct perf_counter *counter)
@@ -3857,6 +3888,10 @@ perf_counter_alloc(struct perf_counter_attr *attr,
pmu = tp_perf_counter_init(counter);
break;

+ case PERF_TYPE_BREAKPOINT:
+ pmu = bp_perf_counter_init(counter);
+ break;
+
default:
break;
}
--
1.6.2.3

2009-07-20 17:28:04

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] hw-breakpoints: Make kernel breakpoints API truly generic

* Frederic Weisbecker ([email protected]) wrote:
[...]
> diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
> index c1f64e6..015fec6 100644
> --- a/kernel/hw_breakpoint.c
> +++ b/kernel/hw_breakpoint.c
> @@ -297,15 +297,21 @@ EXPORT_SYMBOL_GPL(unregister_user_hw_breakpoint);
> /**
> * register_kernel_hw_breakpoint - register a hardware breakpoint for kernel space
> * @bp: the breakpoint structure to register
> - *
> - * @bp.info->name or @bp.info->address, @bp.info->len, @bp.info->type and
> + * @addr: the address where we want to set the breakpoint
> + * @len: length of the value in memory to break in
> + * @type: the type of the breakpoint (read/write/execute)
> * @bp->triggered must be set properly before invocation

Hi Frederic,

I think one of the great addition in this patchset is to allow using
breakpoints from arch-agnostic code.

It becomes important to document the error values which can be returned
by register_kernel_hw_breakpoint, so this will serve as guidelines for
architecture-specific arch_fill_hw_breakpoint() implementation. This
will become increasingly important, as this abstraction layer will
basically be responsible for either:

- Finding the best support the architecture can provide for a given hw
breakpoint.
- Failing with an explicit error value telling the in-kernel user why it
failed (e.g. if it must use a fallback, or return the error to the
user).

Maybe we should think of a more flexible breakpoint type mapping too,
e.g.:

monitor _strictly_ execute operation on address 0x...
-> would fail if the architecture does not support execution access
monitoring
monitor (at least) execute operations on address 0x...
-> would be allowed to use a more general monitor (e.g. RWX) if the
architecture does not support "execute only" monitor.

(same for read and write)

Mathieu

> *
> */
> -int register_kernel_hw_breakpoint(struct hw_breakpoint *bp)
> +int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
> + int len, enum breakpoint_type type)
> {
> int rc;
>
> + rc = arch_fill_hw_breakpoint(bp, addr, len, type);
> + if (rc)
> + return rc;
> +
> rc = arch_validate_hwbkpt_settings(bp, NULL);
> if (rc)
> return rc;
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-07-20 17:38:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> This adds the support for kernel hardware breakpoints in perfcounter.
> It is added as a new type of software counter and can be defined by
> using the counter number 5 and by passsing the address of the
> breakpoint to set through the config attribute.


> +void bp_perf_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
> +{
> + unsigned long target = hw_breakpoint_addr(bp);
> +
> + struct perf_sample_data data = {
> + .regs = regs,
> + .addr = instruction_pointer(regs),
> + };
> +
> + do_perf_swcounter_event(PERF_TYPE_BREAKPOINT, target, 1, 0, &data);
> +}

.addr would be an associated data address, like for pagefaults and
cache-misses have, its not the RIP of the faulting ins.

2009-07-20 17:38:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> This adds the support for kernel hardware breakpoints in perfcounter.
> It is added as a new type of software counter and can be defined by
> using the counter number 5 and by passsing the address of the
> breakpoint to set through the config attribute.

This seems to be IP based breakpoints. Are there plans for data based
breakpoints as well? In that case we might want to think about the
namespace issue, we cannot both call them breakpoint/bp etc.. ;-)

2009-07-20 17:38:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> + do_perf_swcounter_event(PERF_TYPE_BREAKPOINT, target, 1, 0, &data);

These breakpoints run from IRQ context?

2009-07-20 21:22:04

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

2009/7/20, Peter Zijlstra <[email protected]>:
> On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
>> This adds the support for kernel hardware breakpoints in perfcounter.
>> It is added as a new type of software counter and can be defined by
>> using the counter number 5 and by passsing the address of the
>> breakpoint to set through the config attribute.
>
> This seems to be IP based breakpoints. Are there plans for data based
> breakpoints as well? In that case we might want to think about the
> namespace issue, we cannot both call them breakpoint/bp etc.. ;-)


Nop, by default these breakpoints trigger on READ/WRITE accesses, it's
meant for data.
The example in the changelog profiles the bkl accesses, not by tracing
lock_kernel() or so...but by tracing the kernel_flag spinlock itself.

So it's the opposite, we may start thinking about naming issues
against possible future plans
for IP breakpoint :-)

But actually for the latter case, I would suggest Kprobe...

Damn, I forgot Masami in the Cc list...

2009-07-21 07:11:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

2009/7/20, Peter Zijlstra <[email protected]>:
> On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
>> This adds the support for kernel hardware breakpoints in perfcounter.
>> It is added as a new type of software counter and can be defined by
>> using the counter number 5 and by passsing the address of the
>> breakpoint to set through the config attribute.
>
>
>> +void bp_perf_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
>> +{
>> + unsigned long target = hw_breakpoint_addr(bp);
>> +
>> + struct perf_sample_data data = {
>> + .regs = regs,
>> + .addr = instruction_pointer(regs),
>> + };
>> +
>> + do_perf_swcounter_event(PERF_TYPE_BREAKPOINT, target, 1, 0, &data);
>> +}
>
> .addr would be an associated data address, like for pagefaults and
> cache-misses have, its not the RIP of the faulting ins.

Ah ok, then I'll set the actual data memory address that is breakpointed.

Thanks.

2009-07-21 07:19:54

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

2009/7/20, Peter Zijlstra <[email protected]>:
> On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
>> + do_perf_swcounter_event(PERF_TYPE_BREAKPOINT, target, 1, 0,
>> &data);
>
> These breakpoints run from IRQ context?

Hmm, I guess but I'm not sure exactly :-/

2009-07-21 11:15:26

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] hw-breakpoints: Make kernel breakpoints API truly generic

On Mon, Jul 20, 2009 at 01:08:03PM -0400, Frederic Weisbecker wrote:
> To define a kernel hardware breakpoint, one need to define the
> address, type and length of the breakpoint using arch specific
> operations and then register it using a core helper.
>
> The first stage is truly not scalable with respect to the number of
> archictures, because for each of them that support hardware
> breakpoints, we would need a seperate specific field definition for
> the breakpoint.
>
> However, the supported breakpoint functionalities may be very different
> between architectures.
> Then this new API tries to compose with the following constraints:
>
> - a given architecture may perhaps not support the triggering on one
> of the usual memory access (read-write/read/write/execute)
>
> - a given architecture may perhaps not support the ability to trigger
> a breakpoint only on specific memory access size lower than the word
> size for this arch.
>
> - a given architecture may perhaps not support breakpoints on addresses
> range.
>
> The new API changes the following prototype for a kernel breakpoint
> registration:
>
> int register_kernel_hw_breakpoint(struct hw_breakpoint *bp)
>
> into:
>
> int register_kernel_hw_breakpoint(struct hw_breakpoint *bp,
> unsigned long addr,
> int len, enum breakpoint_type type)

It is not clear how adding these new parameters to the interface would
help it become generic, as opposed to moving them to 'struct
hw_breakpoint'.

It would make the usage cumbersome of some architectures - say for
instance the PPC64 which always has a breakpoint length of 8 bytes. So
the user needs to specify either '8' always or '0' to indicate variable
length not supported (but it is counter-intuitive..may be interpreted as
zero-length).

>
> The choice of passing the breakpoint settings as parameters of the
> registration helper and not by adding generic fields into the breakpoint
> structure is motivated by the need of a very specific per arch
> representation of the breakpoint:
>
> - the arch may only need an address, but could also need a couple for
> breakpoints in ranges.
> - the type is subject to arch interpretation (values of debug registers)
> - the length too.
>
> Then, to get back these values from a generic breakpoint structure that have
> specific encodings into the arch fields, this API comes along with abstract
> accessors which implementation is arch specific:
>
> - type hw_breakpoint_type(bp)
> - addr hw_breakpoint_addr(bp)
>
> However, open debates come along this RFC patch:
>
> - the address could be a generic field in struct hw_breakpoint. If we
> are dealing with a range breakpoint, then we would just need to
> compute addr + length to get the end of the range.
>
> - the length and type could also be generic fields of
> struct hw_breakpoint. It would then be up to the arch to get a
> translation between such generic values and per arch needs.
>

While the issues have been enumerated above, the patchset only pushes
the issue into a different domain i.e. make the user determine if a
breakpoint type or len is supported in a given architecture vs the existing
implementation in which the user determines if a constant pertaining to
a given len/type is defined. But the accessor-routines
hw_breakpoint_type() and hw_breakpoint_addr() make it much easier to use
and is a good addition.

To make the usage much easier, I would see a combination of the
following:

- Define constants/enums for length and type that are common to all
architectures.
- Define accessor routines that help determine if a given type/len is
supported on the host processor.
- Move fields such as address, len and type to generic breakpoint
structure (if it still matters despite the two changes above).

Let me know what you think.

Thanks,
K.Prasad

2009-07-23 13:06:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> This adds the support for kernel hardware breakpoints in perfcounter.
> It is added as a new type of software counter and can be defined by
> using the counter number 5 and by passsing the address of the
> breakpoint to set through the config attribute.

Is there a limit to these hardware breakpoints? If so, the software
counter model is not sufficient, since we assume we can always schedule
all software counters. However if you were to add more counters than you
have hardware breakpoints you're hosed.

2009-07-23 17:45:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Thu, 2009-07-23 at 15:08 +0200, Peter Zijlstra wrote:
> On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> > This adds the support for kernel hardware breakpoints in perfcounter.
> > It is added as a new type of software counter and can be defined by
> > using the counter number 5 and by passsing the address of the
> > breakpoint to set through the config attribute.
>
> Is there a limit to these hardware breakpoints? If so, the software
> counter model is not sufficient, since we assume we can always schedule
> all software counters. However if you were to add more counters than you
> have hardware breakpoints you're hosed.

Do the breakpoints work on virtual or physical addresses?

Also, do we have an 64bit architecture with split address-spaces?


2009-07-23 19:56:11

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Thu, 23 Jul 2009, Peter Zijlstra wrote:

> > Is there a limit to these hardware breakpoints? If so, the software

Certainly there's a limit. They require hardware resources, after all.

> > counter model is not sufficient, since we assume we can always schedule
> > all software counters. However if you were to add more counters than you
> > have hardware breakpoints you're hosed.
>
> Do the breakpoints work on virtual or physical addresses?

On x86, hardware breakpoints take virtual addresses. I don't know
about other architectures.

Alan Stern

2009-07-24 14:03:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

2009/7/23 Peter Zijlstra <[email protected]>:
> On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
>> This adds the support for kernel hardware breakpoints in perfcounter.
>> It is added as a new type of software counter and can be defined by
>> using the counter number 5 and by passsing the address of the
>> breakpoint to set through the config attribute.
>
> Is there a limit to these hardware breakpoints? If so, the software
> counter model is not sufficient, since we assume we can always schedule
> all software counters. However if you were to add more counters than you
> have hardware breakpoints you're hosed.
>
>

Hmm, indeed. But this patch handles this case:

+static const struct pmu *bp_perf_counter_init(struct perf_counter *counter)
+{
+ if (hw_breakpoint_perf_init((unsigned long)counter->attr.config))
+ return NULL;
+

IIRC, hw_breakpoint_perf_init() calls register_kernel_breakpoint() which in turn
returns -ENOSPC if we haven't any breakpoint room left.

It seems we can only set 4 breakpoints simultaneously in x86, or
something close to that.

2009-07-24 14:24:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Fri, 2009-07-24 at 16:02 +0200, Frédéric Weisbecker wrote:
> 2009/7/23 Peter Zijlstra <[email protected]>:
> > On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> >> This adds the support for kernel hardware breakpoints in perfcounter.
> >> It is added as a new type of software counter and can be defined by
> >> using the counter number 5 and by passsing the address of the
> >> breakpoint to set through the config attribute.
> >
> > Is there a limit to these hardware breakpoints? If so, the software
> > counter model is not sufficient, since we assume we can always schedule
> > all software counters. However if you were to add more counters than you
> > have hardware breakpoints you're hosed.
> >
> >
>
> Hmm, indeed. But this patch handles this case:
>
> +static const struct pmu *bp_perf_counter_init(struct perf_counter *counter)
> +{
> + if (hw_breakpoint_perf_init((unsigned long)counter->attr.config))
> + return NULL;
> +
>
> IIRC, hw_breakpoint_perf_init() calls register_kernel_breakpoint() which in turn
> returns -ENOSPC if we haven't any breakpoint room left.
>
> It seems we can only set 4 breakpoints simultaneously in x86, or
> something close to that.

Ah, that's not the correct way of doing that. Suppose that you would
register 4 breakpoint counter to one task, that would leave you unable
to register a breakpoint counter on another task. Even though these
breakpoints would never be scheduled simultaneously.

Also, regular perf counters would multiplex counters when over-committed
on a hardware resource, allowing you to create more such breakpoints
than you have actual hardware slots.

The way to do this is to create a breakpoint pmu which would simply fail
the pmu->enable() method if there are insufficient hardware resources
available.

Also, your init routine, the above hw_breakpoint_perf_init(), will have
to verify that when the counter is part of a group, this and all other
hw breakpoint counters in that group can, now, but also in the future,
be scheduled simultaneously.

This means that there should be some arbitration towards other in-kernel
hw breakpoint users, because if you allow all 4 hw breakpoints in a
group and then let another hw breakpoint users have one, you can never
schedule that group again.

[ which raises a fun point, Paulus do we handle groups having multiple
'hardware' pmu's in? ]

Now, for the actual counter implementation you can probably re-use the
swcounter code, but you also need a pmu implementation.

2009-07-24 17:47:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Fri, Jul 24, 2009 at 04:26:09PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-07-24 at 16:02 +0200, Fr?d?ric Weisbecker wrote:
> > 2009/7/23 Peter Zijlstra <[email protected]>:
> > > On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> > >> This adds the support for kernel hardware breakpoints in perfcounter.
> > >> It is added as a new type of software counter and can be defined by
> > >> using the counter number 5 and by passsing the address of the
> > >> breakpoint to set through the config attribute.
> > >
> > > Is there a limit to these hardware breakpoints? If so, the software
> > > counter model is not sufficient, since we assume we can always schedule
> > > all software counters. However if you were to add more counters than you
> > > have hardware breakpoints you're hosed.
> > >
> > >
> >
> > Hmm, indeed. But this patch handles this case:
> >
> > +static const struct pmu *bp_perf_counter_init(struct perf_counter *counter)
> > +{
> > + if (hw_breakpoint_perf_init((unsigned long)counter->attr.config))
> > + return NULL;
> > +
> >
> > IIRC, hw_breakpoint_perf_init() calls register_kernel_breakpoint() which in turn
> > returns -ENOSPC if we haven't any breakpoint room left.
> >
> > It seems we can only set 4 breakpoints simultaneously in x86, or
> > something close to that.
>
> Ah, that's not the correct way of doing that. Suppose that you would
> register 4 breakpoint counter to one task, that would leave you unable
> to register a breakpoint counter on another task. Even though these
> breakpoints would never be scheduled simultaneously.



Ah, but the breakpoint API deals with that.
We have two types of breakpoints: the kernel bp and the user bp.
The kernel breakpoints are global points that don't deal with task
scheduling, virtual registers, etc...

But the user breakpoints (eg: used with ptrace) are dealt with virtual
debug registers in a way similar to perfcounter: each time we switch the
current task on a cpu, the hardware register states are stored in the
thread, and we load the virtual values into the hardware for the next
thread.

However, this patchset only deals with kernel breakpoint for now (wide
tracing).


> Also, regular perf counters would multiplex counters when over-committed
> on a hardware resource, allowing you to create more such breakpoints
> than you have actual hardware slots.



In the task level I talked above?



> The way to do this is to create a breakpoint pmu which would simply fail
> the pmu->enable() method if there are insufficient hardware resources
> available.



Now I wonder if the code that manages hardware debug breakpoint task switching
and the code from perfcounter could be factorized in one common thing.


> Also, your init routine, the above hw_breakpoint_perf_init(), will have
> to verify that when the counter is part of a group, this and all other
> hw breakpoint counters in that group can, now, but also in the future,
> be scheduled simultaneously.


This is already dealt from the hardware breakpoint API.
We use only one breakpoint register for the user breakpoints, and the rest
for kernel breakpoints. Also if no user breakpoint is registered, every
registers can be used for kernek breakpoints.


> This means that there should be some arbitration towards other in-kernel
> hw breakpoint users, because if you allow all 4 hw breakpoints in a
> group and then let another hw breakpoint users have one, you can never
> schedule that group again.


That's also why I think it's better to keep this virtual register management
from inside the breakpoint API, so that it can deal with perfcounter, ptrace,
etc... at the same.

What do you think?


>
> [ which raises a fun point, Paulus do we handle groups having multiple
> 'hardware' pmu's in? ]
>
> Now, for the actual counter implementation you can probably re-use the
> swcounter code, but you also need a pmu implementation.
>

2009-07-24 20:19:09

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

Fr?d?ric Weisbecker wrote:
> 2009/7/20, Peter Zijlstra <[email protected]>:
>> On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
>>> This adds the support for kernel hardware breakpoints in perfcounter.
>>> It is added as a new type of software counter and can be defined by
>>> using the counter number 5 and by passsing the address of the
>>> breakpoint to set through the config attribute.
>> This seems to be IP based breakpoints. Are there plans for data based
>> breakpoints as well? In that case we might want to think about the
>> namespace issue, we cannot both call them breakpoint/bp etc.. ;-)
>
>
> Nop, by default these breakpoints trigger on READ/WRITE accesses, it's
> meant for data.
> The example in the changelog profiles the bkl accesses, not by tracing
> lock_kernel() or so...but by tracing the kernel_flag spinlock itself.
>
> So it's the opposite, we may start thinking about naming issues
> against possible future plans
> for IP breakpoint :-)
>
> But actually for the latter case, I would suggest Kprobe...

Yeah, after kprobes-based event tracer is supported on ftrace,
it is easy to set up IP breakpoints in the kernel.
(maybe, I need to enhance it for perfcounter.)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2009-07-25 02:37:31

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] hw-breakpoints: Make kernel breakpoints API truly generic

On Mon, Jul 20, 2009 at 01:27:48PM -0400, Mathieu Desnoyers wrote:
> * Frederic Weisbecker ([email protected]) wrote:
> [...]
> > diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
> > index c1f64e6..015fec6 100644
> > --- a/kernel/hw_breakpoint.c
> > +++ b/kernel/hw_breakpoint.c
> > @@ -297,15 +297,21 @@ EXPORT_SYMBOL_GPL(unregister_user_hw_breakpoint);
> > /**
> > * register_kernel_hw_breakpoint - register a hardware breakpoint for kernel space
> > * @bp: the breakpoint structure to register
> > - *
> > - * @bp.info->name or @bp.info->address, @bp.info->len, @bp.info->type and
> > + * @addr: the address where we want to set the breakpoint
> > + * @len: length of the value in memory to break in
> > + * @type: the type of the breakpoint (read/write/execute)
> > * @bp->triggered must be set properly before invocation
>
> Hi Frederic,
>
> I think one of the great addition in this patchset is to allow using
> breakpoints from arch-agnostic code.
>
> It becomes important to document the error values which can be returned
> by register_kernel_hw_breakpoint, so this will serve as guidelines for
> architecture-specific arch_fill_hw_breakpoint() implementation. This
> will become increasingly important, as this abstraction layer will
> basically be responsible for either:
>
> - Finding the best support the architecture can provide for a given hw
> breakpoint.


Indeed, and that's the biggest problem it has to face because supported
hardware breakpoint features are very differents from one architecture
to another.


> - Failing with an explicit error value telling the in-kernel user why it
> failed (e.g. if it must use a fallback, or return the error to the
> user).


Yeah, nice point, I'll send another iteration which better documents the error
return values, at least once I get a mostly agreed core implementation :-)


> Maybe we should think of a more flexible breakpoint type mapping too,
> e.g.:
>
> monitor _strictly_ execute operation on address 0x...
> -> would fail if the architecture does not support execution access
> monitoring
> monitor (at least) execute operations on address 0x...
> -> would be allowed to use a more general monitor (e.g. RWX) if the
> architecture does not support "execute only" monitor.
>
> (same for read and write)
>
> Mathieu


Well, I'm not sure the problem mostly resides in the hardware implementation
of strict exec breakpoint types. But I guess your point is not limiting to
that. Yeah for example, x86 doesn't support read-only breakpoints.
But I guess that can be simulated using software artifacts, for example using
READ-WRITE breakpoints + the x86 decoder API, recently submitted by Masami,
to find the nature of the current instruction.

Anyway, your point is indeed important: return common error values for unsupported
breakpoint operations.

Thanks.

>
> > *
> > */
> > -int register_kernel_hw_breakpoint(struct hw_breakpoint *bp)
> > +int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, unsigned long addr,
> > + int len, enum breakpoint_type type)
> > {
> > int rc;
> >
> > + rc = arch_fill_hw_breakpoint(bp, addr, len, type);
> > + if (rc)
> > + return rc;
> > +
> > rc = arch_validate_hwbkpt_settings(bp, NULL);
> > if (rc)
> > return rc;
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-07-25 02:56:51

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] hw-breakpoints: Make kernel breakpoints API truly generic

On Tue, Jul 21, 2009 at 04:45:17PM +0530, K.Prasad wrote:
> On Mon, Jul 20, 2009 at 01:08:03PM -0400, Frederic Weisbecker wrote:
> > To define a kernel hardware breakpoint, one need to define the
> > address, type and length of the breakpoint using arch specific
> > operations and then register it using a core helper.
> >
> > The first stage is truly not scalable with respect to the number of
> > archictures, because for each of them that support hardware
> > breakpoints, we would need a seperate specific field definition for
> > the breakpoint.
> >
> > However, the supported breakpoint functionalities may be very different
> > between architectures.
> > Then this new API tries to compose with the following constraints:
> >
> > - a given architecture may perhaps not support the triggering on one
> > of the usual memory access (read-write/read/write/execute)
> >
> > - a given architecture may perhaps not support the ability to trigger
> > a breakpoint only on specific memory access size lower than the word
> > size for this arch.
> >
> > - a given architecture may perhaps not support breakpoints on addresses
> > range.
> >
> > The new API changes the following prototype for a kernel breakpoint
> > registration:
> >
> > int register_kernel_hw_breakpoint(struct hw_breakpoint *bp)
> >
> > into:
> >
> > int register_kernel_hw_breakpoint(struct hw_breakpoint *bp,
> > unsigned long addr,
> > int len, enum breakpoint_type type)
>
> It is not clear how adding these new parameters to the interface would
> help it become generic, as opposed to moving them to 'struct
> hw_breakpoint'.
>
> It would make the usage cumbersome of some architectures - say for
> instance the PPC64 which always has a breakpoint length of 8 bytes. So
> the user needs to specify either '8' always or '0' to indicate variable
> length not supported (but it is counter-intuitive..may be interpreted as
> zero-length).


Hm, so may be we should only push the target and the type and set the most
intuitive access length automatically? ie: 1 for x86 so that it triggers
for every accesses, and 8 for ppc?

Also how is actually interpreted an 8 bytes access on ppc? Ie: is a single
byte access actually interpreted through an 8 byte access because of memory
access constraints?



> >
> > The choice of passing the breakpoint settings as parameters of the
> > registration helper and not by adding generic fields into the breakpoint
> > structure is motivated by the need of a very specific per arch
> > representation of the breakpoint:
> >
> > - the arch may only need an address, but could also need a couple for
> > breakpoints in ranges.
> > - the type is subject to arch interpretation (values of debug registers)
> > - the length too.
> >
> > Then, to get back these values from a generic breakpoint structure that have
> > specific encodings into the arch fields, this API comes along with abstract
> > accessors which implementation is arch specific:
> >
> > - type hw_breakpoint_type(bp)
> > - addr hw_breakpoint_addr(bp)
> >
> > However, open debates come along this RFC patch:
> >
> > - the address could be a generic field in struct hw_breakpoint. If we
> > are dealing with a range breakpoint, then we would just need to
> > compute addr + length to get the end of the range.
> >
> > - the length and type could also be generic fields of
> > struct hw_breakpoint. It would then be up to the arch to get a
> > translation between such generic values and per arch needs.
> >
>
> While the issues have been enumerated above, the patchset only pushes
> the issue into a different domain i.e. make the user determine if a
> breakpoint type or len is supported in a given architecture vs the existing
> implementation in which the user determines if a constant pertaining to
> a given len/type is defined. But the accessor-routines
> hw_breakpoint_type() and hw_breakpoint_addr() make it much easier to use
> and is a good addition.
>
> To make the usage much easier, I would see a combination of the
> following:
>
> - Define constants/enums for length and type that are common to all
> architectures.


Indeed, in the hope that most basic types of breakpoints are supported
by every architectures (x and rw).


> - Define accessor routines that help determine if a given type/len is
> supported on the host processor.


Or may be we should forget about the length as stated above?

But indeed we can think about an arch routine that reports the supported
types, that would provide a better granularity of error message than getting
a "non implemented" error while registering a breakpoint.


> - Move fields such as address, len and type to generic breakpoint
> structure (if it still matters despite the two changes above).


Ok, I guess that would indeed be better to do that than using abstract
accessors. Let the arch interpret these generic informations by itself.

Thanks!


> Let me know what you think.
>
> Thanks,
> K.Prasad
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-07-25 10:58:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Fri, 2009-07-24 at 19:47 +0200, Frederic Weisbecker wrote:
> On Fri, Jul 24, 2009 at 04:26:09PM +0200, Peter Zijlstra wrote:
> > On Fri, 2009-07-24 at 16:02 +0200, Frédéric Weisbecker wrote:
> > > 2009/7/23 Peter Zijlstra <[email protected]>:
> > > > On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> > > >> This adds the support for kernel hardware breakpoints in perfcounter.
> > > >> It is added as a new type of software counter and can be defined by
> > > >> using the counter number 5 and by passsing the address of the
> > > >> breakpoint to set through the config attribute.
> > > >
> > > > Is there a limit to these hardware breakpoints? If so, the software
> > > > counter model is not sufficient, since we assume we can always schedule
> > > > all software counters. However if you were to add more counters than you
> > > > have hardware breakpoints you're hosed.
> > > >
> > > >
> > >
> > > Hmm, indeed. But this patch handles this case:
> > >
> > > +static const struct pmu *bp_perf_counter_init(struct perf_counter *counter)
> > > +{
> > > + if (hw_breakpoint_perf_init((unsigned long)counter->attr.config))
> > > + return NULL;
> > > +
> > >
> > > IIRC, hw_breakpoint_perf_init() calls register_kernel_breakpoint() which in turn
> > > returns -ENOSPC if we haven't any breakpoint room left.
> > >
> > > It seems we can only set 4 breakpoints simultaneously in x86, or
> > > something close to that.
> >
> > Ah, that's not the correct way of doing that. Suppose that you would
> > register 4 breakpoint counter to one task, that would leave you unable
> > to register a breakpoint counter on another task. Even though these
> > breakpoints would never be scheduled simultaneously.
>
>
>
> Ah, but the breakpoint API deals with that.
> We have two types of breakpoints: the kernel bp and the user bp.
> The kernel breakpoints are global points that don't deal with task
> scheduling, virtual registers, etc...
>
> But the user breakpoints (eg: used with ptrace) are dealt with virtual
> debug registers in a way similar to perfcounter: each time we switch the
> current task on a cpu, the hardware register states are stored in the
> thread, and we load the virtual values into the hardware for the next
> thread.

Ah, but that is sub-optimal, perf counters doesn't actually change the
state if both tasks have the same counter configuration. Yielding a
great performance benefit on scheduling intensive workloads. Poking at
these MSRs, esp. writing to them is very expensive.

So I would suggest not using that feature of the breakpoint API for the
perf counter integration.

> However, this patchset only deals with kernel breakpoint for now (wide
> tracing).

Right, and that's all you would need for perf counter support, please
don't use whatever task state handling you have in place.

> > Also, regular perf counters would multiplex counters when over-committed
> > on a hardware resource, allowing you to create more such breakpoints
> > than you have actual hardware slots.

> In the task level I talked above?

For either cpu or task level.

> > The way to do this is to create a breakpoint pmu which would simply fail
> > the pmu->enable() method if there are insufficient hardware resources
> > available.

> Now I wonder if the code that manages hardware debug breakpoint task switching
> and the code from perfcounter could be factorized in one common thing.

Dunno, its really not that hard to RR a list of counters/breakpoint.

> > Also, your init routine, the above hw_breakpoint_perf_init(), will have
> > to verify that when the counter is part of a group, this and all other
> > hw breakpoint counters in that group can, now, but also in the future,
> > be scheduled simultaneously.

> This is already dealt from the hardware breakpoint API.
> We use only one breakpoint register for the user breakpoints, and the rest
> for kernel breakpoints. Also if no user breakpoint is registered, every
> registers can be used for kernek breakpoints.

This means that you can only ever allow 3 breakpoints into any one group
and have to ensure that no other user can come in when they're not in
active use -- the group is scheduled out.

That is, you have to reserve the max number of breakpoint in a group for
exclusive use by perf counters.

Also, this 1 for userspace seems restrictive. I'd want to have all 4
from GDB if I'd knew my hardware was capable and I'd needed that many.

> > This means that there should be some arbitration towards other in-kernel
> > hw breakpoint users, because if you allow all 4 hw breakpoints in a
> > group and then let another hw breakpoint users have one, you can never
> > schedule that group again.

> That's also why I think it's better to keep this virtual register management
> from inside the breakpoint API, so that it can deal with perfcounter, ptrace,
> etc... at the same.
>
> What do you think?

I think not. I think the breakpoint API should not do task state, or at
least have an interface without this.

Having two multiplexing layers on top of one another is inefficient and
error prone.

2009-07-25 14:19:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Sat, Jul 25, 2009 at 12:56:56PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-07-24 at 19:47 +0200, Frederic Weisbecker wrote:
> > On Fri, Jul 24, 2009 at 04:26:09PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2009-07-24 at 16:02 +0200, Fr?d?ric Weisbecker wrote:
> > > > 2009/7/23 Peter Zijlstra <[email protected]>:
> > > > > On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> > > > >> This adds the support for kernel hardware breakpoints in perfcounter.
> > > > >> It is added as a new type of software counter and can be defined by
> > > > >> using the counter number 5 and by passsing the address of the
> > > > >> breakpoint to set through the config attribute.
> > > > >
> > > > > Is there a limit to these hardware breakpoints? If so, the software
> > > > > counter model is not sufficient, since we assume we can always schedule
> > > > > all software counters. However if you were to add more counters than you
> > > > > have hardware breakpoints you're hosed.
> > > > >
> > > > >
> > > >
> > > > Hmm, indeed. But this patch handles this case:
> > > >
> > > > +static const struct pmu *bp_perf_counter_init(struct perf_counter *counter)
> > > > +{
> > > > + if (hw_breakpoint_perf_init((unsigned long)counter->attr.config))
> > > > + return NULL;
> > > > +
> > > >
> > > > IIRC, hw_breakpoint_perf_init() calls register_kernel_breakpoint() which in turn
> > > > returns -ENOSPC if we haven't any breakpoint room left.
> > > >
> > > > It seems we can only set 4 breakpoints simultaneously in x86, or
> > > > something close to that.
> > >
> > > Ah, that's not the correct way of doing that. Suppose that you would
> > > register 4 breakpoint counter to one task, that would leave you unable
> > > to register a breakpoint counter on another task. Even though these
> > > breakpoints would never be scheduled simultaneously.
> >
> >
> >
> > Ah, but the breakpoint API deals with that.
> > We have two types of breakpoints: the kernel bp and the user bp.
> > The kernel breakpoints are global points that don't deal with task
> > scheduling, virtual registers, etc...
> >
> > But the user breakpoints (eg: used with ptrace) are dealt with virtual
> > debug registers in a way similar to perfcounter: each time we switch the
> > current task on a cpu, the hardware register states are stored in the
> > thread, and we load the virtual values into the hardware for the next
> > thread.
>
> Ah, but that is sub-optimal, perf counters doesn't actually change the
> state if both tasks have the same counter configuration. Yielding a
> great performance benefit on scheduling intensive workloads. Poking at
> these MSRs, esp. writing to them is very expensive.


Ah ok.


> So I would suggest not using that feature of the breakpoint API for the
> perf counter integration.


That would forbid some kinds of profiling (explanations below).


> > However, this patchset only deals with kernel breakpoint for now (wide
> > tracing).
>
> Right, and that's all you would need for perf counter support, please
> don't use whatever task state handling you have in place.


I would actually propose to have a separate layer that manages
the hardware registers <-> per thread virtual registers handling
for things like breakpoint api and perfcounter.

I know a simple RR of registers is not that hard to write, but at
least that can allow simultaneous use of perfcounter and other users
of breakpoint API without having two different versions of register
management.


> > > Also, regular perf counters would multiplex counters when over-committed
> > > on a hardware resource, allowing you to create more such breakpoints
> > > than you have actual hardware slots.
>
> > In the task level I talked above?
>
> For either cpu or task level.
>
> > > The way to do this is to create a breakpoint pmu which would simply fail
> > > the pmu->enable() method if there are insufficient hardware resources
> > > available.
>
> > Now I wonder if the code that manages hardware debug breakpoint task switching
> > and the code from perfcounter could be factorized in one common thing.
>
> Dunno, its really not that hard to RR a list of counters/breakpoint.
>
> > > Also, your init routine, the above hw_breakpoint_perf_init(), will have
> > > to verify that when the counter is part of a group, this and all other
> > > hw breakpoint counters in that group can, now, but also in the future,
> > > be scheduled simultaneously.
>
> > This is already dealt from the hardware breakpoint API.
> > We use only one breakpoint register for the user breakpoints, and the rest
> > for kernel breakpoints. Also if no user breakpoint is registered, every
> > registers can be used for kernek breakpoints.
>
> This means that you can only ever allow 3 breakpoints into any one group
> and have to ensure that no other user can come in when they're not in
> active use -- the group is scheduled out.
> That is, you have to reserve the max number of breakpoint in a group for
> exclusive use by perf counters.


Hmm, if we reserve all breakpoints registers for perfcounter exclusive use
when it runs, that excludes any profiling of ptrace while doing a POKE_USR
or gdb while using breakpoints.

That's why I think it would be better to make use of the hardware breakpoints
from perfcounter using the bp API. Doing so allows concurrent users of bp while
perf is using them. Then we have no restriction concerning the profiling of
code that uses breakpoints.

Using a seperate hardware register <-> virtual register management layer
would then solve the problem of two different ad hoc implementations to
maintain and which impacts profiling performances.


> Also, this 1 for userspace seems restrictive. I'd want to have all 4
> from GDB if I'd knew my hardware was capable and I'd needed that many.


Actually I've made a mistake, you can use several user breakpoints, as
many as the number of hardware breakpoints, minus the number of kernel
bp currently set.


>
> > > This means that there should be some arbitration towards other in-kernel
> > > hw breakpoint users, because if you allow all 4 hw breakpoints in a
> > > group and then let another hw breakpoint users have one, you can never
> > > schedule that group again.
>
> > That's also why I think it's better to keep this virtual register management
> > from inside the breakpoint API, so that it can deal with perfcounter, ptrace,
> > etc... at the same.
> >
> > What do you think?
>
> I think not. I think the breakpoint API should not do task state, or at
> least have an interface without this.
>
> Having two multiplexing layers on top of one another is inefficient and
> error prone.


And what do you think of the above idea?

Thanks.

2009-07-25 15:38:58

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] hw-breakpoints: Make kernel breakpoints API truly generic

* Frederic Weisbecker ([email protected]) wrote:
> On Mon, Jul 20, 2009 at 01:27:48PM -0400, Mathieu Desnoyers wrote:
[...]
>
> > Maybe we should think of a more flexible breakpoint type mapping too,
> > e.g.:
> >
> > monitor _strictly_ execute operation on address 0x...
> > -> would fail if the architecture does not support execution access
> > monitoring
> > monitor (at least) execute operations on address 0x...
> > -> would be allowed to use a more general monitor (e.g. RWX) if the
> > architecture does not support "execute only" monitor.
> >
> > (same for read and write)
> >
> > Mathieu
>
>
> Well, I'm not sure the problem mostly resides in the hardware implementation
> of strict exec breakpoint types. But I guess your point is not limiting to
> that. Yeah for example, x86 doesn't support read-only breakpoints.

Exactly. I used "execute" only as an example, but in the end, we could
end up with a list looking like:

HW_WATCH_R (1 << 0)
HW_WATCH_NOT_R (1 << 0)
HW_WATCH_W (1 << 1)
HW_WATCH_NOT_W (1 << 1)
HW_WATCH_X (1 << 2)
HW_WATCH_NOT_X (1 << 2)

So for instance, flags :

HW_WATCH_R|HW_WATCH_NOT_W|HW_WATCH_NOT_X

would specify that the architecture has to support a "read" watchpoint
which does not trigger on write nor execute.

HW_WATCH_R

would specify that the architecture _must_ support "read" watchpoint,
and we don't care about W or X.

HW_WATCH_R|HW_WATCH_W|HW_WATCH_X

Would ask for watching rwx on an address. The architecture would have to
support all those three.

Some combinations would be invalid (e.g. HW_WATCH_R|HW_WATCH_NOT_R).

There might be better ways to express this, but at this it should show
my point a bit more clearly.

Mathieu


> But I guess that can be simulated using software artifacts, for example using
> READ-WRITE breakpoints + the x86 decoder API, recently submitted by Masami,
> to find the nature of the current instruction.
>
> Anyway, your point is indeed important: return common error values for unsupported
> breakpoint operations.
>
> Thanks.
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-07-25 15:51:37

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

* Frederic Weisbecker ([email protected]) wrote:
> On Sat, Jul 25, 2009 at 12:56:56PM +0200, Peter Zijlstra wrote:
> > On Fri, 2009-07-24 at 19:47 +0200, Frederic Weisbecker wrote:
> > > On Fri, Jul 24, 2009 at 04:26:09PM +0200, Peter Zijlstra wrote:
> > > > On Fri, 2009-07-24 at 16:02 +0200, Fr?d?ric Weisbecker wrote:
> > > > > 2009/7/23 Peter Zijlstra <[email protected]>:
> > > > > > On Mon, 2009-07-20 at 13:08 -0400, Frederic Weisbecker wrote:
> > > > > >> This adds the support for kernel hardware breakpoints in perfcounter.
> > > > > >> It is added as a new type of software counter and can be defined by
> > > > > >> using the counter number 5 and by passsing the address of the
> > > > > >> breakpoint to set through the config attribute.
> > > > > >
> > > > > > Is there a limit to these hardware breakpoints? If so, the software
> > > > > > counter model is not sufficient, since we assume we can always schedule
> > > > > > all software counters. However if you were to add more counters than you
> > > > > > have hardware breakpoints you're hosed.
> > > > > >
> > > > > >
> > > > >
> > > > > Hmm, indeed. But this patch handles this case:
> > > > >
> > > > > +static const struct pmu *bp_perf_counter_init(struct perf_counter *counter)
> > > > > +{
> > > > > + if (hw_breakpoint_perf_init((unsigned long)counter->attr.config))
> > > > > + return NULL;
> > > > > +
> > > > >
> > > > > IIRC, hw_breakpoint_perf_init() calls register_kernel_breakpoint() which in turn
> > > > > returns -ENOSPC if we haven't any breakpoint room left.
> > > > >
> > > > > It seems we can only set 4 breakpoints simultaneously in x86, or
> > > > > something close to that.
> > > >
> > > > Ah, that's not the correct way of doing that. Suppose that you would
> > > > register 4 breakpoint counter to one task, that would leave you unable
> > > > to register a breakpoint counter on another task. Even though these
> > > > breakpoints would never be scheduled simultaneously.
> > >
> > >
> > >
> > > Ah, but the breakpoint API deals with that.
> > > We have two types of breakpoints: the kernel bp and the user bp.
> > > The kernel breakpoints are global points that don't deal with task
> > > scheduling, virtual registers, etc...
> > >
> > > But the user breakpoints (eg: used with ptrace) are dealt with virtual
> > > debug registers in a way similar to perfcounter: each time we switch the
> > > current task on a cpu, the hardware register states are stored in the
> > > thread, and we load the virtual values into the hardware for the next
> > > thread.
> >
> > Ah, but that is sub-optimal, perf counters doesn't actually change the
> > state if both tasks have the same counter configuration. Yielding a
> > great performance benefit on scheduling intensive workloads. Poking at
> > these MSRs, esp. writing to them is very expensive.
>
>
> Ah ok.
>
>
> > So I would suggest not using that feature of the breakpoint API for the
> > perf counter integration.
>
>
> That would forbid some kinds of profiling (explanations below).
>
>
> > > However, this patchset only deals with kernel breakpoint for now (wide
> > > tracing).
> >
> > Right, and that's all you would need for perf counter support, please
> > don't use whatever task state handling you have in place.
>
>
> I would actually propose to have a separate layer that manages
> the hardware registers <-> per thread virtual registers handling
> for things like breakpoint api and perfcounter.
>
> I know a simple RR of registers is not that hard to write, but at
> least that can allow simultaneous use of perfcounter and other users
> of breakpoint API without having two different versions of register
> management.
>
>
> > > > Also, regular perf counters would multiplex counters when over-committed
> > > > on a hardware resource, allowing you to create more such breakpoints
> > > > than you have actual hardware slots.
> >
> > > In the task level I talked above?
> >
> > For either cpu or task level.
> >
> > > > The way to do this is to create a breakpoint pmu which would simply fail
> > > > the pmu->enable() method if there are insufficient hardware resources
> > > > available.
> >
> > > Now I wonder if the code that manages hardware debug breakpoint task switching
> > > and the code from perfcounter could be factorized in one common thing.
> >
> > Dunno, its really not that hard to RR a list of counters/breakpoint.
> >
> > > > Also, your init routine, the above hw_breakpoint_perf_init(), will have
> > > > to verify that when the counter is part of a group, this and all other
> > > > hw breakpoint counters in that group can, now, but also in the future,
> > > > be scheduled simultaneously.
> >
> > > This is already dealt from the hardware breakpoint API.
> > > We use only one breakpoint register for the user breakpoints, and the rest
> > > for kernel breakpoints. Also if no user breakpoint is registered, every
> > > registers can be used for kernek breakpoints.
> >
> > This means that you can only ever allow 3 breakpoints into any one group
> > and have to ensure that no other user can come in when they're not in
> > active use -- the group is scheduled out.
> > That is, you have to reserve the max number of breakpoint in a group for
> > exclusive use by perf counters.
>
>
> Hmm, if we reserve all breakpoints registers for perfcounter exclusive use
> when it runs, that excludes any profiling of ptrace while doing a POKE_USR
> or gdb while using breakpoints.
>
> That's why I think it would be better to make use of the hardware breakpoints
> from perfcounter using the bp API. Doing so allows concurrent users of bp while
> perf is using them. Then we have no restriction concerning the profiling of
> code that uses breakpoints.
>
> Using a seperate hardware register <-> virtual register management layer
> would then solve the problem of two different ad hoc implementations to
> maintain and which impacts profiling performances.
>
>
> > Also, this 1 for userspace seems restrictive. I'd want to have all 4
> > from GDB if I'd knew my hardware was capable and I'd needed that many.
>
>
> Actually I've made a mistake, you can use several user breakpoints, as
> many as the number of hardware breakpoints, minus the number of kernel
> bp currently set.
>
>
> >
> > > > This means that there should be some arbitration towards other in-kernel
> > > > hw breakpoint users, because if you allow all 4 hw breakpoints in a
> > > > group and then let another hw breakpoint users have one, you can never
> > > > schedule that group again.
> >
> > > That's also why I think it's better to keep this virtual register management
> > > from inside the breakpoint API, so that it can deal with perfcounter, ptrace,
> > > etc... at the same.
> > >
> > > What do you think?
> >
> > I think not. I think the breakpoint API should not do task state, or at
> > least have an interface without this.
> >
> > Having two multiplexing layers on top of one another is inefficient and
> > error prone.
>
>
> And what do you think of the above idea?
>

I agree with your idea to split the hardware counter management from
virtual per-process counters. IMO, this limited resource needs to be
managed centrally at one location and because system-wide level
performance counters do not need to flip the performance counters
depending on the current task. We can easily think of an embedded system
where providing system-wide performance counters would be important (for
tracing for instance), but which would compile-out the per-task
performance counters to save space.

Note that you will have to deal with some policy here, because you can
have performance counter reservation asked from both the kernel
(for either kernel and per-task watchpoints) and from userspace (for
per-task watchpoints).

You have to think of scenarios like:

A userspace application asks for all hardware watchpoints, and then the
kernel needs 1-2 of them. The application should probably be "kicked
out" of 2 of those watchpoints _after_ they have been set up, otherwise
this would allow a resource exhaustion to be performed by the
application at the detriment of the kernel.

Having the ability for a performance counter read to fail when a virtual
performance counter has been kicked-out by a more important user would
allow the kernel to take control of the performance counters without
letting userspace in its way.

Mathieu



> Thanks.
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-07-25 16:23:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Sat, 2009-07-25 at 16:19 +0200, Frederic Weisbecker wrote:

> > Ah, but that is sub-optimal, perf counters doesn't actually change the
> > state if both tasks have the same counter configuration. Yielding a
> > great performance benefit on scheduling intensive workloads. Poking at
> > these MSRs, esp. writing to them is very expensive.
>
>
> Ah ok.
>
>
> > So I would suggest not using that feature of the breakpoint API for the
> > perf counter integration.
>
>
> That would forbid some kinds of profiling (explanations below).
>
>
> > > However, this patchset only deals with kernel breakpoint for now (wide
> > > tracing).
> >
> > Right, and that's all you would need for perf counter support, please
> > don't use whatever task state handling you have in place.
>
>
> I would actually propose to have a separate layer that manages
> the hardware registers <-> per thread virtual registers handling
> for things like breakpoint api and perfcounter.
>
> I know a simple RR of registers is not that hard to write, but at
> least that can allow simultaneous use of perfcounter and other users
> of breakpoint API without having two different versions of register
> management.

I simply cannot see how you would be able to multiplex userspace/debug
breakpoints. I'd utterly hate it if I'd missed a breakpoint simply
because someone else also wanted to make use of it.

I'd declare the system broken and useless.

Counters OTOH can be multiplexed because of their statistical nature,
you can simply scale them back up based on their time share.

Therefore you'll have to deal with hard reservations anyway.

Also, you don't need to a-priory reserve all breakpoints, you'll simply
need as many as the largest group (wrt breakpoints) has.

> > > This is already dealt from the hardware breakpoint API.
> > > We use only one breakpoint register for the user breakpoints, and the rest
> > > for kernel breakpoints. Also if no user breakpoint is registered, every
> > > registers can be used for kernek breakpoints.
> >
> > This means that you can only ever allow 3 breakpoints into any one group
> > and have to ensure that no other user can come in when they're not in
> > active use -- the group is scheduled out.
> > That is, you have to reserve the max number of breakpoint in a group for
> > exclusive use by perf counters.
>
>
> Hmm, if we reserve all breakpoints registers for perfcounter exclusive use
> when it runs, that excludes any profiling of ptrace while doing a POKE_USR
> or gdb while using breakpoints.
>
> That's why I think it would be better to make use of the hardware breakpoints
> from perfcounter using the bp API. Doing so allows concurrent users of bp while
> perf is using them. Then we have no restriction concerning the profiling of
> code that uses breakpoints.

Like said above, you cannot do this.

> Using a seperate hardware register <-> virtual register management layer
> would then solve the problem of two different ad hoc implementations to
> maintain and which impacts profiling performances.
>
>
> > Also, this 1 for userspace seems restrictive. I'd want to have all 4
> > from GDB if I'd knew my hardware was capable and I'd needed that many.
>
>
> Actually I've made a mistake, you can use several user breakpoints, as
> many as the number of hardware breakpoints, minus the number of kernel
> bp currently set.

Better :-)

2009-07-25 16:27:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Sat, 2009-07-25 at 11:51 -0400, Mathieu Desnoyers wrote:
> I agree with your idea to split the hardware counter management from
> virtual per-process counters. IMO, this limited resource needs to be
> managed centrally at one location and because system-wide level
> performance counters do not need to flip the performance counters
> depending on the current task.

System wide counters never care about the task state.

Its task counters we sometimes don't re-program counters for on context
switch when both tasks have the same configuration, saving greatly on
context switch costs.

> We can easily think of an embedded system
> where providing system-wide performance counters would be important (for
> tracing for instance), but which would compile-out the per-task
> performance counters to save space.

That doesn't make sense, the per task/global parts of perf counters are
tightly interwoven and don't differ much.

> Note that you will have to deal with some policy here, because you can
> have performance counter reservation asked from both the kernel
> (for either kernel and per-task watchpoints) and from userspace (for
> per-task watchpoints).

cpu counters can be both kernel and user
task counter can be both kernel and user.

Your above use of 'and' doesn't make sense.

2009-07-25 23:57:16

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Sat, Jul 25, 2009 at 06:22:52PM +0200, Peter Zijlstra wrote:
> On Sat, 2009-07-25 at 16:19 +0200, Frederic Weisbecker wrote:
>
> > > Ah, but that is sub-optimal, perf counters doesn't actually change the
> > > state if both tasks have the same counter configuration. Yielding a
> > > great performance benefit on scheduling intensive workloads. Poking at
> > > these MSRs, esp. writing to them is very expensive.
> >
> >
> > Ah ok.
> >
> >
> > > So I would suggest not using that feature of the breakpoint API for the
> > > perf counter integration.
> >
> >
> > That would forbid some kinds of profiling (explanations below).
> >
> >
> > > > However, this patchset only deals with kernel breakpoint for now (wide
> > > > tracing).
> > >
> > > Right, and that's all you would need for perf counter support, please
> > > don't use whatever task state handling you have in place.
> >
> >
> > I would actually propose to have a separate layer that manages
> > the hardware registers <-> per thread virtual registers handling
> > for things like breakpoint api and perfcounter.
> >
> > I know a simple RR of registers is not that hard to write, but at
> > least that can allow simultaneous use of perfcounter and other users
> > of breakpoint API without having two different versions of register
> > management.
>
> I simply cannot see how you would be able to multiplex userspace/debug
> breakpoints. I'd utterly hate it if I'd missed a breakpoint simply
> because someone else also wanted to make use of it.
>
> I'd declare the system broken and useless.
>
> Counters OTOH can be multiplexed because of their statistical nature,
> you can simply scale them back up based on their time share.
>
> Therefore you'll have to deal with hard reservations anyway.
>

I don't claim to have understood the requirements of perf-counters
fully, but I sense the persistence of a few doubts about the register
allocation mechanism of the hw-breakpoints API...thought it might be
worthwhile to briefly explain.

A few notes about the same:

Kernel-space breakpoints: They are system-wide i.e. one kernel-space
breakpoint request consumes one debug register on 'all' CPUs of the
system.
User-space breakpoints: Requests are stored in per-thread data
structures. Written onto the debug register only when scheduled-into
the CPU, and are removed when another task using the debug registers is
scheduled onto the CPU.
Debug register allocation mechanism: register request succeeds only when
the availability of a debug register is guaranteed (for both user and
kernel space). Allocation on a first-come-first-serve basis, no
priorities for register requests and hence no pre-emption of requests.

In short the available debug registers are divided into:

# of debug registers = # of kernel-space requests + MAX(# of all
user-space requests) + freely available debug registers.

Thus on an x86 system with 4 debug registers, if there exists 1
kernel-space request it consumes one debug register on all CPUs in the
system. The remaining 3 debug registers can be consumed by user-space
i.e. upto 3 user-space requests per-thread.

There is no restriction on how many debug registers can be consumed by
the kernel- or user-space requests and it is entirely dependent on the
number of free registers available for use then.

Thanks,
K.Prasad

2009-07-27 08:51:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Sun, 2009-07-26 at 05:27 +0530, K.Prasad wrote:

> I don't claim to have understood the requirements of perf-counters
> fully, but I sense the persistence of a few doubts about the register
> allocation mechanism of the hw-breakpoints API...thought it might be
> worthwhile to briefly explain.
>
> A few notes about the same:
>
> Kernel-space breakpoints: They are system-wide i.e. one kernel-space
> breakpoint request consumes one debug register on 'all' CPUs of the
> system.
> User-space breakpoints: Requests are stored in per-thread data
> structures. Written onto the debug register only when scheduled-into
> the CPU, and are removed when another task using the debug registers is
> scheduled onto the CPU.
> Debug register allocation mechanism: register request succeeds only when
> the availability of a debug register is guaranteed (for both user and
> kernel space). Allocation on a first-come-first-serve basis, no
> priorities for register requests and hence no pre-emption of requests.
>
> In short the available debug registers are divided into:
>
> # of debug registers = # of kernel-space requests + MAX(# of all
> user-space requests) + freely available debug registers.
>
> Thus on an x86 system with 4 debug registers, if there exists 1
> kernel-space request it consumes one debug register on all CPUs in the
> system. The remaining 3 debug registers can be consumed by user-space
> i.e. upto 3 user-space requests per-thread.
>
> There is no restriction on how many debug registers can be consumed by
> the kernel- or user-space requests and it is entirely dependent on the
> number of free registers available for use then.

Right, this is an utter miss-match for what perf counters would want.

Firstly, you seem to have this weird split of kernel/userspace
breakpoints. Perf counters looks at things in a per-cpu fashion, so the
all-cpus kernel breakpoint stuff is useless. Also, from perf counters'
POV its perfectly reasonable to have a per-task kernel breakpoint.

Secondly, perf counters wants to schedule the per task breakpoints
because we can optimize the context switch, saving lots of these MSR
writes under some common scenarios.

Thirdly, we can multiplex perf counters beyond their hardware maximum,
something you simply cannot do for a debug interface.

Like I said, please use the raw per-cpu breakpoint interface for perf
counters and connect that with the minimally required reservation you
need to make your other thing work.

You simply cannot put perf-counter breakpoints on top of whatever virt
layer you created going by what you say it is.

2009-07-28 00:18:52

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Sat, Jul 25, 2009 at 06:22:52PM +0200, Peter Zijlstra wrote:
> On Sat, 2009-07-25 at 16:19 +0200, Frederic Weisbecker wrote:
>
> > > Ah, but that is sub-optimal, perf counters doesn't actually change the
> > > state if both tasks have the same counter configuration. Yielding a
> > > great performance benefit on scheduling intensive workloads. Poking at
> > > these MSRs, esp. writing to them is very expensive.
> >
> >
> > Ah ok.
> >
> >
> > > So I would suggest not using that feature of the breakpoint API for the
> > > perf counter integration.
> >
> >
> > That would forbid some kinds of profiling (explanations below).
> >
> >
> > > > However, this patchset only deals with kernel breakpoint for now (wide
> > > > tracing).
> > >
> > > Right, and that's all you would need for perf counter support, please
> > > don't use whatever task state handling you have in place.
> >
> >
> > I would actually propose to have a separate layer that manages
> > the hardware registers <-> per thread virtual registers handling
> > for things like breakpoint api and perfcounter.
> >
> > I know a simple RR of registers is not that hard to write, but at
> > least that can allow simultaneous use of perfcounter and other users
> > of breakpoint API without having two different versions of register
> > management.
>
> I simply cannot see how you would be able to multiplex userspace/debug
> breakpoints. I'd utterly hate it if I'd missed a breakpoint simply
> because someone else also wanted to make use of it.


What I mean by multiplexing is that, say in x86, each task can have
4 breakpoints maximum. Once the task is scheduled out, its breakpoints
are saved and the hardware debug registers are used for the next task.
Once a task registers a breakpoint, it never looses it.


> I'd declare the system broken and useless.
>
> Counters OTOH can be multiplexed because of their statistical nature,
> you can simply scale them back up based on their time share.
>
> Therefore you'll have to deal with hard reservations anyway.
>
> Also, you don't need to a-priory reserve all breakpoints, you'll simply
> need as many as the largest group (wrt breakpoints) has.

I still don't understand why it is needed to reserve breakpoints for
a group of monitored tasks. Once they have registered their breakpoints,
the number of necessary hardware registers for these will be available
every time the task is scheduled.

By nature, MAX_NR (4 in x86) breakpoints are available for every tasks, minus
the number of wide kernel breakpoints in use.

I don't see the need of a reservation here (which is already done by the API),
I feel a bit confused in this debate.

2009-07-28 01:03:57

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Mon, Jul 27, 2009 at 10:53:37AM +0200, Peter Zijlstra wrote:
> On Sun, 2009-07-26 at 05:27 +0530, K.Prasad wrote:
>
> > I don't claim to have understood the requirements of perf-counters
> > fully, but I sense the persistence of a few doubts about the register
> > allocation mechanism of the hw-breakpoints API...thought it might be
> > worthwhile to briefly explain.
> >
> > A few notes about the same:
> >
> > Kernel-space breakpoints: They are system-wide i.e. one kernel-space
> > breakpoint request consumes one debug register on 'all' CPUs of the
> > system.
> > User-space breakpoints: Requests are stored in per-thread data
> > structures. Written onto the debug register only when scheduled-into
> > the CPU, and are removed when another task using the debug registers is
> > scheduled onto the CPU.
> > Debug register allocation mechanism: register request succeeds only when
> > the availability of a debug register is guaranteed (for both user and
> > kernel space). Allocation on a first-come-first-serve basis, no
> > priorities for register requests and hence no pre-emption of requests.
> >
> > In short the available debug registers are divided into:
> >
> > # of debug registers = # of kernel-space requests + MAX(# of all
> > user-space requests) + freely available debug registers.
> >
> > Thus on an x86 system with 4 debug registers, if there exists 1
> > kernel-space request it consumes one debug register on all CPUs in the
> > system. The remaining 3 debug registers can be consumed by user-space
> > i.e. upto 3 user-space requests per-thread.
> >
> > There is no restriction on how many debug registers can be consumed by
> > the kernel- or user-space requests and it is entirely dependent on the
> > number of free registers available for use then.
>
> Right, this is an utter miss-match for what perf counters would want.
>
> Firstly, you seem to have this weird split of kernel/userspace
> breakpoints. Perf counters looks at things in a per-cpu fashion, so the
> all-cpus kernel breakpoint stuff is useless. Also, from perf counters'
> POV its perfectly reasonable to have a per-task kernel breakpoint.


Hmm, indeed. The breakpoint api lacks these services for perf counter,
although targetting specific cpus seems a (somewhat) trivial add-on to me on the
bp API.

Also I forgot the kernel-only profiling things.


> Secondly, perf counters wants to schedule the per task breakpoints
> because we can optimize the context switch, saving lots of these MSR
> writes under some common scenarios.


Ok with that.
But if you have an efficient pmu scheduling, it may be still interesting
to make it a seperate core entity. The bp API may find some profit in it.

The breakpoint registers also require specific treatments with several registers
magic, all organized inside the arch API which also cares about cpu hotplug and
kexec things.


> Thirdly, we can multiplex perf counters beyond their hardware maximum,
> something you simply cannot do for a debug interface.


Again, I'm stuck in what you mean by multiplexing here :-)


> Like I said, please use the raw per-cpu breakpoint interface for perf
> counters and connect that with the minimally required reservation you
> need to make your other thing work.
>
> You simply cannot put perf-counter breakpoints on top of whatever virt
> layer you created going by what you say it is.


Well, indeed it seems perfcounter handles the counters to cover very specific
needs, uncovered by the debug API we have.
I guess we can create a struct pmu for the breakpoints and plug it with
some low level add-on in the breakpoint API.

2009-07-28 01:35:07

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] hw-breakpoints: Make kernel breakpoints API truly generic

On Sat, Jul 25, 2009 at 11:38:49AM -0400, Mathieu Desnoyers wrote:
> * Frederic Weisbecker ([email protected]) wrote:
> > On Mon, Jul 20, 2009 at 01:27:48PM -0400, Mathieu Desnoyers wrote:
> [...]
> >
> > > Maybe we should think of a more flexible breakpoint type mapping too,
> > > e.g.:
> > >
> > > monitor _strictly_ execute operation on address 0x...
> > > -> would fail if the architecture does not support execution access
> > > monitoring
> > > monitor (at least) execute operations on address 0x...
> > > -> would be allowed to use a more general monitor (e.g. RWX) if the
> > > architecture does not support "execute only" monitor.
> > >
> > > (same for read and write)
> > >
> > > Mathieu
> >
> >
> > Well, I'm not sure the problem mostly resides in the hardware implementation
> > of strict exec breakpoint types. But I guess your point is not limiting to
> > that. Yeah for example, x86 doesn't support read-only breakpoints.
>
> Exactly. I used "execute" only as an example, but in the end, we could
> end up with a list looking like:
>
> HW_WATCH_R (1 << 0)
> HW_WATCH_NOT_R (1 << 0)
> HW_WATCH_W (1 << 1)
> HW_WATCH_NOT_W (1 << 1)
> HW_WATCH_X (1 << 2)
> HW_WATCH_NOT_X (1 << 2)
>
> So for instance, flags :
>
> HW_WATCH_R|HW_WATCH_NOT_W|HW_WATCH_NOT_X
>
> would specify that the architecture has to support a "read" watchpoint
> which does not trigger on write nor execute.
>
> HW_WATCH_R
>
> would specify that the architecture _must_ support "read" watchpoint,
> and we don't care about W or X.
>
> HW_WATCH_R|HW_WATCH_W|HW_WATCH_X
>
> Would ask for watching rwx on an address. The architecture would have to
> support all those three.
>
> Some combinations would be invalid (e.g. HW_WATCH_R|HW_WATCH_NOT_R).
>
> There might be better ways to express this, but at this it should show
> my point a bit more clearly.
>
> Mathieu


Ok, I see your point.
Before making a choice to generically express the modes supported by an arch,
I need a good global view of what is supported by most archs,
thing that I lack and which I should look at quickly :)

Thanks.


>
> > But I guess that can be simulated using software artifacts, for example using
> > READ-WRITE breakpoints + the x86 decoder API, recently submitted by Masami,
> > to find the nature of the current instruction.
> >
> > Anyway, your point is indeed important: return common error values for unsupported
> > breakpoint operations.
> >
> > Thanks.
> >
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-07-28 07:22:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Tue, 2009-07-28 at 03:03 +0200, Frederic Weisbecker wrote:
> > Thirdly, we can multiplex perf counters beyond their hardware maximum,
> > something you simply cannot do for a debug interface.
>
>
> Again, I'm stuck in what you mean by multiplexing here :-)

If you'd create say 16 breakpoint counters, we'd RR them over the 4
available hardware breakpoints (or less when others are taken by someone
else).

Since its all statistics anyway, we can simply scale the event counts up
by the time-share they received.

2009-07-28 07:24:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Tue, 2009-07-28 at 02:18 +0200, Frederic Weisbecker wrote:
>
> I still don't understand why it is needed to reserve breakpoints for
> a group of monitored tasks. Once they have registered their breakpoints,
> the number of necessary hardware registers for these will be available
> every time the task is scheduled.

Its possible to create >4 breakpoint counters. We time-share them.

The only constraint we have is that those within a group should be on
the 'PMU' at the same time.

If you have no groups, a single available breakpoint slot should suffice
for any configuration (although more is better), but if you get groups
you have to guarantee your created group can run.

2009-07-28 14:05:09

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

* Peter Zijlstra ([email protected]) wrote:
> On Tue, 2009-07-28 at 03:03 +0200, Frederic Weisbecker wrote:
> > > Thirdly, we can multiplex perf counters beyond their hardware maximum,
> > > something you simply cannot do for a debug interface.
> >
> >
> > Again, I'm stuck in what you mean by multiplexing here :-)
>
> If you'd create say 16 breakpoint counters, we'd RR them over the 4
> available hardware breakpoints (or less when others are taken by someone
> else).
>
> Since its all statistics anyway, we can simply scale the event counts up
> by the time-share they received.
>

Although hw breakpoint and performance counters could be multiplexed to
be used in RR for profiling, I fear using these for tracing does not
necessarily fit well with these statistical artefacts.

Therefore I'd recommend leaving such bp RR as a feature separate from
a low-level breakpoint API, because it seem only useful to profilers.

I think we all have our very precise use-case in mind, and this is what
makes discussion so difficult. So, for my point of view (tracing), I'd
like to be able to set performance counters and breakpoints both
monitoring userspace and kernel. The userspace breakpoints would be
exchanged when a different process is scheduled.

I see 2 complementary ways to collect data into the trace:

- triggering an interrupt each X hardware events, trace this interrupt.
- reading the performance counter value at tracepoints identifying
execution context change (e.g. system call entry/exit, interrupt
entry/exit...)

As you see, none of these 2 ways to gather trace data allow statistical
RR. Therefore, having this feature in the low-level API does not seem to
make sense for tracing.

So if other use-cases could be explained, maybe we can factor out the
common interfaces.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-07-28 14:40:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Tue, 2009-07-28 at 10:04 -0400, Mathieu Desnoyers wrote:

> As you see, none of these 2 ways to gather trace data allow statistical
> RR. Therefore, having this feature in the low-level API does not seem to
> make sense for tracing.

I'm not asking for it to be available in the low level API. Quite
contrary, I'm saying that the current API is too high level and pretty
useless for perf counters.


2009-07-28 16:12:34

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Mon, Jul 27, 2009 at 10:53:37AM +0200, Peter Zijlstra wrote:
> On Sun, 2009-07-26 at 05:27 +0530, K.Prasad wrote:
>
> > I don't claim to have understood the requirements of perf-counters
> > fully, but I sense the persistence of a few doubts about the register
> > allocation mechanism of the hw-breakpoints API...thought it might be
> > worthwhile to briefly explain.
> >
> > A few notes about the same:
> >
> > Kernel-space breakpoints: They are system-wide i.e. one kernel-space
> > breakpoint request consumes one debug register on 'all' CPUs of the
> > system.
> > User-space breakpoints: Requests are stored in per-thread data
> > structures. Written onto the debug register only when scheduled-into
> > the CPU, and are removed when another task using the debug registers is
> > scheduled onto the CPU.
> > Debug register allocation mechanism: register request succeeds only when
> > the availability of a debug register is guaranteed (for both user and
> > kernel space). Allocation on a first-come-first-serve basis, no
> > priorities for register requests and hence no pre-emption of requests.
> >
> > In short the available debug registers are divided into:
> >
> > # of debug registers = # of kernel-space requests + MAX(# of all
> > user-space requests) + freely available debug registers.
> >
> > Thus on an x86 system with 4 debug registers, if there exists 1
> > kernel-space request it consumes one debug register on all CPUs in the
> > system. The remaining 3 debug registers can be consumed by user-space
> > i.e. upto 3 user-space requests per-thread.
> >
> > There is no restriction on how many debug registers can be consumed by
> > the kernel- or user-space requests and it is entirely dependent on the
> > number of free registers available for use then.
>
> Right, this is an utter miss-match for what perf counters would want.
>

> Firstly, you seem to have this weird split of kernel/userspace
> breakpoints. Perf counters looks at things in a per-cpu fashion, so the
> all-cpus kernel breakpoint stuff is useless. Also, from perf counters'
> POV its perfectly reasonable to have a per-task kernel breakpoint.
>

Although the existing implementation of hw-breakpoint API doesn't
support per-task kernel-space breakpoints, it isn't very difficult to
extend it to do so.

We could change the breakpoint infrastructure to something like this:

kernel-space breakpoints:
kernel-space addresses, system-wide i.e. on all CPUs, persist till explicit
unregistration, consume 1 debug register always.

New per-task breakpoints (i.e. modified user-space breakpoints):
accepts kernel- or user-space addresses, enabled per-task, consumes 1 debug
register (only when task is scheduled on the CPU), releases debug register
when yielding the CPU.

> Secondly, perf counters wants to schedule the per task breakpoints
> because we can optimize the context switch, saving lots of these MSR
> writes under some common scenarios.
>

perf counters can continue to schedule per-task breakpoints -
enabling/disabling a breakpoint would require a call to the
'register'/'unregister' interface and since it is per-cpu it is
light-weight when compared to system-wide breakpoints (that require IPIs
for propagation).

The common breakpoints can be identified and exempted from yielding the
debug registers (i.e. from the unregister-->register cycle) in the
perf-counter code.

As a side note, I'm not sure if extra-polating (linearly?) the debug
register's "hit counter" value is a good idea. While a function may cause
several 'write' operations on a variable (say due to a loop statement) for
once, it may not exhibit similar behaviour throughout the time-slice of the
program's execution. Scaling the values may lead to incorrect results.

> Thirdly, we can multiplex perf counters beyond their hardware maximum,
> something you simply cannot do for a debug interface.
>

I suppose that you are referring to the RR scheduling of breakpoints and
scaling of results? It can be achieved in the manner as explained above.

> Like I said, please use the raw per-cpu breakpoint interface for perf
> counters and connect that with the minimally required reservation you
> need to make your other thing work.
>
> You simply cannot put perf-counter breakpoints on top of whatever virt
> layer you created going by what you say it is.
>

One of the design goals of the hw-breakpoint API is to provide a layer
of arbitration between various consumers of the physical debug register.
We should be able to extend the API to meet the demands of new users
with unique requirements (if not supported already), and the description
above broadly describe them for perf-counters.

If agreeable, I'll submit a patch that modifies the user-space
breakpoints into a per-task one. Frederic may want to base the
perf-counter integration code on the proposed interface.

Let me know what your thoughts on the same.

Thanks,
K.Prasad

2009-07-28 16:38:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Tue, 2009-07-28 at 21:42 +0530, K.Prasad wrote:

> > Firstly, you seem to have this weird split of kernel/userspace
> > breakpoints. Perf counters looks at things in a per-cpu fashion, so the
> > all-cpus kernel breakpoint stuff is useless. Also, from perf counters'
> > POV its perfectly reasonable to have a per-task kernel breakpoint.
> >
>
> Although the existing implementation of hw-breakpoint API doesn't
> support per-task kernel-space breakpoints, it isn't very difficult to
> extend it to do so.
>
> We could change the breakpoint infrastructure to something like this:
>
> kernel-space breakpoints:
> kernel-space addresses, system-wide i.e. on all CPUs, persist till explicit
> unregistration, consume 1 debug register always.
>
> New per-task breakpoints (i.e. modified user-space breakpoints):
> accepts kernel- or user-space addresses, enabled per-task, consumes 1 debug
> register (only when task is scheduled on the CPU), releases debug register
> when yielding the CPU.

That still doesn't provide per-cpu breakpoints.

> > Secondly, perf counters wants to schedule the per task breakpoints
> > because we can optimize the context switch, saving lots of these MSR
> > writes under some common scenarios.
> >
>
> perf counters can continue to schedule per-task breakpoints -
> enabling/disabling a breakpoint would require a call to the
> 'register'/'unregister' interface and since it is per-cpu it is
> light-weight when compared to system-wide breakpoints (that require IPIs
> for propagation).
>
> The common breakpoints can be identified and exempted from yielding the
> debug registers (i.e. from the unregister-->register cycle) in the
> perf-counter code.

If you want to implement it that way.. looking for duplicates is bound
to result in something O(n^2), but with n=4 that's manageable.

Again, you seem to be missing per-cpu breakpoints.

> As a side note, I'm not sure if extra-polating (linearly?) the debug
> register's "hit counter" value is a good idea. While a function may cause
> several 'write' operations on a variable (say due to a loop statement) for
> once, it may not exhibit similar behaviour throughout the time-slice of the
> program's execution. Scaling the values may lead to incorrect results.

Sure, it won't be perfect, but if you assume the RR interval is
decoupled from the task you can get statistically relevant information.

> > Like I said, please use the raw per-cpu breakpoint interface for perf
> > counters and connect that with the minimally required reservation you
> > need to make your other thing work.
> >
> > You simply cannot put perf-counter breakpoints on top of whatever virt
> > layer you created going by what you say it is.
> >
>
> One of the design goals of the hw-breakpoint API is to provide a layer
> of arbitration between various consumers of the physical debug register.
> We should be able to extend the API to meet the demands of new users
> with unique requirements (if not supported already), and the description
> above broadly describe them for perf-counters.

Sure, but currently it does too much.

All you need for perf counter support is a per-cpu interface, no
per-task, no system-wide.

But you want to mix that up with your per-task interface, which will
complicate matters.


2009-07-29 00:36:16

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Tue, Jul 28, 2009 at 09:24:37AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-07-28 at 03:03 +0200, Frederic Weisbecker wrote:
> > > Thirdly, we can multiplex perf counters beyond their hardware maximum,
> > > something you simply cannot do for a debug interface.
> >
> >
> > Again, I'm stuck in what you mean by multiplexing here :-)
>
> If you'd create say 16 breakpoint counters, we'd RR them over the 4
> available hardware breakpoints (or less when others are taken by someone
> else).
>
> Since its all statistics anyway, we can simply scale the event counts up
> by the time-share they received.
>

Aah, ok I understand now.
But I fear it may kill the accuracy of the breakpoints statistics.
It's fine for a theoretical linear rate of breakpoint events.
But what happens if we are profiling something much more unstable
with a rain of hits in a small window of memory between large timeframes?
If we have a breakpoint inside this window of memory and this breakpoint
is not plugged, waiting for its turn in the RR, we loose this rain of events.

2009-07-29 06:37:21

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Tue, Jul 28, 2009 at 06:41:26PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-07-28 at 21:42 +0530, K.Prasad wrote:
>
> > > Firstly, you seem to have this weird split of kernel/userspace
> > > breakpoints. Perf counters looks at things in a per-cpu fashion, so the
> > > all-cpus kernel breakpoint stuff is useless. Also, from perf counters'
> > > POV its perfectly reasonable to have a per-task kernel breakpoint.
> > >
> >
> > Although the existing implementation of hw-breakpoint API doesn't
> > support per-task kernel-space breakpoints, it isn't very difficult to
> > extend it to do so.
> >
> > We could change the breakpoint infrastructure to something like this:
> >
> > kernel-space breakpoints:
> > kernel-space addresses, system-wide i.e. on all CPUs, persist till explicit
> > unregistration, consume 1 debug register always.
> >
> > New per-task breakpoints (i.e. modified user-space breakpoints):
> > accepts kernel- or user-space addresses, enabled per-task, consumes 1 debug
> > register (only when task is scheduled on the CPU), releases debug register
> > when yielding the CPU.
>
> That still doesn't provide per-cpu breakpoints.
>

Yes, it doesn't provide a per-cpu only implementation. One can obtain
the per-cpu data from the system-wide breakpoints by filtering it for a
given CPU (agreed, it will associated overhead).

A true per-cpu breakpoint implementation that co-exists with
system-wide and per-task breakpoints will be difficult. It might require
the re-introduction of some old features and a few new ones (like switching
between kernel and user-space breakpoints at syscall time) that were
rejected earlier by the community.

Also, the reason for a per-cpu only breakpoint (user and kernel-space)
isn't very obvious. While kernel variables can be read/written
throughout the system and user-space variables are per-task, the need
for obtaining per-cpu information isn't clear.

> > > Secondly, perf counters wants to schedule the per task breakpoints
> > > because we can optimize the context switch, saving lots of these MSR
> > > writes under some common scenarios.
> > >
> >
> > perf counters can continue to schedule per-task breakpoints -
> > enabling/disabling a breakpoint would require a call to the
> > 'register'/'unregister' interface and since it is per-cpu it is
> > light-weight when compared to system-wide breakpoints (that require IPIs
> > for propagation).
> >
> > The common breakpoints can be identified and exempted from yielding the
> > debug registers (i.e. from the unregister-->register cycle) in the
> > perf-counter code.
>
> If you want to implement it that way.. looking for duplicates is bound
> to result in something O(n^2), but with n=4 that's manageable.
>
> Again, you seem to be missing per-cpu breakpoints.
>
> > As a side note, I'm not sure if extra-polating (linearly?) the debug
> > register's "hit counter" value is a good idea. While a function may cause
> > several 'write' operations on a variable (say due to a loop statement) for
> > once, it may not exhibit similar behaviour throughout the time-slice of the
> > program's execution. Scaling the values may lead to incorrect results.
>
> Sure, it won't be perfect, but if you assume the RR interval is
> decoupled from the task you can get statistically relevant information.
>
> > > Like I said, please use the raw per-cpu breakpoint interface for perf
> > > counters and connect that with the minimally required reservation you
> > > need to make your other thing work.
> > >
> > > You simply cannot put perf-counter breakpoints on top of whatever virt
> > > layer you created going by what you say it is.
> > >
> >
> > One of the design goals of the hw-breakpoint API is to provide a layer
> > of arbitration between various consumers of the physical debug register.
> > We should be able to extend the API to meet the demands of new users
> > with unique requirements (if not supported already), and the description
> > above broadly describe them for perf-counters.
>
> Sure, but currently it does too much.
>
> All you need for perf counter support is a per-cpu interface, no
> per-task, no system-wide.
>
> But you want to mix that up with your per-task interface, which will
> complicate matters.
>

This is a little confusing. I'm trying to understand which of the
questions below does perf-counter try to answer. i) and ii) is what I
thought would be, and asking iii) doesn't make much sense. What do you
think?

i) Tell me how many times kernel variable 'x' was updated when task 'a'
was scheduled
ii) Tell me how many times kernel variable 'x' was updated on the system
since I registered for the breakpoint
iii) Tell me how many times kernel variable 'x' was updated on CPU 'n'

Thanks,
K.Prasad

2009-07-29 08:26:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Wed, 2009-07-29 at 02:36 +0200, Frederic Weisbecker wrote:
> On Tue, Jul 28, 2009 at 09:24:37AM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-07-28 at 03:03 +0200, Frederic Weisbecker wrote:
> > > > Thirdly, we can multiplex perf counters beyond their hardware maximum,
> > > > something you simply cannot do for a debug interface.
> > >
> > >
> > > Again, I'm stuck in what you mean by multiplexing here :-)
> >
> > If you'd create say 16 breakpoint counters, we'd RR them over the 4
> > available hardware breakpoints (or less when others are taken by someone
> > else).
> >
> > Since its all statistics anyway, we can simply scale the event counts up
> > by the time-share they received.
> >
>
> Aah, ok I understand now.
> But I fear it may kill the accuracy of the breakpoints statistics.
> It's fine for a theoretical linear rate of breakpoint events.
> But what happens if we are profiling something much more unstable
> with a rain of hits in a small window of memory between large timeframes?
> If we have a breakpoint inside this window of memory and this breakpoint
> is not plugged, waiting for its turn in the RR, we loose this rain of events.

You don't have to overload the breakpoint set, but we will if you create
more than there are hardware resources available.

If the RR period is independent of the application and significantly
shorter than the sample duration, statistics will be usable.

Sure, they'll not be perfect, but we can improve our confidence interval
by running longer. At some point it really doesn't matter anymore.

Take your example, of small bursts of events, due to the first
assumption - the RR period being independent of the application - we're
bound to see those busts in a proportional number of sample windows.

If the full sample duration is in the same order of the RR period, then
yes, you'll have a fair chance of missing the burst -- don't do that
then :-)

If you know you'll be running very very short, use attr.pinned, to
request a counter that'll not be multiplexed.


2009-07-29 09:20:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Wed, 2009-07-29 at 12:07 +0530, K.Prasad wrote:

> > That still doesn't provide per-cpu breakpoints.
> >
>
> Yes, it doesn't provide a per-cpu only implementation. One can obtain
> the per-cpu data from the system-wide breakpoints by filtering it for a
> given CPU (agreed, it will associated overhead).
>
> A true per-cpu breakpoint implementation that co-exists with
> system-wide and per-task breakpoints will be difficult. It might require
> the re-introduction of some old features and a few new ones (like switching
> between kernel and user-space breakpoints at syscall time) that were
> rejected earlier by the community.

I'm not clear on why you'd need to switch breakpoints on syscall entry.
You can simply leave the kernel address breakpoint around in userspace,
they're not able to poke at that address space anyway.

> Also, the reason for a per-cpu only breakpoint (user and kernel-space)
> isn't very obvious. While kernel variables can be read/written
> throughout the system and user-space variables are per-task, the need
> for obtaining per-cpu information isn't clear.

Well, suppose you're monitoring a per-cpu variable, or interested in the
effects of a workload confined to 1 cpu or node, there is no reason to
have this breakpoint on all cpus.

>
> This is a little confusing. I'm trying to understand which of the
> questions below does perf-counter try to answer. i) and ii) is what I
> thought would be, and asking iii) doesn't make much sense. What do you
> think?
>
> i) Tell me how many times kernel variable 'x' was updated when task 'a'
> was scheduled
> ii) Tell me how many times kernel variable 'x' was updated on the system
> since I registered for the breakpoint
> iii) Tell me how many times kernel variable 'x' was updated on CPU 'n'

It's 1 and 3. perf-counters doesn't deal with system wide for the
reasons given above.

If you want system wide, you can create a perf counter for each cpu.

But the larger the machine gets, the less likely it is that that is what
you're after. Being per-cpu allows you to do things on a
per-node/socket/etc.. basis which is impossible to do when you limit
yourself to per task or system wide.

2009-07-29 14:03:56

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

On Wed, Jul 29, 2009 at 10:28:52AM +0200, Peter Zijlstra wrote:
> On Wed, 2009-07-29 at 02:36 +0200, Frederic Weisbecker wrote:
> > On Tue, Jul 28, 2009 at 09:24:37AM +0200, Peter Zijlstra wrote:
> > > On Tue, 2009-07-28 at 03:03 +0200, Frederic Weisbecker wrote:
> > > > > Thirdly, we can multiplex perf counters beyond their hardware maximum,
> > > > > something you simply cannot do for a debug interface.
> > > >
> > > >
> > > > Again, I'm stuck in what you mean by multiplexing here :-)
> > >
> > > If you'd create say 16 breakpoint counters, we'd RR them over the 4
> > > available hardware breakpoints (or less when others are taken by someone
> > > else).
> > >
> > > Since its all statistics anyway, we can simply scale the event counts up
> > > by the time-share they received.
> > >
> >
> > Aah, ok I understand now.
> > But I fear it may kill the accuracy of the breakpoints statistics.
> > It's fine for a theoretical linear rate of breakpoint events.
> > But what happens if we are profiling something much more unstable
> > with a rain of hits in a small window of memory between large timeframes?
> > If we have a breakpoint inside this window of memory and this breakpoint
> > is not plugged, waiting for its turn in the RR, we loose this rain of events.
>
> You don't have to overload the breakpoint set, but we will if you create
> more than there are hardware resources available.
>
> If the RR period is independent of the application and significantly
> shorter than the sample duration, statistics will be usable.
>
> Sure, they'll not be perfect, but we can improve our confidence interval
> by running longer. At some point it really doesn't matter anymore.
>
> Take your example, of small bursts of events, due to the first
> assumption - the RR period being independent of the application - we're
> bound to see those busts in a proportional number of sample windows.
>
> If the full sample duration is in the same order of the RR period, then
> yes, you'll have a fair chance of missing the burst -- don't do that
> then :-)
>
> If you know you'll be running very very short, use attr.pinned, to
> request a counter that'll not be multiplexed.


Ok, I'm convinced :-)

2009-07-29 14:58:05

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/5] perfcounter: Add support for kernel hardware breakpoints

Em Wed, Jul 29, 2009 at 11:22:52AM +0200, Peter Zijlstra escreveu:
> On Wed, 2009-07-29 at 12:07 +0530, K.Prasad wrote:
>
> > > That still doesn't provide per-cpu breakpoints.
> > >
> >
> > Yes, it doesn't provide a per-cpu only implementation. One can obtain
> > the per-cpu data from the system-wide breakpoints by filtering it for a
> > given CPU (agreed, it will associated overhead).
> >
> > A true per-cpu breakpoint implementation that co-exists with
> > system-wide and per-task breakpoints will be difficult. It might require
> > the re-introduction of some old features and a few new ones (like switching
> > between kernel and user-space breakpoints at syscall time) that were
> > rejected earlier by the community.
>
> I'm not clear on why you'd need to switch breakpoints on syscall entry.
> You can simply leave the kernel address breakpoint around in userspace,
> they're not able to poke at that address space anyway.
>
> > Also, the reason for a per-cpu only breakpoint (user and kernel-space)
> > isn't very obvious. While kernel variables can be read/written
> > throughout the system and user-space variables are per-task, the need
> > for obtaining per-cpu information isn't clear.
>
> Well, suppose you're monitoring a per-cpu variable, or interested in the
> effects of a workload confined to 1 cpu or node, there is no reason to
> have this breakpoint on all cpus.

I was going to talk about exactly that, CPU isolation for one specific
workload, I don't want that hits on tcp_v4_rcv on another CPU get
counted, just the ones on that specific CPU.

The other CPUs can be given a break (pun intended :-P) by not having to
process traps we're uninterested in.

- Arnaldo