LinuxLists.cc - [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* [email protected] <[email protected]> wrote:

> +/*
> + * Handle debug exception notifications.
> + */
> +
> +int __kprobes hw_breakpoint_handler(struct die_args *args)
> +{
> + struct cpu_hw_breakpoint *chbi;
> + int i;
> + struct hw_breakpoint *bp;
> + struct thread_hw_breakpoint *thbi = NULL;
> +
> + /* The DR6 value is stored in args->err */
> +#define DR6 (args->err)

that's ugly - what's wrong with an old-fashioned "int db6 =
args->err" type of approach?

> +
> + if (DR6 & DR_STEP)
> + return NOTIFY_DONE;
> +
> + chbi = &per_cpu(cpu_bp, get_cpu());
> +
> + /* Disable all breakpoints so that the callbacks can run without
> + * triggering recursive debug exceptions.
> + */
> + set_debugreg(0UL, 7);
> +
> + /* Assert that local interrupts are disabled
> + * Reset the DRn bits in the virtualized register value.
> + * The ptrace trigger routine will add in whatever is needed.
> + */
> + current->thread.vdr6 &= ~(DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3);
> +
> + /* Are we a victim of lazy debug-register switching? */
> + if (!chbi->bp_task)
> + ;
> + else if (chbi->bp_task != current) {
> +
> + /* No user breakpoints are valid. Perform the belated
> + * debug-register switch.
> + */
> + switch_to_none_hw_breakpoint();
> + } else {
> + thbi = chbi->bp_task->thread.hw_breakpoint_info;
> + }
> +
> + /* Handle all the breakpoints that were triggered */
> + for (i = 0; i < HB_NUM; ++i) {
> + if (likely(!(DR6 & (DR_TRAP0 << i))))
> + continue;
> +
> + /* Find the corresponding hw_breakpoint structure and
> + * invoke its triggered callback.
> + */
> + if (i < chbi->cur_kbpdata->num_kbps)
> + bp = chbi->cur_kbpdata->bps[i];
> + else if (thbi)
> + bp = thbi->bps[i];
> + else /* False alarm due to lazy DR switching */
> + continue;
> + if (bp) {
> + switch (bp->info.type) {
> + case HW_BREAKPOINT_WRITE:
> + case HW_BREAKPOINT_RW:
> + if (bp->triggered)
> + (bp->triggered)(bp, args->regs);
> + /* Re-enable the breakpoints */
> + set_debugreg(thbi ? thbi->tkdr7 :
> + chbi->cur_kbpdata->mkdr7, 7);
> + put_cpu_no_resched();
> +
> + return NOTIFY_STOP;
> + /*
> + * Presently we allow instruction breakpoints only in
> + * user-space when requested through ptrace.
> + */
> + case HW_BREAKPOINT_EXECUTE:
> + if (arch_check_va_in_userspace(bp->info.address,
> + current)) {
> + (bp->triggered)(bp, args->regs);
> + /* We'll return NOTIFY_DONE, do_debug will take care of the rest */
> + return NOTIFY_DONE;
> + }
> + }

the linebreaks here became so ugly because the whole loop body
should be moved inside a helper function.

> +++ linux-2.6-tip.hbkpt/arch/x86/include/asm/hw_breakpoint.h
> @@ -0,0 +1,132 @@
> +#ifndef _I386_HW_BREAKPOINT_H
> +#define _I386_HW_BREAKPOINT_H
> +
> +#ifdef __KERNEL__
> +#define __ARCH_HW_BREAKPOINT_H
> +
> +struct arch_hw_breakpoint {
> + char *name; /* Contains name of the symbol to set bkpt */
> + unsigned long address;
> + u8 len;
> + u8 type;
> +} __attribute__((packed));

hm, why packed and why u8 ? We dont expose this to user-space,
do we? (if yes then 'unsigned long' is wrong and __KERNEL__ is
wrong as well)

> +#include <linux/kdebug.h>
> +#include <asm-generic/hw_breakpoint.h>
> +
> +/* HW breakpoint accessor routines */
> +static inline const void *hw_breakpoint_get_kaddress(struct hw_breakpoint *bp)
> +{
> + return (const void *) bp->info.address;
> +}
> +
> +static inline const void __user *hw_breakpoint_get_uaddress
> + (struct hw_breakpoint *bp)
> +{
> + return (const void __user *) bp->info.address;
> +}
> +
> +static inline unsigned hw_breakpoint_get_len(struct hw_breakpoint *bp)
> +{
> + return bp->info.len;
> +}
> +
> +static inline unsigned hw_breakpoint_get_type(struct hw_breakpoint *bp)
> +{
> + return bp->info.type;
> +}

why this redirection, why dont just use the structure as-is? If
there's any arch weirdness then that arch should have
arch-special accessors - not the generic code.

> +
> +/* Kernel symbol lookup routine for installing Data HW Breakpoint Address */
> +static inline unsigned long hw_breakpoint_lookup_name(const char *name)
> +{
> + return kallsyms_lookup_name(name);
> +}

A wrapper around kallsyms_lookup_name() is quite pointless -
pleae us kallsyms_lookup_name() drectly.

> +/* Per-thread HW breakpoint and debug register info */
> +struct thread_hw_breakpoint {
> +
> + /* utrace support */
> + struct list_head node; /* Entry in thread list */
> + struct list_head thread_bps; /* Thread's breakpoints */
> + struct hw_breakpoint *bps[HB_NUM]; /* Highest-priority bps */
> + unsigned long tdr[HB_NUM]; /* and their addresses */

Please rename it to something like ->hw_breakpoint[] and
->address[] - 'bps' and 'tdr' look quite meaningless.

> + int num_installed; /* Number of installed bps */
> + unsigned gennum; /* update-generation number */

i suspect the gennum we can get rid of if we get rid of the
notion of priorities, right?

Ingo

2009-03-10 14:59:20

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Tue, 10 Mar 2009, Ingo Molnar wrote:

> * [email protected] <[email protected]> wrote:
>
> > +/*
> > + * Handle debug exception notifications.
> > + */
> > +
> > +int __kprobes hw_breakpoint_handler(struct die_args *args)
> > +{
> > + struct cpu_hw_breakpoint *chbi;
> > + int i;
> > + struct hw_breakpoint *bp;
> > + struct thread_hw_breakpoint *thbi = NULL;
> > +
> > + /* The DR6 value is stored in args->err */
> > +#define DR6 (args->err)
>
> that's ugly - what's wrong with an old-fashioned "int db6 =
> args->err" type of approach?

Yes, it is ugly. It was a holdover from an earlier version, and in
fact it's likely to change in the future to become even more ugly. But
for now, you're right -- a simple assignment would be better.

> > +++ linux-2.6-tip.hbkpt/arch/x86/include/asm/hw_breakpoint.h
> > @@ -0,0 +1,132 @@
> > +#ifndef _I386_HW_BREAKPOINT_H
> > +#define _I386_HW_BREAKPOINT_H
> > +
> > +#ifdef __KERNEL__
> > +#define __ARCH_HW_BREAKPOINT_H
> > +
> > +struct arch_hw_breakpoint {
> > + char *name; /* Contains name of the symbol to set bkpt */
> > + unsigned long address;
> > + u8 len;
> > + u8 type;
> > +} __attribute__((packed));
>
> hm, why packed and why u8 ? We dont expose this to user-space,
> do we? (if yes then 'unsigned long' is wrong and __KERNEL__ is
> wrong as well)

I can't remember why this was made packed; there doesn't seem to be any
important reason for it. The structure is not exposed to userspace.
The len and type fields are u8 because they contain values no larger
than 255.

> > +#include <linux/kdebug.h>
> > +#include <asm-generic/hw_breakpoint.h>
> > +
> > +/* HW breakpoint accessor routines */
> > +static inline const void *hw_breakpoint_get_kaddress(struct hw_breakpoint *bp)
> > +{
> > + return (const void *) bp->info.address;
> > +}
> > +
> > +static inline const void __user *hw_breakpoint_get_uaddress
> > + (struct hw_breakpoint *bp)
> > +{
> > + return (const void __user *) bp->info.address;
> > +}
> > +
> > +static inline unsigned hw_breakpoint_get_len(struct hw_breakpoint *bp)
> > +{
> > + return bp->info.len;
> > +}
> > +
> > +static inline unsigned hw_breakpoint_get_type(struct hw_breakpoint *bp)
> > +{
> > + return bp->info.type;
> > +}
>
> why this redirection, why dont just use the structure as-is? If
> there's any arch weirdness then that arch should have
> arch-special accessors - not the generic code.

These _are_ the arch-specific accessors. Look at the filename:
arch/x86/include/asm/hw_breakpoint.h.

> > + int num_installed; /* Number of installed bps */
> > + unsigned gennum; /* update-generation number */
>
> i suspect the gennum we can get rid of if we get rid of the
> notion of priorities, right?

No. gennum has nothing to do with priorities.

Alan Stern

2009-03-10 15:19:18

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* Alan Stern <[email protected]> wrote:

> On Tue, 10 Mar 2009, Ingo Molnar wrote:
>
> > * [email protected] <[email protected]> wrote:
> >
> > > +/*
> > > + * Handle debug exception notifications.
> > > + */
> > > +
> > > +int __kprobes hw_breakpoint_handler(struct die_args *args)
> > > +{
> > > + struct cpu_hw_breakpoint *chbi;
> > > + int i;
> > > + struct hw_breakpoint *bp;
> > > + struct thread_hw_breakpoint *thbi = NULL;
> > > +
> > > + /* The DR6 value is stored in args->err */
> > > +#define DR6 (args->err)
> >
> > that's ugly - what's wrong with an old-fashioned "int db6 =
> > args->err" type of approach?
>
> Yes, it is ugly. It was a holdover from an earlier version, and in
> fact it's likely to change in the future to become even more ugly. But
> for now, you're right -- a simple assignment would be better.
>
> > > +++ linux-2.6-tip.hbkpt/arch/x86/include/asm/hw_breakpoint.h
> > > @@ -0,0 +1,132 @@
> > > +#ifndef _I386_HW_BREAKPOINT_H
> > > +#define _I386_HW_BREAKPOINT_H
> > > +
> > > +#ifdef __KERNEL__
> > > +#define __ARCH_HW_BREAKPOINT_H
> > > +
> > > +struct arch_hw_breakpoint {
> > > + char *name; /* Contains name of the symbol to set bkpt */
> > > + unsigned long address;
> > > + u8 len;
> > > + u8 type;
> > > +} __attribute__((packed));
> >
> > hm, why packed and why u8 ? We dont expose this to user-space,
> > do we? (if yes then 'unsigned long' is wrong and __KERNEL__ is
> > wrong as well)
>
> I can't remember why this was made packed; there doesn't seem to be any
> important reason for it. The structure is not exposed to userspace.
> The len and type fields are u8 because they contain values no larger
> than 255.
>
> > > +#include <linux/kdebug.h>
> > > +#include <asm-generic/hw_breakpoint.h>
> > > +
> > > +/* HW breakpoint accessor routines */
> > > +static inline const void *hw_breakpoint_get_kaddress(struct hw_breakpoint *bp)
> > > +{
> > > + return (const void *) bp->info.address;
> > > +}
> > > +
> > > +static inline const void __user *hw_breakpoint_get_uaddress
> > > + (struct hw_breakpoint *bp)
> > > +{
> > > + return (const void __user *) bp->info.address;
> > > +}
> > > +
> > > +static inline unsigned hw_breakpoint_get_len(struct hw_breakpoint *bp)
> > > +{
> > > + return bp->info.len;
> > > +}
> > > +
> > > +static inline unsigned hw_breakpoint_get_type(struct hw_breakpoint *bp)
> > > +{
> > > + return bp->info.type;
> > > +}
> >
> > why this redirection, why dont just use the structure as-is?
> > If there's any arch weirdness then that arch should have
> > arch-special accessors - not the generic code.
>
> These _are_ the arch-specific accessors. Look at the
> filename: arch/x86/include/asm/hw_breakpoint.h.

I very well know which file this is, you need to read my reply
again.

These are very generic-sounding fields and they should not be
hidden via pointless wrappers by the generic code. Dont let the
tail wag the dog. If there's architecture weirdness that does
not fit the generic code, then _that_ architecture should have
the ugliness - not the generic code. (note that these accessors
are used by the generic code so the uglification spreads there)

> > > + int num_installed; /* Number of installed bps */ +
> > > unsigned gennum; /* update-generation number */
> >
> > i suspect the gennum we can get rid of if we get rid of the
> > notion of priorities, right?
>
> No. gennum has nothing to do with priorities.

Well it's introduced because we have a priority-sorted list of
breakpoints not an array. A list needs to be maintained and when
updated it's reloaded. I was thinking about possibly getting rid
of that list complication and go back to the simpler array. But
it's hard because the lifetime of a kernel space breakpoint
spans context-switches so there has to be separation.

Ingo

2009-03-10 17:11:33

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Tue, 10 Mar 2009, Ingo Molnar wrote:

> > > why this redirection, why dont just use the structure as-is?
> > > If there's any arch weirdness then that arch should have
> > > arch-special accessors - not the generic code.
> >
> > These _are_ the arch-specific accessors. Look at the
> > filename: arch/x86/include/asm/hw_breakpoint.h.
>
> I very well know which file this is, you need to read my reply
> again.
>
> These are very generic-sounding fields and they should not be
> hidden via pointless wrappers by the generic code. Dont let the
> tail wag the dog. If there's architecture weirdness that does
> not fit the generic code, then _that_ architecture should have
> the ugliness - not the generic code. (note that these accessors
> are used by the generic code so the uglification spreads there)

Hm. I haven't been keeping careful track of all the updates Prasad has
been making. In my fairly-old copy of the hw-breakpoint work, the
accessors are _not_ used by the generic code. They are there for
future users of the API, not for internal use by the API itself. Is
there something I'm missing?

I have the feeling that this doesn't really address your comment, but
I'm not sure if that's because I don't understand your point or you
don't understand mine...

> These are very generic-sounding fields ...

Would you be happier if the field names were changed to be less
generic-sounding?

> > > > + int num_installed; /* Number of installed bps */ +
> > > > unsigned gennum; /* update-generation number */
> > >
> > > i suspect the gennum we can get rid of if we get rid of the
> > > notion of priorities, right?
> >
> > No. gennum has nothing to do with priorities.
>
> Well it's introduced because we have a priority-sorted list of
> breakpoints not an array.

More generally, it's there because kernel & userspace breakpoints can
be installed and uninstalled while a task is running -- and yes, this
is partially because breakpoints are prioritized. (Although it's worth
pointing out that even your suggestion of always prioritizing kernel
breakpoints above userspace breakpoints would have the same effect.)
However the fact that the breakpoints are stored in a list rather than
an array doesn't seem to be relevant.

> A list needs to be maintained and when
> updated it's reloaded.

The same is true of an array.

> I was thinking about possibly getting rid
> of that list complication and go back to the simpler array. But
> it's hard because the lifetime of a kernel space breakpoint
> spans context-switches so there has to be separation.

Yes, kernel breakpoints have to be kept separate from userspace
breakpoints. But even if you focus just on userspace breakpoints, you
still need to use a list because debuggers can try to register an
arbitrarily large number of breakpoints.

Alan Stern

2009-03-10 17:26:39

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* Alan Stern <[email protected]> wrote:

> On Tue, 10 Mar 2009, Ingo Molnar wrote:
>
> > > > why this redirection, why dont just use the structure as-is?
> > > > If there's any arch weirdness then that arch should have
> > > > arch-special accessors - not the generic code.
> > >
> > > These _are_ the arch-specific accessors. Look at the
> > > filename: arch/x86/include/asm/hw_breakpoint.h.
> >
> > I very well know which file this is, you need to read my reply
> > again.
> >
> > These are very generic-sounding fields and they should not be
> > hidden via pointless wrappers by the generic code. Dont let the
> > tail wag the dog. If there's architecture weirdness that does
> > not fit the generic code, then _that_ architecture should have
> > the ugliness - not the generic code. (note that these accessors
> > are used by the generic code so the uglification spreads there)
>
> Hm. I haven't been keeping careful track of all the updates
> Prasad has been making. In my fairly-old copy of the
> hw-breakpoint work, the accessors are _not_ used by the
> generic code. They are there for future users of the API, not
> for internal use by the API itself. Is there something I'm
> missing?

Right, they do seem unused at the moment. I was going over the
patches and this stuck out as wrong.

> I have the feeling that this doesn't really address your
> comment, but I'm not sure if that's because I don't understand
> your point or you don't understand mine...

Removing them addresses my comment.

> > These are very generic-sounding fields ...
>
> Would you be happier if the field names were changed to be
> less generic-sounding?

Not sure what to make of this kind of reply. This isnt about me
being happier. Generic-sounding accessors for generic-sounding
fields is an easily recognizable pattern for broken design.

> > > > > + int num_installed; /* Number of installed bps */ +
> > > > > unsigned gennum; /* update-generation number */
> > > >
> > > > i suspect the gennum we can get rid of if we get rid of the
> > > > notion of priorities, right?
> > >
> > > No. gennum has nothing to do with priorities.
> >
> > Well it's introduced because we have a priority-sorted list of
> > breakpoints not an array.
>
> More generally, it's there because kernel & userspace
> breakpoints can be installed and uninstalled while a task is
> running -- and yes, this is partially because breakpoints are
> prioritized. (Although it's worth pointing out that even your
> suggestion of always prioritizing kernel breakpoints above
> userspace breakpoints would have the same effect.) However
> the fact that the breakpoints are stored in a list rather than
> an array doesn't seem to be relevant.
>
> > A list needs to be maintained and when updated it's
> > reloaded.
>
> The same is true of an array.

Not if what we do what the previous code did: reloaded the full
array unconditionally. (it's just 4 entries)

> > I was thinking about possibly getting rid of that list
> > complication and go back to the simpler array. But it's hard
> > because the lifetime of a kernel space breakpoint spans
> > context-switches so there has to be separation.
>
> Yes, kernel breakpoints have to be kept separate from
> userspace breakpoints. But even if you focus just on
> userspace breakpoints, you still need to use a list because
> debuggers can try to register an arbitrarily large number of
> breakpoints.

That 'arbitrarily larg number of breakpoints' worries me. It's a
pretty broken concept for a 4-items resource that cannot be
time-shared and hence cannot be overcommitted.

Seems to me that much of the complexity of this patchset:

28 files changed, 2439 insertions(+), 199 deletions(-)

Could be eliminated via a very simple exclusive reservation
mechanism.

Ingo

2009-03-10 20:30:40

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Tue, 10 Mar 2009, Ingo Molnar wrote:

> > More generally, it's there because kernel & userspace
> > breakpoints can be installed and uninstalled while a task is
> > running -- and yes, this is partially because breakpoints are
> > prioritized. (Although it's worth pointing out that even your
> > suggestion of always prioritizing kernel breakpoints above
> > userspace breakpoints would have the same effect.) However
> > the fact that the breakpoints are stored in a list rather than
> > an array doesn't seem to be relevant.
> >
> > > A list needs to be maintained and when updated it's
> > > reloaded.
> >
> > The same is true of an array.
>
> Not if what we do what the previous code did: reloaded the full
> array unconditionally. (it's just 4 entries)

But that array still has to be set up somehow. It is private to the
task; the only logical place to set it up is when the CPU switches to
that task.

In the old code, it wasn't possible for task B or the kernel to
affect the contents of task A's debug registers. With hw-breakpoints
it _is_ possible, because the balance between debug registers allocated
to kernel breakpoints and debug registers allocated to userspace
breakpoints can change. That's why the additional complexity is
needed.

> > Yes, kernel breakpoints have to be kept separate from
> > userspace breakpoints. But even if you focus just on
> > userspace breakpoints, you still need to use a list because
> > debuggers can try to register an arbitrarily large number of
> > breakpoints.
>
> That 'arbitrarily larg number of breakpoints' worries me. It's a
> pretty broken concept for a 4-items resource that cannot be
> time-shared and hence cannot be overcommitted.

Suppose we never allow callers to register more breakpoints than will
fit in the CPU's registers. Do we then use a simple first-come
first-served algorithm, with no prioritization? If we do prioritize
some breakpoint registrations more highly than others, how do we inform
callers that their breakpoint has been kicked out by one of higher
priority? And how do we let them know when the higher-priority
breakpoint has been unregistered, so they can try again?

> Seems to me that much of the complexity of this patchset:
>
> 28 files changed, 2439 insertions(+), 199 deletions(-)
>
> Could be eliminated via a very simple exclusive reservation
> mechanism.

Can it really be as simple as all that?

Roland, what do you think?

Alan Stern

2009-03-11 12:13:10

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* Alan Stern <[email protected]> wrote:

> On Tue, 10 Mar 2009, Ingo Molnar wrote:
>
> > > More generally, it's there because kernel & userspace
> > > breakpoints can be installed and uninstalled while a task is
> > > running -- and yes, this is partially because breakpoints are
> > > prioritized. (Although it's worth pointing out that even your
> > > suggestion of always prioritizing kernel breakpoints above
> > > userspace breakpoints would have the same effect.) However
> > > the fact that the breakpoints are stored in a list rather than
> > > an array doesn't seem to be relevant.
> > >
> > > > A list needs to be maintained and when updated it's
> > > > reloaded.
> > >
> > > The same is true of an array.
> >
> > Not if what we do what the previous code did: reloaded the full
> > array unconditionally. (it's just 4 entries)
>
> But that array still has to be set up somehow. It is private
> to the task; the only logical place to set it up is when the
> CPU switches to that task.
>
> In the old code, it wasn't possible for task B or the kernel
> to affect the contents of task A's debug registers. With
> hw-breakpoints it _is_ possible, because the balance between
> debug registers allocated to kernel breakpoints and debug
> registers allocated to userspace breakpoints can change.
> That's why the additional complexity is needed.

Yes - but we dont really need any scheduler complexity for this.

An IPI is enough to reload debug registers in an affected task
(and calculate the real debug register layout) - and the next
context switches will pick up changes automatically.

Am i missing anything? I'm trying to find the design that has
the minimal possible complexity. (without killing any necessary
features)

> > > Yes, kernel breakpoints have to be kept separate from
> > > userspace breakpoints. But even if you focus just on
> > > userspace breakpoints, you still need to use a list
> > > because debuggers can try to register an arbitrarily large
> > > number of breakpoints.
> >
> > That 'arbitrarily large number of breakpoints' worries me.
> > It's a pretty broken concept for a 4-items resource that
> > cannot be time-shared and hence cannot be overcommitted.
>
> Suppose we never allow callers to register more breakpoints
> than will fit in the CPU's registers. Do we then use a simple
> first-come first-served algorithm, with no prioritization? If
> we do prioritize some breakpoint registrations more highly
> than others, how do we inform callers that their breakpoint
> has been kicked out by one of higher priority? And how do we
> let them know when the higher-priority breakpoint has been
> unregistered, so they can try again?

For an un-shareable resource like this (and this is really a
rare case [and we shouldnt even consider switching between user
and kernel debug registers at system call time]), the best
approach is to have a rigid reservation mechanism with clear,
hard, early failures in the overcommit case.

Silently breaking a user-space debugging sessions just because
the admin has a debug register based system-wide profiling
running, is pretty much the worst usage model. It does not give
user-space any idea about what happened - the breakpoints just
"dont work".

So i'd suggest a really simple scheme (depicted for x86 bug
applicable on other architectures too):

- we have a system-wide resource of 4 debug registers.

- kernel-side can allocate debug registers system-wide (it
takes effect on all CPUs, at once), up to 4 of them. The 5th
allocation will fail.

- user-side uses the ptrace APIs - and if it runs into the
limit, ptrace should return a failure.

There's the following special case: the kernel reserves a debug
register when there's tasks in the system that already have
reserved all debug registers. I.e. the constraint was not known
when the user-space session started, and the kernel violates it
afterwards.

There's a couple of choices here, with various scales of
conflict resolution:

1- silently override the user-space breakpoint

2- notify the user-space task via a signal - SIGXCPU or so.

3- reject the kernel-space allocation with a sufficiently
informative log message: "task 123 already uses 4 debug
registers, cannot allocate more kernel breakpoints" -
leaving the resolution of the conflict to the admin.

#1 isnt particularly good because it brings back a
'silentfailure' mode.

#2 might be too brutal: starting something innocous-looking
might kill a debug session. OTOH user-space debuggers could
catch the signal and inform the user.

#3 is probably the most informative (and hence probably the
best) variant. It also leaves policy of how to resolve the
conflict to the admin.

> > Seems to me that much of the complexity of this patchset:
> >
> > 28 files changed, 2439 insertions(+), 199 deletions(-)
> >
> > Could be eliminated via a very simple exclusive reservation
> > mechanism.
>
> Can it really be as simple as all that?

Would be nice to have it simple. Reluctance regarding this
patchset is mostly rooted in that diffstat above.

The changes it does in the x86 architecture code are nice
generalizations and cleanups. Both the scheduler, task
startup/exit and ptrace bits look pretty sane in terms of
factoring out debug register details. But the breakpoint
management looks very complex.

Ingo

2009-03-11 12:50:38

by K.Prasad

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, Mar 11, 2009 at 01:12:20PM +0100, Ingo Molnar wrote:
>
> * Alan Stern <[email protected]> wrote:
>
> > On Tue, 10 Mar 2009, Ingo Molnar wrote:
> >
> > > > More generally, it's there because kernel & userspace
> > > > breakpoints can be installed and uninstalled while a task is
> > > > running -- and yes, this is partially because breakpoints are
> > > > prioritized. (Although it's worth pointing out that even your
> > > > suggestion of always prioritizing kernel breakpoints above
> > > > userspace breakpoints would have the same effect.) However
> > > > the fact that the breakpoints are stored in a list rather than
> > > > an array doesn't seem to be relevant.
> > > >
> > > > > A list needs to be maintained and when updated it's
> > > > > reloaded.
> > > >
> > > > The same is true of an array.
> > >
> > > Not if what we do what the previous code did: reloaded the full
> > > array unconditionally. (it's just 4 entries)
> >
> > But that array still has to be set up somehow. It is private
> > to the task; the only logical place to set it up is when the
> > CPU switches to that task.
> >
> > In the old code, it wasn't possible for task B or the kernel
> > to affect the contents of task A's debug registers. With
> > hw-breakpoints it _is_ possible, because the balance between
> > debug registers allocated to kernel breakpoints and debug
> > registers allocated to userspace breakpoints can change.
> > That's why the additional complexity is needed.
>
> Yes - but we dont really need any scheduler complexity for this.
>
> An IPI is enough to reload debug registers in an affected task
> (and calculate the real debug register layout) - and the next
> context switches will pick up changes automatically.
>
> Am i missing anything? I'm trying to find the design that has
> the minimal possible complexity. (without killing any necessary
> features)
>
> > > > Yes, kernel breakpoints have to be kept separate from
> > > > userspace breakpoints. But even if you focus just on
> > > > userspace breakpoints, you still need to use a list
> > > > because debuggers can try to register an arbitrarily large
> > > > number of breakpoints.
> > >
> > > That 'arbitrarily large number of breakpoints' worries me.
> > > It's a pretty broken concept for a 4-items resource that
> > > cannot be time-shared and hence cannot be overcommitted.
> >
> > Suppose we never allow callers to register more breakpoints
> > than will fit in the CPU's registers. Do we then use a simple
> > first-come first-served algorithm, with no prioritization? If
> > we do prioritize some breakpoint registrations more highly
> > than others, how do we inform callers that their breakpoint
> > has been kicked out by one of higher priority? And how do we
> > let them know when the higher-priority breakpoint has been
> > unregistered, so they can try again?
>
> For an un-shareable resource like this (and this is really a
> rare case [and we shouldnt even consider switching between user
> and kernel debug registers at system call time]), the best
> approach is to have a rigid reservation mechanism with clear,
> hard, early failures in the overcommit case.
>
> Silently breaking a user-space debugging sessions just because
> the admin has a debug register based system-wide profiling
> running, is pretty much the worst usage model. It does not give
> user-space any idea about what happened - the breakpoints just
> "dont work".
>
> So i'd suggest a really simple scheme (depicted for x86 bug
> applicable on other architectures too):
>
> - we have a system-wide resource of 4 debug registers.
>
> - kernel-side can allocate debug registers system-wide (it
> takes effect on all CPUs, at once), up to 4 of them. The 5th
> allocation will fail.
>
> - user-side uses the ptrace APIs - and if it runs into the
> limit, ptrace should return a failure.
>
> There's the following special case: the kernel reserves a debug
> register when there's tasks in the system that already have
> reserved all debug registers. I.e. the constraint was not known
> when the user-space session started, and the kernel violates it
> afterwards.
>
> There's a couple of choices here, with various scales of
> conflict resolution:
>
> 1- silently override the user-space breakpoint
>
> 2- notify the user-space task via a signal - SIGXCPU or so.
>
> 3- reject the kernel-space allocation with a sufficiently
> informative log message: "task 123 already uses 4 debug
> registers, cannot allocate more kernel breakpoints" -
> leaving the resolution of the conflict to the admin.
>
> #1 isnt particularly good because it brings back a
> 'silentfailure' mode.
>
> #2 might be too brutal: starting something innocous-looking
> might kill a debug session. OTOH user-space debuggers could
> catch the signal and inform the user.
>
> #3 is probably the most informative (and hence probably the
> best) variant. It also leaves policy of how to resolve the
> conflict to the admin.
>

While reserving more discussions after Roland posts his views, I thought
I'd share some of mine here.

The present implementation can be likened to #3 except that the
uninstalled() callback is invoked (the user-space call through ptrace
takes a higher priority and evicts the kernel-space requests even now).

After the task using four debug registers yield the CPU, the
kernel-space breakpoint requests are 'restored' and installed() is
called again.

Even if #3 was implemented as described, we would still retain a
majority of the complexity in balance_kernel_vs_user() to check newer
tasks with requests for breakpoint registers.

Thanks,
K.Prasad

2009-03-11 13:10:57

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* K.Prasad <[email protected]> wrote:

> Even if #3 was implemented as described, we would still retain
> a majority of the complexity in balance_kernel_vs_user() to
> check newer tasks with requests for breakpoint registers.

Not if it's implemented in a really simple way:

Kernel gets debug registers in db4..db3..db2..db1 order, and its
allocation is essentially hardcoded - i.e. we dont try to be
fancy.

User-space (gdb) on the other hand will try to allocate in the
db1..db2..db3..db4 order.

Maintain a 'max debug register index' value driven by ptrace and
maintain a 'min debug register index' driven by kernel-space
hw-breakpoint allocations.

If they meet somewhere inbetween then we have overcommit which
we dont allow. In all other cases (which i proffer covers 100%
of the sane cases) they will mix amicably.

Sure, user-space can in principle do db4..db3..db2..db1
allocations as well, but it would be silly and GDB does not do
that.

So there's no real overlap between register usage - hence no
need for any complex scheduling smarts. Keep it simple first,
and only add complexity when it's justified.

[ for the special case of an architecture having just a single
debug register this will bring the expected behavior of either
allowing gdb to use the breakpoint or allowing the kernel to
use it. ]

Ingo

2009-03-11 16:32:31

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, 11 Mar 2009, Ingo Molnar wrote:

> > > Not if what we do what the previous code did: reloaded the full
> > > array unconditionally. (it's just 4 entries)
> >
> > But that array still has to be set up somehow. It is private
> > to the task; the only logical place to set it up is when the
> > CPU switches to that task.
> >
> > In the old code, it wasn't possible for task B or the kernel
> > to affect the contents of task A's debug registers. With
> > hw-breakpoints it _is_ possible, because the balance between
> > debug registers allocated to kernel breakpoints and debug
> > registers allocated to userspace breakpoints can change.
> > That's why the additional complexity is needed.
>
> Yes - but we dont really need any scheduler complexity for this.
>
> An IPI is enough to reload debug registers in an affected task
> (and calculate the real debug register layout) - and the next
> context switches will pick up changes automatically.
>
> Am i missing anything? I'm trying to find the design that has
> the minimal possible complexity. (without killing any necessary
> features)

I think you _are_ missing something, though it's not clear what.

"and the next context switches will pick up changes automatically" --
that may not be entirely right. Yes, the next context switch will pick
up the changes to DR1-4, but it won't necessarily pick up the changes
to DR7. However the details depend very much on how debug registers
are allocated; with no priorities or evictions much of the complexity
will disappear anyway.

> For an un-shareable resource like this (and this is really a
> rare case [and we shouldnt even consider switching between user
> and kernel debug registers at system call time]), the best
> approach is to have a rigid reservation mechanism with clear,
> hard, early failures in the overcommit case.
>
> Silently breaking a user-space debugging sessions just because
> the admin has a debug register based system-wide profiling
> running, is pretty much the worst usage model. It does not give
> user-space any idea about what happened - the breakpoints just
> "dont work".
>
> So i'd suggest a really simple scheme (depicted for x86 bug
> applicable on other architectures too):
>
> - we have a system-wide resource of 4 debug registers.
>
> - kernel-side can allocate debug registers system-wide (it
> takes effect on all CPUs, at once), up to 4 of them. The 5th
> allocation will fail.
>
> - user-side uses the ptrace APIs - and if it runs into the
> limit, ptrace should return a failure.

Roland, of course, is all in favor of making hw-breakpoints compatible
with utrace. The API should be flexible enough to encompass both
legacy ptrace and utrace.

> There's the following special case: the kernel reserves a debug
> register when there's tasks in the system that already have
> reserved all debug registers. I.e. the constraint was not known
> when the user-space session started, and the kernel violates it
> afterwards.

Right. Or the kernel tries to allocate 2 debug registers when
userspace has already allocated 3, and so on...

> There's a couple of choices here, with various scales of
> conflict resolution:
>
> 1- silently override the user-space breakpoint
>
> 2- notify the user-space task via a signal - SIGXCPU or so.
>
> 3- reject the kernel-space allocation with a sufficiently
> informative log message: "task 123 already uses 4 debug
> registers, cannot allocate more kernel breakpoints" -
> leaving the resolution of the conflict to the admin.

We can't necessarily assign a particular task to the debug registers
already in use. There might be more than one task using them. But of
course we can always just say that they are already in use, and if
necessary there could be a /proc interface with more information.

Besides, we have to be able to reject kernel breakpoint requests in any
case ("the 5th allocation will fail").

> #1 isnt particularly good because it brings back a
> 'silentfailure' mode.

Agreed.

> #2 might be too brutal: starting something innocous-looking
> might kill a debug session. OTOH user-space debuggers could
> catch the signal and inform the user.
>
> #3 is probably the most informative (and hence probably the
> best) variant. It also leaves policy of how to resolve the
> conflict to the admin.

AFAICS, #3 really is "first come, first served". What do you mean by
"policy of how to resolve the conflict"? It sounds like there are no
policy choices involved; whoever requests the debug register first will
get it.

> Would be nice to have it simple. Reluctance regarding this
> patchset is mostly rooted in that diffstat above.

I'd be happy to implement #3. Mostly it would just involve removing
code from the patches.

> The changes it does in the x86 architecture code are nice
> generalizations and cleanups. Both the scheduler, task
> startup/exit and ptrace bits look pretty sane in terms of
> factoring out debug register details. But the breakpoint
> management looks very complex.

Yes, there's no denying it. But I don't want to commit to any
particular changes without Roland's input.

Alan Stern

2009-03-11 16:39:43

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, 11 Mar 2009, K.Prasad wrote:

> The present implementation can be likened to #3 except that the
> uninstalled() callback is invoked (the user-space call through ptrace
> takes a higher priority and evicts the kernel-space requests even now).
>
> After the task using four debug registers yield the CPU, the
> kernel-space breakpoint requests are 'restored' and installed() is
> called again.

No, that is wrong. The kernel breakpoints do not get reinstalled until
the userspace breakpoints are unregistered. Merely yielding the CPU is
not sufficient.

> Even if #3 was implemented as described, we would still retain a
> majority of the complexity in balance_kernel_vs_user() to check newer
> tasks with requests for breakpoint registers.

Some complexity is certainly needed, because at all times we need to
know the maximum number of breakpoints requested by any user task.
The number of kernel breakpoints that can be allocated is limited to 4
minus this number.

Alan Stern

2009-03-11 17:41:38

by K.Prasad

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, Mar 11, 2009 at 12:32:19PM -0400, Alan Stern wrote:
> On Wed, 11 Mar 2009, Ingo Molnar wrote:
>
> > > > Not if what we do what the previous code did: reloaded the full
> > > > array unconditionally. (it's just 4 entries)
> > >
> > > But that array still has to be set up somehow. It is private
> > > to the task; the only logical place to set it up is when the
> > > CPU switches to that task.
> > >
> > > In the old code, it wasn't possible for task B or the kernel
> > > to affect the contents of task A's debug registers. With
> > > hw-breakpoints it _is_ possible, because the balance between
> > > debug registers allocated to kernel breakpoints and debug
> > > registers allocated to userspace breakpoints can change.
> > > That's why the additional complexity is needed.
> >
> > Yes - but we dont really need any scheduler complexity for this.
> >
> > An IPI is enough to reload debug registers in an affected task
> > (and calculate the real debug register layout) - and the next
> > context switches will pick up changes automatically.
> >
> > Am i missing anything? I'm trying to find the design that has
> > the minimal possible complexity. (without killing any necessary
> > features)
>
> I think you _are_ missing something, though it's not clear what.
>
> "and the next context switches will pick up changes automatically" --
> that may not be entirely right. Yes, the next context switch will pick
> up the changes to DR1-4, but it won't necessarily pick up the changes
> to DR7. However the details depend very much on how debug registers
> are allocated; with no priorities or evictions much of the complexity
> will disappear anyway.
>
> > For an un-shareable resource like this (and this is really a
> > rare case [and we shouldnt even consider switching between user
> > and kernel debug registers at system call time]), the best
> > approach is to have a rigid reservation mechanism with clear,
> > hard, early failures in the overcommit case.
> >
> > Silently breaking a user-space debugging sessions just because
> > the admin has a debug register based system-wide profiling
> > running, is pretty much the worst usage model. It does not give
> > user-space any idea about what happened - the breakpoints just
> > "dont work".
> >
> > So i'd suggest a really simple scheme (depicted for x86 bug
> > applicable on other architectures too):
> >
> > - we have a system-wide resource of 4 debug registers.
> >
> > - kernel-side can allocate debug registers system-wide (it
> > takes effect on all CPUs, at once), up to 4 of them. The 5th
> > allocation will fail.
> >
> > - user-side uses the ptrace APIs - and if it runs into the
> > limit, ptrace should return a failure.
>
> Roland, of course, is all in favor of making hw-breakpoints compatible
> with utrace. The API should be flexible enough to encompass both
> legacy ptrace and utrace.
>
> > There's the following special case: the kernel reserves a debug
> > register when there's tasks in the system that already have
> > reserved all debug registers. I.e. the constraint was not known
> > when the user-space session started, and the kernel violates it
> > afterwards.
>
> Right. Or the kernel tries to allocate 2 debug registers when
> userspace has already allocated 3, and so on...
>
> > There's a couple of choices here, with various scales of
> > conflict resolution:
> >
> > 1- silently override the user-space breakpoint
> >
> > 2- notify the user-space task via a signal - SIGXCPU or so.
> >
> > 3- reject the kernel-space allocation with a sufficiently
> > informative log message: "task 123 already uses 4 debug
> > registers, cannot allocate more kernel breakpoints" -
> > leaving the resolution of the conflict to the admin.
>
> We can't necessarily assign a particular task to the debug registers
> already in use. There might be more than one task using them. But of
> course we can always just say that they are already in use, and if
> necessary there could be a /proc interface with more information.
>
> Besides, we have to be able to reject kernel breakpoint requests in any
> case ("the 5th allocation will fail").
>
> > #1 isnt particularly good because it brings back a
> > 'silentfailure' mode.
>
> Agreed.
>
> > #2 might be too brutal: starting something innocous-looking
> > might kill a debug session. OTOH user-space debuggers could
> > catch the signal and inform the user.
> >
> > #3 is probably the most informative (and hence probably the
> > best) variant. It also leaves policy of how to resolve the
> > conflict to the admin.
>
> AFAICS, #3 really is "first come, first served". What do you mean by
> "policy of how to resolve the conflict"? It sounds like there are no
> policy choices involved; whoever requests the debug register first will
> get it.
>

With FCFS or an allocation mechanism without the (un)installed()
callbacks we'd lose the ability to record requests and service them
later when registers become availabile.

Say when (un)installed() callbacks are implemented for the proposed
ftrace-plugin to trace kernel symbols, they can automatically stop/start
tracing as and when registers become (un)available. This can be helpful when
we wish to profile memory access over a kernel variable for a long duration
(where small loss of tracing data can be tolerated), while the system would
permit simultaneous user-space access (say a GDB session using 'hbreak').

Are we fine with disallowing such usage, which if done will let the requester
of the breakpoint register 'poll' periodically to check availability.

Thanks,
K.Prasad

2009-03-12 02:47:48

by Roland McGrath

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

Perhaps it would help if asm-generic/hw_breakpoint.h had some kerneldoc
comments for the arch-specific functions that the arch's asm/hw_breakpoint.h
must define (in the style of asm-generic/syscall.h). I note that Ingo
didn't have any comments about asm-generic/hw_breakpoint.h in his review.
Its purpose should be to make any arch maintainer understand why the API it
specifies for each arch to meet makes sense across the arch's.

> why this redirection, why dont just use the structure as-is? If
> there's any arch weirdness then that arch should have
> arch-special accessors - not the generic code.

The fields of arch_hw_breakpoint are arch-specific. Another arch's
struct will not have .type and .len fields at all. e.g., on powerpc
there is just one size supported, so hw_breakpoint_get_len() would be an
inline returning a constant. Its type is encoded in low bits of the
address word, and the arch implementation may not want to use bit-field
called .type for that (and if it did, it couldn't use a bit-field called
.address with the meaning you'd want it to have).

Having any fields in arch_hw_breakpoint at all be part of the API
restricts the arch implementation unreasonably. So it has accessors to
fetch them instead. (Arguably we could punt those accessors from the
API for hw_breakpoint users, but the arch-independent part of the
hw_breakpoint implementation might still want them, I'm not sure.)
Likewise, they need to be filled in by setters or by explicit type/len
arguments to the registration calls. This appears to be a tenet we
worked out the first time around that has gotten lost in the shuffle
more recently.

I think it would be illustrative to have a second arch implementation to
compare to the x86 one. Ingo has a tendency to pretend everything is an
x86 until shown the concrete evidence. The obvious choice is powerpc.
Its facility is very simple, so the arch-specific part of the
implementation should be trivial--it's the "base case" of simplest
available hw_breakpoint arch, really. Also, it happens that Prasad's
employer is interested in having that support.

For example, a sensible powerpc implementation would clearly demonstrate
why you need accessors or at least either pre-registration setters or
explicit type/len arguments in registration calls.

Thanks,
Roland

2009-03-13 03:43:48

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* Roland McGrath <[email protected]> wrote:

> Perhaps it would help if asm-generic/hw_breakpoint.h had some
> kerneldoc comments for the arch-specific functions that the
> arch's asm/hw_breakpoint.h must define (in the style of
> asm-generic/syscall.h). I note that Ingo didn't have any
> comments about asm-generic/hw_breakpoint.h in his review. Its
> purpose should be to make any arch maintainer understand why
> the API it specifies for each arch to meet makes sense across
> the arch's.
>
> > why this redirection, why dont just use the structure as-is?
> > If there's any arch weirdness then that arch should have
> > arch-special accessors - not the generic code.
>
> The fields of arch_hw_breakpoint are arch-specific. Another
> arch's struct will not have .type and .len fields at all.
> e.g., on powerpc there is just one size supported, so
> hw_breakpoint_get_len() would be an inline returning a
> constant. Its type is encoded in low bits of the address
> word, and the arch implementation may not want to use
> bit-field called .type for that (and if it did, it couldn't
> use a bit-field called .address with the meaning you'd want it
> to have).
>
> Having any fields in arch_hw_breakpoint at all be part of the
> API restricts the arch implementation unreasonably. So it has
> accessors to fetch them instead. (Arguably we could punt
> those accessors from the API for hw_breakpoint users, but the
> arch-independent part of the hw_breakpoint implementation
> might still want them, I'm not sure.) Likewise, they need to
> be filled in by setters or by explicit type/len arguments to
> the registration calls. This appears to be a tenet we worked
> out the first time around that has gotten lost in the shuffle
> more recently.
>
> I think it would be illustrative to have a second arch
> implementation to compare to the x86 one. Ingo has a tendency
> to pretend everything is an x86 until shown the concrete
> evidence. The obvious choice is powerpc. Its facility is very
> simple, so the arch-specific part of the implementation should
> be trivial--it's the "base case" of simplest available
> hw_breakpoint arch, really. Also, it happens that Prasad's
> employer is interested in having that support.
>
> For example, a sensible powerpc implementation would clearly
> demonstrate why you need accessors or at least either
> pre-registration setters or explicit type/len arguments in
> registration calls.

That would help. I indeed have a tendency to strike out code
that's not immediately needed, i also tend to make sure that
design is sane on the platform that 95%+ of our active
developers/users use.

The core issue being discussed is the debug register allocation
and scheduling model though, and you have not directly commented
on that.

My argument in a nutshell is that a bottom-up for user +
top-down for kernel use static allocator with no dynamic
scheduling will get us most of the benefits with a tenth of the
complexity.

Ingo

2009-03-13 14:04:57

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Fri, 13 Mar 2009, Ingo Molnar wrote:

> The core issue being discussed is the debug register allocation
> and scheduling model though, and you have not directly commented
> on that.
>
> My argument in a nutshell is that a bottom-up for user +
> top-down for kernel use static allocator with no dynamic
> scheduling will get us most of the benefits with a tenth of the
> complexity.

Take this even farther: We shouldn't restrict userspace to bottom-up
register allocation. With very little additional effort we can
virtualize the debug registers; then userspace can allocate them in
whatever order it wants and still end up using the physical registers
in bottom-up order (or top-down, which is the order used by the current
patches).

After all, there's nothing to prevent programs other than gdb from
using ptrace, and there's no guarantee that gdb will continue to
allocate registers in increasing order.

Alan Stern

2009-03-13 14:13:54

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* Alan Stern <[email protected]> wrote:

> On Fri, 13 Mar 2009, Ingo Molnar wrote:
>
> > The core issue being discussed is the debug register
> > allocation and scheduling model though, and you have not
> > directly commented on that.
> >
> > My argument in a nutshell is that a bottom-up for user +
> > top-down for kernel use static allocator with no dynamic
> > scheduling will get us most of the benefits with a tenth of
> > the complexity.
>
> Take this even farther: We shouldn't restrict userspace to
> bottom-up register allocation. With very little additional
> effort we can virtualize the debug registers; then userspace
> can allocate them in whatever order it wants and still end up
> using the physical registers in bottom-up order (or top-down,
> which is the order used by the current patches).
>
> After all, there's nothing to prevent programs other than gdb
> from using ptrace, and there's no guarantee that gdb will
> continue to allocate registers in increasing order.

If in ~10 years of its existence no such usage arose so i dont
think it will magically appear now.

The thing is, kernel-side use of debug registers is a borderline
item whose impact we should minimalize as much as possible.
Linus in the past expressed that it is fine to not have _any_
management of user versus kernel debug registers. So we want to
approach this from the minimalistic side. I offered such a very
minimal design that is trivial in terms of correctness and
impact.

We can still get this simple allocation model into .30 if we
dont waste time arguing about unnecessarily. If someone runs
into limitations the model can be extended.

Ingo

2009-03-13 19:01:35

by K.Prasad

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Fri, Mar 13, 2009 at 03:13:04PM +0100, Ingo Molnar wrote:
>
> * Alan Stern <[email protected]> wrote:
>
> > On Fri, 13 Mar 2009, Ingo Molnar wrote:
> >
> > > The core issue being discussed is the debug register
> > > allocation and scheduling model though, and you have not
> > > directly commented on that.
> > >
> > > My argument in a nutshell is that a bottom-up for user +
> > > top-down for kernel use static allocator with no dynamic
> > > scheduling will get us most of the benefits with a tenth of
> > > the complexity.
> >
> > Take this even farther: We shouldn't restrict userspace to
> > bottom-up register allocation. With very little additional
> > effort we can virtualize the debug registers; then userspace
> > can allocate them in whatever order it wants and still end up
> > using the physical registers in bottom-up order (or top-down,
> > which is the order used by the current patches).
> >
> > After all, there's nothing to prevent programs other than gdb
> > from using ptrace, and there's no guarantee that gdb will
> > continue to allocate registers in increasing order.
>
> If in ~10 years of its existence no such usage arose so i dont
> think it will magically appear now.
>
> The thing is, kernel-side use of debug registers is a borderline
> item whose impact we should minimalize as much as possible.
> Linus in the past expressed that it is fine to not have _any_
> management of user versus kernel debug registers. So we want to
> approach this from the minimalistic side. I offered such a very
> minimal design that is trivial in terms of correctness and
> impact.
>
> We can still get this simple allocation model into .30 if we
> dont waste time arguing about unnecessarily. If someone runs
> into limitations the model can be extended.
>
> Ingo

Here's a summary of the intended changes to the patchset, which I hope
to post early the following week. It tears down many features in the
present submission (The write-up below is done without the benefit of
actually having run into limitations while trying to chisel out code).

- Adopt a static allocation method for registers, say FCFS (and perhaps
botton-up for user-space allocations and the reverse for
kernel-space), although individual counters to do book-keeping should also
suffice.

- Use an array of HB_NUM size for storing the breakpoint requests (and
not a linked-list implementation as done now).

- Define a HAVE_HW_BREAKPOINTS in arch/x86/Kconfig unconditionally, but
build kernel/hw_breakpoint.o, samples/hw_breakpoint/data_breakpoint.o
and kernel/trace/trace_ksym.o build conditionally if
HAVE_HW_BREAKPOINTS is declared. Declaring this flag will help
a)prevent build failures in other archs b)Prevent ftrace from showing
up availability of kernel symbol tracing even in unsupported archs.

- Simplify the switch_to_thread_hw_breakpoint() function (any help from
Alan Stern here would be gladly accepted).

- Remove callbacks such as unregister/register.

- remove the code to implement prioritisation of requests

- Add histogram support to include a 'hit counter' to the traced kernel
symbols.

- Address coding-style related comments.

Hope they are not in sync with the comments received thus far. Let me
know if there are changes to be made.

Thanks,
K.Prasad

2009-03-13 21:21:30

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Sat, 14 Mar 2009, K.Prasad wrote:

> Here's a summary of the intended changes to the patchset, which I hope
> to post early the following week. It tears down many features in the
> present submission (The write-up below is done without the benefit of
> actually having run into limitations while trying to chisel out code).
>
> - Adopt a static allocation method for registers, say FCFS (and perhaps
> botton-up for user-space allocations and the reverse for
> kernel-space), although individual counters to do book-keeping should also
> suffice.

You can't enforce bottom-up allocation for userspace breakpoint
requests. In fact, you'll have to add a parameter indicating which
debug register is requested. The ptrace interface will use this
parameter; the utrace interface won't care so it will specify something
like HW_BREAKPOINT_ANY_REGISTER.

You will have to add an array of HB_NUM counters, to keep track of how
many tasks are using each debug register.

> - Use an array of HB_NUM size for storing the breakpoint requests (and
> not a linked-list implementation as done now).
>
> - Define a HAVE_HW_BREAKPOINTS in arch/x86/Kconfig unconditionally, but
> build kernel/hw_breakpoint.o, samples/hw_breakpoint/data_breakpoint.o
> and kernel/trace/trace_ksym.o build conditionally if
> HAVE_HW_BREAKPOINTS is declared. Declaring this flag will help
> a)prevent build failures in other archs b)Prevent ftrace from showing
> up availability of kernel symbol tracing even in unsupported archs.

This isn't quite right. At the moment kernel/hw_breakpoint.c isn't
built at all; instead it is #included by the corresponding
arch-specific source file. Of course, you could change that.

> - Simplify the switch_to_thread_hw_breakpoint() function (any help from
> Alan Stern here would be gladly accepted).

Sure. It will depend on how you implement the other changes.

> - Remove callbacks such as unregister/register.
>
> - remove the code to implement prioritisation of requests

Remove the inline accessors. They can be added back when they are
needed.

Some architectures have arbitrary-length debug regions, not
fixed-length 1, 2, 4, or 8 bytes. We should give some thought to
making the interface compatible with such things.

> - Add histogram support to include a 'hit counter' to the traced kernel
> symbols.
>
> - Address coding-style related comments.
>
> Hope they are not in sync with the comments received thus far. Let me
> know if there are changes to be made.

Another change we need to change is the way DR6 is passed to the debug
notifier chain. Currently it is passed by value when do_debug()
calls notify_die(). Instead we need to pass it by reference so that
the notifier routines can change its value. Each time a notifier
routine handles a breakpoint event, the corresponding bit in DR6 should
be turned off.

Alan Stern

2009-03-14 03:41:52

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Tue, 2009-03-10 at 18:26 +0100, Ingo Molnar wrote:
>
> That 'arbitrarily larg number of breakpoints' worries me. It's a
> pretty broken concept for a 4-items resource that cannot be
> time-shared and hence cannot be overcommitted.
>
> Seems to me that much of the complexity of this patchset:
>
> 28 files changed, 2439 insertions(+), 199 deletions(-)
>
> Could be eliminated via a very simple exclusive reservation
> mechanism.
>
I also have some worries about the bloat of this infrastructure,
especially in the context switching code.

I would prefer the arch to be in control of the state in the task struct
and just context switch the actual HW registers at that stage.

Ben.

2009-03-14 03:43:00

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Tue, 2009-03-10 at 16:30 -0400, Alan Stern wrote:
> Suppose we never allow callers to register more breakpoints than will
> fit in the CPU's registers. Do we then use a simple first-come
> first-served algorithm, with no prioritization? If we do prioritize
> some breakpoint registrations more highly than others, how do we
> inform
> callers that their breakpoint has been kicked out by one of higher
> priority? And how do we let them know when the higher-priority
> breakpoint has been unregistered, so they can try again?

Do we really need such a mess ? Honestly ... We've been living fine
before without any of that.

Ben.

2009-03-14 03:44:48

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, 2009-03-11 at 13:12 +0100, Ingo Molnar wrote:
>
> #3 is probably the most informative (and hence probably the
> best) variant. It also leaves policy of how to resolve the
> conflict to the admin.

Agreed.
>
> Would be nice to have it simple. Reluctance regarding this
> patchset is mostly rooted in that diffstat above.
>
> The changes it does in the x86 architecture code are nice
> generalizations and cleanups. Both the scheduler, task
> startup/exit and ptrace bits look pretty sane in terms of
> factoring out debug register details. But the breakpoint
> management looks very complex

I agree there is some interest in generalization and cleanup, especially
as far as userspace APIs go, though it's a hard nut to crack as every
CPU family out there has some subtle differences in the way breakpoints
or watchpoints work (for example, alignment constraints, ability to do
ranges, the way they handle kernel vs. user, etc...)

I'm not yet sold.

Cheers,
Ben.

2009-03-14 03:47:21

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, 2009-03-11 at 14:10 +0100, Ingo Molnar wrote:
>
> Kernel gets debug registers in db4..db3..db2..db1 order, and its
> allocation is essentially hardcoded - i.e. we dont try to be
> fancy.
>
> User-space (gdb) on the other hand will try to allocate in the
> db1..db2..db3..db4 order.
>
> Maintain a 'max debug register index' value driven by ptrace and
> maintain a 'min debug register index' driven by kernel-space
> hw-breakpoint allocations.

A few added details from the perspective of powerpc ...

breakpoints and watchpoints are separate resources with different
capacity depending on the chip, so far nothing fancy.

We also have the ability to do range breakpoints/watchpoints on some
processors by using pairs of registers, which adds some constraints to
the allocation.

We also have a value compare capability for watchpoint, but this can
also have a different capacity limitation from either the breakpoints
and the watchpoints themselves.

Cheers,
Ben.

2009-03-14 03:48:49

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, 2009-03-11 at 23:11 +0530, K.Prasad wrote:
> With FCFS or an allocation mechanism without the (un)installed()
> callbacks we'd lose the ability to record requests and service them
> later when registers become availabile.
>
> Say when (un)installed() callbacks are implemented for the proposed
> ftrace-plugin to trace kernel symbols, they can automatically stop/start
> tracing as and when registers become (un)available. This can be helpful when
> we wish to profile memory access over a kernel variable for a long duration
> (where small loss of tracing data can be tolerated), while the system would
> permit simultaneous user-space access (say a GDB session using 'hbreak').
>
> Are we fine with disallowing such usage, which if done will let the requester
> of the breakpoint register 'poll' periodically to check availability.

Is that such a big deal ? Can't we just have the kernel degrade to
classic SW breakpoints ?

Smells like overengineering to me ...

Ben.

2009-03-14 03:52:37

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Wed, 2009-03-11 at 19:46 -0700, Roland McGrath wrote:
>
> I think it would be illustrative to have a second arch implementation to
> compare to the x86 one. Ingo has a tendency to pretend everything is an
> x86 until shown the concrete evidence. The obvious choice is powerpc.
> Its facility is very simple, so the arch-specific part of the
> implementation should be trivial--it's the "base case" of simplest
> available hw_breakpoint arch, really. Also, it happens that Prasad's
> employer is interested in having that support.
>
> For example, a sensible powerpc implementation would clearly demonstrate
> why you need accessors or at least either pre-registration setters or
> explicit type/len arguments in registration calls.

Well, we happen to be just in the middle of implementing support for
BookE HW debug facilities :-) (which have more HW breakpoints &
watchpoints than server PPCs along with fancy features like ranged
breakpoints or value compare) so it's a right time to give that a try.

I'm Ccing David Gibson and Torez Smith who both have been working on the
infrastructure to control the debug regs. For now we are just giving
pretty much direct access to the debug regs from ptrace (since they are
somewhat architected they are very similar if not identical between a
whole bunch of embedded powerpc's) but a more abstract interface would
be nice.

Ben.

2009-03-14 12:25:25

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

* Alan Stern <[email protected]> wrote:

> On Sat, 14 Mar 2009, K.Prasad wrote:
>
> > Here's a summary of the intended changes to the patchset, which I hope
> > to post early the following week. It tears down many features in the
> > present submission (The write-up below is done without the benefit of
> > actually having run into limitations while trying to chisel out code).
> >
> > - Adopt a static allocation method for registers, say FCFS (and perhaps
> > botton-up for user-space allocations and the reverse for
> > kernel-space), although individual counters to do book-keeping should also
> > suffice.
>
> You can't enforce bottom-up allocation for userspace breakpoint
> requests. [...]

That's not the point.

The point is to offer a reasonable and simple static allocator
that will work fine with usual gdb usage. If something takes
away db4 that's as if user-space took away all registers - tough
luck.

You are trying to put complexity into a situation that is not
schedulable hence not resolvable _anyway_. There's just 4 debug
registers, not more. If the combined usage goes above four
someone will lose anyway - even with your allocator.

With my proposal the 'loss' can indeed come sooner if user-space
took db4 and there's nothing left for the kernel anymore - but
that's just an uninteresting special case that wont occur with
typical debug-register usage.

If it ever causes problems seriously _then_ will be the time to
consider "is it worth adding a more complex, dynamic allocator
for debug registers". Not now. This stuff is currently
over-designed and not acceptable to me in its current form.

Ingo

2009-03-14 16:11:04

[permalink] [raw]

Subject: Re: [patch 02/11] x86 architecture implementation of Hardware Breakpoint interfaces

On Sat, 14 Mar 2009, Ingo Molnar wrote:

>
> * Alan Stern <[email protected]> wrote:
>
> > On Sat, 14 Mar 2009, K.Prasad wrote:
> >
> > > Here's a summary of the intended changes to the patchset, which I hope
> > > to post early the following week. It tears down many features in the
> > > present submission (The write-up below is done without the benefit of
> > > actually having run into limitations while trying to chisel out code).
> > >
> > > - Adopt a static allocation method for registers, say FCFS (and perhaps
> > > botton-up for user-space allocations and the reverse for
> > > kernel-space), although individual counters to do book-keeping should also
> > > suffice.
> >
> > You can't enforce bottom-up allocation for userspace breakpoint
> > requests. [...]
>
> That's not the point.
>
> The point is to offer a reasonable and simple static allocator
> that will work fine with usual gdb usage. If something takes
> away db4 that's as if user-space took away all registers - tough
> luck.
>
> You are trying to put complexity into a situation that is not
> schedulable hence not resolvable _anyway_. There's just 4 debug
> registers, not more. If the combined usage goes above four
> someone will lose anyway - even with your allocator.

You are reading far more into my message than what I wrote.

I'm _not_ trying to put complexity anywhere. All I did was point out
that Prasad was wrong to state that the kernel could adopt (or enforce)
a bottom-up method for allocating debug registers for userspace
breakpoints. I trust you aren't trying to imply that he really was
right?

> With my proposal the 'loss' can indeed come sooner if user-space
> took db4 and there's nothing left for the kernel anymore - but
> that's just an uninteresting special case that wont occur with
> typical debug-register usage.
>
> If it ever causes problems seriously _then_ will be the time to
> consider "is it worth adding a more complex, dynamic allocator
> for debug registers". Not now. This stuff is currently
> over-designed and not acceptable to me in its current form.

My message didn't mention a word about more complex, dynamic
allocation. Just the opposite, in fact -- because if we did virtualize
the debug registers then we _would_ be able to enforce bottom-up
allocation.

So in the end, you're _agreeing_ with what I wrote. And yet the tone
of your reply suggests that you seemed to think that my message had
some deep, hostile intent. It didn't.

Alan Stern

2009-03-14 16:40:26