2020-11-12 07:50:34

by Dmitry Vyukov

Subject: Process-wide watchpoints

Hello perf maintainers,

I have a wish for a particular piece of kernel functionality related to
watchpoints, and I would appreciate it if you could say how
feasible/complex it would be to add (mostly gluing together existing
infra pieces, or redesigning and adding lots of new code), or maybe it
already exists and I am missing it.

You can think of the functionality as mprotect(PROT_NONE), but for only a
few bytes, implemented with watchpoints. On access, the accessing thread
should receive a signal (similar to SIGSEGV). Kernel copy_to/from_user
should not be affected (no EFAULT); I think this is already satisfied
for watchpoints. This functionality is also intended for production
environments (if you are interested -- for sampling race detection). The
number of threads in the process can be up to, say, ~10K, and the
watchpoint is intended to be set for a very brief period of time
(~a few ms).

This can be done today with both perf_event_open and ptrace.
However, the problem is that both APIs work at the level of a single
thread (perf_event_open events can be inherited by newly created
children, but not attached to existing siblings, as far as I can tell).
So doing this would require iterating over, say, 10K threads, calling
perf_event_open, F_SETOWN and F_SETSIG for each, closing them later, and
consuming 40K file descriptors.
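
For concreteness, a rough sketch of that per-thread scheme (not a drop-in
implementation: the watch_thread() helper name, the 8-byte read/write
watch and the use of PERF_EVENT_IOC_REFRESH to arm the signal are all
illustrative assumptions; thread iteration over /proc/self/task and error
handling are omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

/* Set one hardware watchpoint on one thread; SIGTRAP is routed to the
 * watched thread itself via F_SETOWN_EX/F_SETSIG. */
static int watch_thread(pid_t tid, void *addr)
{
	struct perf_event_attr attr;
	struct f_owner_ex own = { .type = F_OWNER_TID, .pid = tid };
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.size = sizeof(attr);
	attr.bp_type = HW_BREAKPOINT_RW;
	attr.bp_addr = (unsigned long)addr;
	attr.bp_len = HW_BREAKPOINT_LEN_8;
	attr.sample_period = 1;
	attr.disabled = 1;		/* armed below via IOC_REFRESH */

	fd = syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
	if (fd < 0)
		return -1;

	fcntl(fd, F_SETOWN_EX, &own);
	fcntl(fd, F_SETSIG, SIGTRAP);
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
	ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);	/* enable, signal on next hit */
	return fd;	/* caller must close() this later */
}

The point of the sketch is the per-thread cost: one perf_event_open()
plus several fcntl()s and a later close() for every thread in the
process.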

What I would like to have is a single syscall that does all of this for
the whole process (either sending IPIs to currently running siblings,
or maybe activating it only on the next sched-in).

I see at least one potential problem: what do we do if some sibling
thread has already consumed all 4 watchpoints? We don't necessarily
want to iterate over all 10K threads synchronously, nor do we even want
to fail in this case. In the intended use case this feature will be
essentially the only consumer of watchpoints, so all threads will have
the same number of available watchpoints. So perhaps removal of the
watchpoint could just report whether some threads were unable to
install it.

Does this make sense? How feasible/complex would it be to add?

Thanks in advance


2020-11-12 10:33:39

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Nov 12, 2020 at 08:46:23AM +0100, Dmitry Vyukov wrote:

> for sampling race detection),
> number of threads in the process can be up to, say, ~~10K and the
> watchpoint is intended to be set for a very brief period of time
> (~~few ms).

Performance is a consideration here; doing lots of IPIs in such a short
window, on potentially large machines, is a DoS risk.

> This can be done today with both perf_event_open and ptrace.
> However, the problem is that both APIs work on a single thread level
> (? perf_event_open can be inherited by children, but not for existing
> siblings). So doing this would require iterating over, say, 10K

One way would be to create the event before the process starts spawning
threads and keep it disabled. Then every thread will inherit it, but
it'll be inactive.

> I see at least one potential problem: what do we do if some sibling
> thread already has all 4 watchpoints consumed?

That would be immediately avoided by this, since each thread will have
the watchpoint reserved by virtue of inheriting the event.

Then you can do ioctl(PERF_EVENT_IOC_{MODIFY_ATTRIBUTES,ENABLE,DISABLE})
to update the watch location and enable/disable it. This _will_ indeed
result in a shitload of IPIs if the threads are active, but it should
work.
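
A rough sketch of that scheme, under a few assumptions (the event is
created by the main thread before any pthread_create(); the helper names,
the dummy initial watch target and the watch lengths are illustrative;
error handling is omitted). Note that, as it turns out later in this
thread, PERF_EVENT_IOC_{ENABLE,DISABLE} do propagate to inherited child
events, but PERF_EVENT_IOC_MODIFY_ATTRIBUTES currently does not, so the
retarget step needs a kernel change:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

static long dummy;	/* placeholder watch target for the initial setup */

/* Call once, before spawning threads: inherited by every child, but disabled. */
static int open_inherited_watchpoint(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.size = sizeof(attr);
	attr.bp_type = HW_BREAKPOINT_W;
	attr.bp_addr = (unsigned long)&dummy;
	attr.bp_len = HW_BREAKPOINT_LEN_1;
	attr.inherit = 1;	/* every new thread inherits the event ... */
	attr.disabled = 1;	/* ... but it stays inactive for now */

	/* pid == 0, cpu == -1: follow this task wherever it runs. */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Retarget the watchpoint and arm it for a brief window, then disarm. */
static void watch_briefly(int fd, void *addr)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.size = sizeof(attr);
	attr.bp_type = HW_BREAKPOINT_RW;
	attr.bp_addr = (unsigned long)addr;
	attr.bp_len = HW_BREAKPOINT_LEN_8;

	ioctl(fd, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, &attr);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... sampling window of a few ms ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
}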

2020-11-12 10:47:57

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Thu, Nov 12, 2020 at 11:31 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Nov 12, 2020 at 08:46:23AM +0100, Dmitry Vyukov wrote:
>
> > for sampling race detection),
> > number of threads in the process can be up to, say, ~~10K and the
> > watchpoint is intended to be set for a very brief period of time
> > (~~few ms).
>
> Performance is a consideration here, doing lots of IPIs in such a short
> window, on potentially large machines is a DoS risk.
>
> > This can be done today with both perf_event_open and ptrace.
> > However, the problem is that both APIs work on a single thread level
> > (? perf_event_open can be inherited by children, but not for existing
> > siblings). So doing this would require iterating over, say, 10K
>
> One way would be to create the event before the process starts spawning
> threads and keeping it disabled. Then every thread will inherit it, but
> it'll be inactive.
>
> > I see at least one potential problem: what do we do if some sibling
> > thread already has all 4 watchpoints consumed?
>
> That would be immediately avoided by this, since it will have the
> watchpoint reserved per inheriting the event.
>
> Then you can do ioctl(PERF_EVENT_IOC_{MODIFY_ATTRIBUTES,ENABLE,DISABLE})
> to update the watch location and enable/disable it. This _will_ indeed
> result in a shitload of IPIs if the threads are active, but it should
> work.

Aha! That's the possibility I missed.
We will try to prototype this and get back with more questions if/when
we have them.
Thanks!

2021-01-31 11:39:11

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Thu, Nov 12, 2020 at 11:43 AM Dmitry Vyukov <[email protected]> wrote:
> > > for sampling race detection),
> > > number of threads in the process can be up to, say, ~~10K and the
> > > watchpoint is intended to be set for a very brief period of time
> > > (~~few ms).
> >
> > Performance is a consideration here, doing lots of IPIs in such a short
> > window, on potentially large machines is a DoS risk.
> >
> > > This can be done today with both perf_event_open and ptrace.
> > > However, the problem is that both APIs work on a single thread level
> > > (? perf_event_open can be inherited by children, but not for existing
> > > siblings). So doing this would require iterating over, say, 10K
> >
> > One way would be to create the event before the process starts spawning
> > threads and keeping it disabled. Then every thread will inherit it, but
> > it'll be inactive.
> >
> > > I see at least one potential problem: what do we do if some sibling
> > > thread already has all 4 watchpoints consumed?
> >
> > That would be immediately avoided by this, since it will have the
> > watchpoint reserved per inheriting the event.
> >
> > Then you can do ioctl(PERF_EVENT_IOC_{MODIFY_ATTRIBUTES,ENABLE,DISABLE})
> > to update the watch location and enable/disable it. This _will_ indeed
> > result in a shitload of IPIs if the threads are active, but it should
> > work.
>
> Aha! That's the possibility I missed.
> We will try to prototype this and get back with more questions if/when
> we have them.
> Thanks!

Hi Peter,

I've tested this approach and it works, but only halfway.
PERF_EVENT_IOC_{ENABLE,DISABLE} work as advertised.
However, PERF_EVENT_IOC_MODIFY_ATTRIBUTES does not work for inherited
child events.
Does something like this make any sense to you? Would you be willing to
accept such a change?

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 55d18791a72d..f6974807a32c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3174,7 +3174,7 @@ int perf_event_refresh(struct perf_event *event, int refresh)
 }
 EXPORT_SYMBOL_GPL(perf_event_refresh);

-static int perf_event_modify_breakpoint(struct perf_event *bp,
+static int _perf_event_modify_breakpoint(struct perf_event *bp,
 					struct perf_event_attr *attr)
 {
 	int err;
@@ -3189,6 +3189,28 @@ static int perf_event_modify_breakpoint(struct perf_event *bp,
 	return err;
 }

+static int perf_event_modify_breakpoint(struct perf_event *bp,
+					struct perf_event_attr *attr)
+{
+	struct perf_event *child;
+	int err;
+
+	WARN_ON_ONCE(bp->ctx->parent_ctx);
+
+	mutex_lock(&bp->child_mutex);
+	err = _perf_event_modify_breakpoint(bp, attr);
+	if (err)
+		goto unlock;
+	list_for_each_entry(child, &bp->child_list, child_list) {
+		err = _perf_event_modify_breakpoint(child, attr);
+		if (err)
+			goto unlock;
+	}
+unlock:
+	mutex_unlock(&bp->child_mutex);
+	return err;
+}
+
 static int perf_event_modify_attr(struct perf_event *event,
 				  struct perf_event_attr *attr)

2021-01-31 12:09:27

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Sun, Jan 31, 2021 at 11:04 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Thu, Nov 12, 2020 at 11:43 AM Dmitry Vyukov <[email protected]> wrote:
> > > > for sampling race detection),
> > > > number of threads in the process can be up to, say, ~~10K and the
> > > > watchpoint is intended to be set for a very brief period of time
> > > > (~~few ms).
> > >
> > > Performance is a consideration here, doing lots of IPIs in such a short
> > > window, on potentially large machines is a DoS risk.
> > >
> > > > This can be done today with both perf_event_open and ptrace.
> > > > However, the problem is that both APIs work on a single thread level
> > > > (? perf_event_open can be inherited by children, but not for existing
> > > > siblings). So doing this would require iterating over, say, 10K
> > >
> > > One way would be to create the event before the process starts spawning
> > > threads and keeping it disabled. Then every thread will inherit it, but
> > > it'll be inactive.
> > >
> > > > I see at least one potential problem: what do we do if some sibling
> > > > thread already has all 4 watchpoints consumed?
> > >
> > > That would be immediately avoided by this, since it will have the
> > > watchpoint reserved per inheriting the event.
> > >
> > > Then you can do ioctl(PERF_EVENT_IOC_{MODIFY_ATTRIBUTES,ENABLE,DISABLE})
> > > to update the watch location and enable/disable it. This _will_ indeed
> > > result in a shitload of IPIs if the threads are active, but it should
> > > work.
> >
> > Aha! That's the possibility I missed.
> > We will try to prototype this and get back with more questions if/when
> > we have them.
> > Thanks!
>
> Hi Peter,
>
> I've tested this approach and it works, but only in half.
> PERF_EVENT_IOC_{ENABLE,DISABLE} work as advertised.
> However, PERF_EVENT_IOC_MODIFY_ATTRIBUTES does not work for inherited
> child events.
> Does something like this make any sense to you? Are you willing to
> accept such change?
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 55d18791a72d..f6974807a32c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3174,7 +3174,7 @@ int perf_event_refresh(struct perf_event *event,
> int refresh)
> }
> EXPORT_SYMBOL_GPL(perf_event_refresh);
>
> -static int perf_event_modify_breakpoint(struct perf_event *bp,
> +static int _perf_event_modify_breakpoint(struct perf_event *bp,
> struct perf_event_attr *attr)
> {
> int err;
> @@ -3189,6 +3189,28 @@ static int perf_event_modify_breakpoint(struct
> perf_event *bp,
> return err;
> }
>
> +static int perf_event_modify_breakpoint(struct perf_event *bp,
> + struct perf_event_attr *attr)
> +{
> + struct perf_event *child;
> + int err;
> +
> + WARN_ON_ONCE(bp->ctx->parent_ctx);
> +
> + mutex_lock(&bp->child_mutex);
> + err = _perf_event_modify_breakpoint(bp, attr);
> + if (err)
> + goto unlock;
> + list_for_each_entry(child, &bp->child_list, child_list) {
> + err = _perf_event_modify_breakpoint(child, attr);
> + if (err)
> + goto unlock;
> + }
> +unlock:
> + mutex_unlock(&bp->child_mutex);
> + return err;
> +}
> +
> static int perf_event_modify_attr(struct perf_event *event,
> struct perf_event_attr *attr)


Not directly related to the above question, but related to my use case:
could we extend bpf_perf_event_data with some more data regarding
breakpoint events?

struct bpf_perf_event_data {
	bpf_user_pt_regs_t regs;
	__u64 sample_period;
	__u64 addr;
};

Ideally, I would like to have the actual access address, size and
read/write type (which may not match the bp addr/size). Is that info
easily available at the point of the bpf hook call?
Or, if that's not available, at least the breakpoint's bp_type/bp_size.

Is it correct that we can materialize in bpf_perf_event_data anything
that's available in bpf_perf_event_data_kern (if it makes sense in the
public interface of course)?

struct bpf_perf_event_data_kern {
	bpf_user_pt_regs_t *regs;
	struct perf_sample_data *data;
	struct perf_event *event;
};

Unfortunately I don't see perf_event_attr.bp_type/bp_size
stored/accessible anywhere in bpf_perf_event_data_kern. What would be
the right way to expose them in bpf_perf_event_data?

Thanks

2021-02-01 08:52:53

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Sun, Jan 31, 2021 at 11:28 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Sun, Jan 31, 2021 at 11:04 AM Dmitry Vyukov <[email protected]> wrote:
> >
> > On Thu, Nov 12, 2020 at 11:43 AM Dmitry Vyukov <[email protected]> wrote:
> > > > > for sampling race detection),
> > > > > number of threads in the process can be up to, say, ~~10K and the
> > > > > watchpoint is intended to be set for a very brief period of time
> > > > > (~~few ms).
> > > >
> > > > Performance is a consideration here, doing lots of IPIs in such a short
> > > > window, on potentially large machines is a DoS risk.
> > > >
> > > > > This can be done today with both perf_event_open and ptrace.
> > > > > However, the problem is that both APIs work on a single thread level
> > > > > (? perf_event_open can be inherited by children, but not for existing
> > > > > siblings). So doing this would require iterating over, say, 10K
> > > >
> > > > One way would be to create the event before the process starts spawning
> > > > threads and keeping it disabled. Then every thread will inherit it, but
> > > > it'll be inactive.
> > > >
> > > > > I see at least one potential problem: what do we do if some sibling
> > > > > thread already has all 4 watchpoints consumed?
> > > >
> > > > That would be immediately avoided by this, since it will have the
> > > > watchpoint reserved per inheriting the event.
> > > >
> > > > Then you can do ioctl(PERF_EVENT_IOC_{MODIFY_ATTRIBUTES,ENABLE,DISABLE})
> > > > to update the watch location and enable/disable it. This _will_ indeed
> > > > result in a shitload of IPIs if the threads are active, but it should
> > > > work.
> > >
> > > Aha! That's the possibility I missed.
> > > We will try to prototype this and get back with more questions if/when
> > > we have them.
> > > Thanks!
> >
> > Hi Peter,
> >
> > I've tested this approach and it works, but only in half.
> > PERF_EVENT_IOC_{ENABLE,DISABLE} work as advertised.
> > However, PERF_EVENT_IOC_MODIFY_ATTRIBUTES does not work for inherited
> > child events.
> > Does something like this make any sense to you? Are you willing to
> > accept such change?
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 55d18791a72d..f6974807a32c 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -3174,7 +3174,7 @@ int perf_event_refresh(struct perf_event *event,
> > int refresh)
> > }
> > EXPORT_SYMBOL_GPL(perf_event_refresh);
> >
> > -static int perf_event_modify_breakpoint(struct perf_event *bp,
> > +static int _perf_event_modify_breakpoint(struct perf_event *bp,
> > struct perf_event_attr *attr)
> > {
> > int err;
> > @@ -3189,6 +3189,28 @@ static int perf_event_modify_breakpoint(struct
> > perf_event *bp,
> > return err;
> > }
> >
> > +static int perf_event_modify_breakpoint(struct perf_event *bp,
> > + struct perf_event_attr *attr)
> > +{
> > + struct perf_event *child;
> > + int err;
> > +
> > + WARN_ON_ONCE(bp->ctx->parent_ctx);
> > +
> > + mutex_lock(&bp->child_mutex);
> > + err = _perf_event_modify_breakpoint(bp, attr);
> > + if (err)
> > + goto unlock;
> > + list_for_each_entry(child, &bp->child_list, child_list) {
> > + err = _perf_event_modify_breakpoint(child, attr);
> > + if (err)
> > + goto unlock;
> > + }
> > +unlock:
> > + mutex_unlock(&bp->child_mutex);
> > + return err;
> > +}
> > +
> > static int perf_event_modify_attr(struct perf_event *event,
> > struct perf_event_attr *attr)
>
>
> Not directly related to the above question, but related to my use case.
> Could we extend bpf_perf_event_data with some more data re breakpoint events?
>
> struct bpf_perf_event_data {
> bpf_user_pt_regs_t regs;
> __u64 sample_period;
> __u64 addr;
> };
>
> Ideally, I would like to have an actual access address, size and
> read/write type (may not match bp addr/size). Is that info easily
> available at the point of bpf hook call?
> Or, if that's not available at least breakpoint bp_type/bp_size.
>
> Is it correct that we can materialize in bpf_perf_event_data anything
> that's available in bpf_perf_event_data_kern (if it makes sense in the
> public interface of course)?
>
> struct bpf_perf_event_data_kern {
> bpf_user_pt_regs_t *regs;
> struct perf_sample_data *data;
> struct perf_event *event;
> };
>
> Unfortunately I don't see perf_event_attr.bp_type/bp_size
> stored/accessible anywhere in bpf_perf_event_data_kern. What would be
> the right way to expose them in bpf_perf_event_data?

Or, alternatively, would it be reasonable for perf to generate SIGTRAP
directly on watchpoint hit (like ptrace does)? That's what I am
ultimately trying to do by attaching a bpf program.

2021-02-03 05:41:54

by Namhyung Kim

Subject: Re: Process-wide watchpoints

Hello,

On Sun, Jan 31, 2021 at 7:28 PM Dmitry Vyukov <[email protected]> wrote:
> Not directly related to the above question, but related to my use case.
> Could we extend bpf_perf_event_data with some more data re breakpoint events?
>
> struct bpf_perf_event_data {
> bpf_user_pt_regs_t regs;
> __u64 sample_period;
> __u64 addr;
> };
>
> Ideally, I would like to have an actual access address, size and
> read/write type (may not match bp addr/size). Is that info easily
> available at the point of bpf hook call?
> Or, if that's not available at least breakpoint bp_type/bp_size.
>
> Is it correct that we can materialize in bpf_perf_event_data anything
> that's available in bpf_perf_event_data_kern (if it makes sense in the
> public interface of course)?
>
> struct bpf_perf_event_data_kern {
> bpf_user_pt_regs_t *regs;
> struct perf_sample_data *data;
> struct perf_event *event;
> };
>
> Unfortunately I don't see perf_event_attr.bp_type/bp_size
> stored/accessible anywhere in bpf_perf_event_data_kern. What would be
> the right way to expose them in bpf_perf_event_data?

I think you can access it via event->attr.bp_type/size in the struct
bpf_perf_event_data_kern.

Thanks,
Namhyung
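
For illustration, a minimal sketch of what Namhyung suggests -- reading
the breakpoint attributes through the kernel-side context struct from a
perf_event BPF program attached to the event. This assumes a libbpf/CO-RE
build (vmlinux.h) and that casting the program context to struct
bpf_perf_event_data_kern is acceptable here; the program name is made up:

/* sketch.bpf.c */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("perf_event")
int on_watchpoint(struct bpf_perf_event_data *ctx)
{
	/* The context handed to perf_event programs is really a
	 * struct bpf_perf_event_data_kern in the kernel; go through it
	 * to reach event->attr. */
	struct bpf_perf_event_data_kern *kctx = (void *)ctx;
	__u32 bp_type = BPF_CORE_READ(kctx, event, attr.bp_type);
	__u64 bp_addr = BPF_CORE_READ(kctx, event, attr.bp_addr);

	bpf_printk("watchpoint hit: type=%u addr=%llx", bp_type, bp_addr);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";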

2021-02-03 12:24:41

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Sun, Jan 31, 2021 at 11:04:43AM +0100, Dmitry Vyukov wrote:

> PERF_EVENT_IOC_{ENABLE,DISABLE} work as advertised.
> However, PERF_EVENT_IOC_MODIFY_ATTRIBUTES does not work for inherited
> child events.
> Does something like this make any sense to you? Are you willing to
> accept such change?
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 55d18791a72d..f6974807a32c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3174,7 +3174,7 @@ int perf_event_refresh(struct perf_event *event,
> int refresh)
> }
> EXPORT_SYMBOL_GPL(perf_event_refresh);
>
> -static int perf_event_modify_breakpoint(struct perf_event *bp,
> +static int _perf_event_modify_breakpoint(struct perf_event *bp,
> struct perf_event_attr *attr)
> {
> int err;
> @@ -3189,6 +3189,28 @@ static int perf_event_modify_breakpoint(struct
> perf_event *bp,
> return err;
> }
>
> +static int perf_event_modify_breakpoint(struct perf_event *bp,
> + struct perf_event_attr *attr)
> +{
> + struct perf_event *child;
> + int err;
> +
> + WARN_ON_ONCE(bp->ctx->parent_ctx);
> +
> + mutex_lock(&bp->child_mutex);
> + err = _perf_event_modify_breakpoint(bp, attr);
> + if (err)
> + goto unlock;
> + list_for_each_entry(child, &bp->child_list, child_list) {
> + err = _perf_event_modify_breakpoint(child, attr);
> + if (err)
> + goto unlock;
> + }
> +unlock:
> + mutex_unlock(&bp->child_mutex);
> + return err;
> +}
> +
> static int perf_event_modify_attr(struct perf_event *event,
> struct perf_event_attr *attr)

Oh.. yeah, normal ioctl()s go through the perf_event_for_each_child()
thing, but that doesn't work here.

So yeah, I suppose your patch makes sense.

2021-02-03 12:31:52

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Mon, Feb 01, 2021 at 09:50:20AM +0100, Dmitry Vyukov wrote:
> Or, alternatively would it be reasonable for perf to generate SIGTRAP
> directly on watchpoint hit (like ptrace does)? That's what I am
> ultimately trying to do by attaching a bpf program.

Perf should be able to generate signals. The perf_event_open manpage
lists two ways of triggering signals. The second way doesn't work for
you, due to it not working on inherited counters, but would the first
work?

That is, set attr::wakeup_events and fcntl(F_SETSIG).

2021-02-03 12:56:47

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Wed, Feb 3, 2021 at 1:29 PM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Feb 01, 2021 at 09:50:20AM +0100, Dmitry Vyukov wrote:
> > Or, alternatively would it be reasonable for perf to generate SIGTRAP
> > directly on watchpoint hit (like ptrace does)? That's what I am
> > ultimately trying to do by attaching a bpf program.
>
> Perf should be able to generate signals, The perf_event_open manpage
> lists two ways of trigering signals. The second way doesn't work for
> you, due to it not working on inherited counters, but would the first
> work?
>
> That is, set attr::wakeup_events and fcntl(F_SETSIG).

The problem is that this sends a signal to the fd owner rather than
the thread that hit the breakpoint. At least that's what happened in
our tests. We would like to send a signal to the thread that hit the
breakpoint.

2021-02-03 12:57:38

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Wed, Feb 3, 2021 at 1:49 PM Dmitry Vyukov <[email protected]> wrote:
>
> On Wed, Feb 3, 2021 at 1:29 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Feb 01, 2021 at 09:50:20AM +0100, Dmitry Vyukov wrote:
> > > Or, alternatively would it be reasonable for perf to generate SIGTRAP
> > > directly on watchpoint hit (like ptrace does)? That's what I am
> > > ultimately trying to do by attaching a bpf program.
> >
> > Perf should be able to generate signals, The perf_event_open manpage
> > lists two ways of trigering signals. The second way doesn't work for
> > you, due to it not working on inherited counters, but would the first
> > work?
> >
> > That is, set attr::wakeup_events and fcntl(F_SETSIG).
>
> The problem is that this sends a signal to the fd owner rather than
> the thread that hit the breakpoint. At least that's what happened in
> our tests. We would like to send a signal to the thread that hit the
> breakpoint.

I think that signal also does not carry the address and other info
typically present in SIGTRAP/SIGSEGV.
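
For reference, roughly what a handler sees via the fcntl(F_SETSIG) route
(a sketch, assuming SA_SIGINFO; the handler name is illustrative), which
is why the access address is missing:

#define _GNU_SOURCE
#include <signal.h>

static volatile sig_atomic_t last_fd, last_code;

static void on_overflow(int sig, siginfo_t *info, void *uctx)
{
	last_fd = info->si_fd;		/* the perf event fd that overflowed */
	last_code = info->si_code;	/* a POLL_* value, e.g. POLL_IN/POLL_HUP */
	/* No si_addr with the faulting access address here, unlike the
	 * SIGTRAP a debugger gets via ptrace, or a SIGSEGV. */
}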

2021-02-03 13:40:53

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Wed, Feb 03, 2021 at 01:49:56PM +0100, Dmitry Vyukov wrote:
> On Wed, Feb 3, 2021 at 1:29 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Feb 01, 2021 at 09:50:20AM +0100, Dmitry Vyukov wrote:
> > > Or, alternatively would it be reasonable for perf to generate SIGTRAP
> > > directly on watchpoint hit (like ptrace does)? That's what I am
> > > ultimately trying to do by attaching a bpf program.
> >
> > Perf should be able to generate signals, The perf_event_open manpage
> > lists two ways of trigering signals. The second way doesn't work for
> > you, due to it not working on inherited counters, but would the first
> > work?
> >
> > That is, set attr::wakeup_events and fcntl(F_SETSIG).
>
> The problem is that this sends a signal to the fd owner rather than
> the thread that hit the breakpoint. At least that's what happened in
> our tests. We would like to send a signal to the thread that hit the
> breakpoint.

Ah indeed.. all of this was aimed at self-monitoring.

Letting perf send a signal to the monitored task is intrusive.. let me
think on that.

2021-02-04 08:12:05

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Wed, Feb 3, 2021 at 2:37 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Feb 03, 2021 at 01:49:56PM +0100, Dmitry Vyukov wrote:
> > On Wed, Feb 3, 2021 at 1:29 PM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Mon, Feb 01, 2021 at 09:50:20AM +0100, Dmitry Vyukov wrote:
> > > > Or, alternatively would it be reasonable for perf to generate SIGTRAP
> > > > directly on watchpoint hit (like ptrace does)? That's what I am
> > > > ultimately trying to do by attaching a bpf program.
> > >
> > > Perf should be able to generate signals, The perf_event_open manpage
> > > lists two ways of trigering signals. The second way doesn't work for
> > > you, due to it not working on inherited counters, but would the first
> > > work?
> > >
> > > That is, set attr::wakeup_events and fcntl(F_SETSIG).
> >
> > The problem is that this sends a signal to the fd owner rather than
> > the thread that hit the breakpoint. At least that's what happened in
> > our tests. We would like to send a signal to the thread that hit the
> > breakpoint.
>
> Ah indeed.. all of this was aimed at self-monitoring.
>
> Letting perf send a signal to the monitored task is intrusive.. let me
> think on that.

I was thinking of something very similar to bpf_send_signal, which
delays sending the signal until exit from the irq:
https://elixir.bootlin.com/linux/latest/source/kernel/trace/bpf_trace.c#L1091
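
For reference, a rough sketch of that BPF route (described later in the
thread as "doable today by attaching a BPF program"): a perf_event
program attached to the breakpoint event which signals the thread that
hit it. It assumes bpf_send_signal_thread() -- a sibling of
bpf_send_signal() that targets the current thread rather than the thread
group -- is acceptable; like bpf_send_signal(), delivery is deferred
until exit from the interrupt:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("perf_event")
int on_watchpoint(struct bpf_perf_event_data *ctx)
{
	/* 5 == SIGTRAP; delivered to the current thread, i.e. the one
	 * that triggered the breakpoint, not to the fd owner. */
	bpf_send_signal_thread(5);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";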

2021-02-04 08:15:44

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Wed, Feb 3, 2021 at 6:38 AM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> On Sun, Jan 31, 2021 at 7:28 PM Dmitry Vyukov <[email protected]> wrote:
> > Not directly related to the above question, but related to my use case.
> > Could we extend bpf_perf_event_data with some more data re breakpoint events?
> >
> > struct bpf_perf_event_data {
> > bpf_user_pt_regs_t regs;
> > __u64 sample_period;
> > __u64 addr;
> > };
> >
> > Ideally, I would like to have an actual access address, size and
> > read/write type (may not match bp addr/size). Is that info easily
> > available at the point of bpf hook call?
> > Or, if that's not available at least breakpoint bp_type/bp_size.
> >
> > Is it correct that we can materialize in bpf_perf_event_data anything
> > that's available in bpf_perf_event_data_kern (if it makes sense in the
> > public interface of course)?
> >
> > struct bpf_perf_event_data_kern {
> > bpf_user_pt_regs_t *regs;
> > struct perf_sample_data *data;
> > struct perf_event *event;
> > };
> >
> > Unfortunately I don't see perf_event_attr.bp_type/bp_size
> > stored/accessible anywhere in bpf_perf_event_data_kern. What would be
> > the right way to expose them in bpf_perf_event_data?
>
> I think you can access it via event->attr.bp_type/size in the struct
> bpf_perf_event_data_kern.


Hi Namhyung,

Right, that's what I need. Thanks.

2021-02-04 10:00:51

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Thu, Feb 4, 2021 at 10:39 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Feb 04, 2021 at 09:10:11AM +0100, Dmitry Vyukov wrote:
> > On Wed, Feb 3, 2021 at 2:37 PM Peter Zijlstra <[email protected]> wrote:
>
> > > Letting perf send a signal to the monitored task is intrusive.. let me
> > > think on that.
> >
> > I was thinking of something very similar to that bpf_send_signal that
> > delays sending to exit from irq:
> > https://elixir.bootlin.com/linux/latest/source/kernel/trace/bpf_trace.c#L1091
>
> Oh, making code to do it isn't the problem. The problem stems from the
> fact that perf is supposed to be observant only. The exception is when
> you monitor yourself, in that case you can send signals to yourself,
> because you know what you're doing (supposedly ;-).
>
> But if you go send signals to the task you're monitoring, you're
> actually changing their code-flow, you're an active participant instead
> of an observer.
>
> Also, they might not be able to handle the signal, in which case you're
> not changing the program but terminating it entirely.
>
> That's a big conceptual shift.
>
> OTOH, we're using ptrace permission checks, and ptrace() can inject
> signals just fine. But it's a fairly big departure from what perf set
> out to be.

Oh, I see, I did not think about this.

FWIW it's doable today by attaching a BPF program.

Will it help if this mode is restricted to monitoring the current
process? Sending signals indeed usually requires cooperation, so doing
it for the current process looks like a reasonable restriction.
This may not be a fundamental restriction, but rather "we don't have
any use cases and are not sure about the implications, so this is a
precautionary measure; it may be relaxed in the future".

2021-02-04 12:14:53

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Feb 04, 2021 at 10:54:42AM +0100, Dmitry Vyukov wrote:
> On Thu, Feb 4, 2021 at 10:39 AM Peter Zijlstra <[email protected]> wrote:

> > OTOH, we're using ptrace permission checks, and ptrace() can inject
> > signals just fine. But it's a fairly big departure from what perf set
> > out to be.
>
> Oh, I see, I did not think about this.
>
> FWIW it's doable today by attaching a BPF program.

Sorta. For one, I can't operate BPF to save my life. Secondly, BPF has
some very dodgy recursion rules and it's trivial to lose BPF
invocations because another BPF program is already running.

> Will it help if this mode is restricted to monitoring the current
> process? Sending signals indeed usually requires cooperation, so doing
> it for the current process looks like a reasonable restriction.
> This may be not a fundamental restriction, but rather "we don't have
> any use cases and are not sure about implications, so this is a
> precaution measure, may be relaxed in future".

Yeah, limiting it might help. I can trivially add attr::thread_only,
that requires attr::inherit and will limit it to CLONE_THREAD (find
below).

What do we do then? The advantage of IOC_REFRESH is that it disables the
event until it gets explicitly re-armed, avoiding recursion issues etc.
Do you want those semantics? If so, we'd need to have IOC_REFRESH find
the actual event for the current task, which should be doable I suppose.

And I need to dig into that fcntl() crud again, see if that's capable of
doing a SIGTRAP and if it's possible to target that to the task raising
it, instead of doing a process wide signal delivery.

Lemme rummage about a bit.

---
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -955,7 +955,7 @@ extern void __perf_event_task_sched_in(s
struct task_struct *task);
extern void __perf_event_task_sched_out(struct task_struct *prev,
struct task_struct *next);
-extern int perf_event_init_task(struct task_struct *child);
+extern int perf_event_init_task(struct task_struct *child, unsigned long clone_flags);
extern void perf_event_exit_task(struct task_struct *child);
extern void perf_event_free_task(struct task_struct *task);
extern void perf_event_delayed_put(struct task_struct *task);
@@ -1446,7 +1446,8 @@ perf_event_task_sched_in(struct task_str
static inline void
perf_event_task_sched_out(struct task_struct *prev,
struct task_struct *next) { }
-static inline int perf_event_init_task(struct task_struct *child) { return 0; }
+static inline int perf_event_init_task(struct task_struct *child,
+ unsigned long clone_flags) { return 0; }
static inline void perf_event_exit_task(struct task_struct *child) { }
static inline void perf_event_free_task(struct task_struct *task) { }
static inline void perf_event_delayed_put(struct task_struct *task) { }
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -388,7 +388,8 @@ struct perf_event_attr {
aux_output : 1, /* generate AUX records instead of events */
cgroup : 1, /* include cgroup events */
text_poke : 1, /* include text poke events */
- __reserved_1 : 30;
+ thread_only : 1, /* only inherit on threads */
+ __reserved_1 : 29;

union {
__u32 wakeup_events; /* wakeup every n events */
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12776,12 +12776,13 @@ static int
inherit_task_group(struct perf_event *event, struct task_struct *parent,
struct perf_event_context *parent_ctx,
struct task_struct *child, int ctxn,
- int *inherited_all)
+ unsigned long clone_flags, int *inherited_all)
{
int ret;
struct perf_event_context *child_ctx;

- if (!event->attr.inherit) {
+ if (!event->attr.inherit ||
+ (event->attr.thread_only && !(clone_flags & CLONE_THREAD))) {
*inherited_all = 0;
return 0;
}
@@ -12813,7 +12814,7 @@ inherit_task_group(struct perf_event *ev
/*
* Initialize the perf_event context in task_struct
*/
-static int perf_event_init_context(struct task_struct *child, int ctxn)
+static int perf_event_init_context(struct task_struct *child, int ctxn, unsigned long clone_flags)
{
struct perf_event_context *child_ctx, *parent_ctx;
struct perf_event_context *cloned_ctx;
@@ -12853,7 +12854,8 @@ static int perf_event_init_context(struc
*/
perf_event_groups_for_each(event, &parent_ctx->pinned_groups) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, &inherited_all);
+ child, ctxn, clone_flags,
+ &inherited_all);
if (ret)
goto out_unlock;
}
@@ -12869,7 +12871,8 @@ static int perf_event_init_context(struc

perf_event_groups_for_each(event, &parent_ctx->flexible_groups) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, &inherited_all);
+ child, ctxn, clone_flags,
+ &inherited_all);
if (ret)
goto out_unlock;
}
@@ -12911,7 +12914,7 @@ static int perf_event_init_context(struc
/*
* Initialize the perf_event context in task_struct
*/
-int perf_event_init_task(struct task_struct *child)
+int perf_event_init_task(struct task_struct *child, unsigned long clone_flags)
{
int ctxn, ret;

@@ -12920,7 +12923,7 @@ int perf_event_init_task(struct task_str
INIT_LIST_HEAD(&child->perf_event_list);

for_each_task_context_nr(ctxn) {
- ret = perf_event_init_context(child, ctxn);
+ ret = perf_event_init_context(child, ctxn, clone_flags);
if (ret) {
perf_event_free_task(child);
return ret;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2068,7 +2068,7 @@ static __latent_entropy struct task_stru
if (retval)
goto bad_fork_cleanup_policy;

- retval = perf_event_init_task(p);
+ retval = perf_event_init_task(p, clone_flags);
if (retval)
goto bad_fork_cleanup_policy;
retval = audit_alloc(p);

2021-02-04 13:37:11

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Feb 04, 2021 at 01:53:59PM +0100, Dmitry Vyukov wrote:
> On Thu, Feb 4, 2021 at 1:09 PM Peter Zijlstra <[email protected]> wrote:

> > What do we do then? The advantage of IOC_REFRESH is that it disables the
> > event until it gets explicitly re-armed, avoiding recursion issues etc.
> > Do you want those semantics? If so, we'd need to have IOC_REFRESH find
> > the actual event for the current task, which should be doable I suppose.
>
> Frankly, I don't know. I didn't use it in my prototype, nor I fully
> understand what it's doing. Does it make sense for breakpoints?
> I see IOC_REFRESH has a check for !attr.inherit, so it will fail for
> my use case currently. I would say we just leave it as is for now.

Well, the way it works is that currently you set event_limit > 0. Then
each event decrements it; when it hits 0 we disable the event and raise a
signal.

REFRESH will increment event_limit and re-enable.

This means you're guaranteed not to get another signal until you're
ready for it. It allows leaving the signal handler context to handle the
signal.
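
A small sketch of that event_limit/REFRESH pattern for self-monitoring,
assuming the breakpoint event fd has already been routed to SIGTRAP via
fcntl(F_SETOWN)/fcntl(F_SETSIG); the handler name and the global fd are
illustrative:

#include <sys/ioctl.h>
#include <linux/perf_event.h>

static int bp_fd;	/* breakpoint event fd, F_SETSIG'd to SIGTRAP */

static void on_sigtrap(int sig)
{
	/* ... record the hit ... */

	/* The kernel disabled the event when event_limit reached 0;
	 * REFRESH bumps the limit by one and re-enables, so no further
	 * signal can arrive before we are ready for it. */
	ioctl(bp_fd, PERF_EVENT_IOC_REFRESH, 1);
}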

I suppose you're looking for something like this, which goes on top of
that thread_only thing.

--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -389,7 +389,8 @@ struct perf_event_attr {
cgroup : 1, /* include cgroup events */
text_poke : 1, /* include text poke events */
thread_only : 1, /* only inherit on threads */
- __reserved_1 : 29;
+ sigtrap : 1, /* foo */
+ __reserved_1 : 28;

union {
__u32 wakeup_events; /* wakeup every n events */
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6273,6 +6273,13 @@ static void perf_pending_event_disable(s

if (cpu == smp_processor_id()) {
WRITE_ONCE(event->pending_disable, -1);
+
+ if (event->attr.sigtrap) {
+ atomic_inc(&event->event_limit); /* rearm */
+ send_sig_info(SIGTRAP, SEND_SIG_PRIV, current);
+ return;
+ }
+
perf_event_disable_local(event);
return;
}
@@ -8936,6 +8943,7 @@ static int __perf_event_overflow(struct
int throttle, struct perf_sample_data *data,
struct pt_regs *regs)
{
+ perf_overflow_handler_t ovf;
int events = atomic_read(&event->event_limit);
int ret = 0;

@@ -8961,7 +8969,15 @@ static int __perf_event_overflow(struct
perf_event_disable_inatomic(event);
}

- READ_ONCE(event->overflow_handler)(event, data, regs);
+ ovf = READ_ONCE(event->overflow_handler);
+#ifdef CONFIG_RETPOLINE
+ if (ovf == perf_event_output_forward) {
+ perf_event_output_forward(event, data, regs);
+ } else if (ovf == perf_event_output_backward) {
+ perf_event_output_backward(event, data, regs);
+ } else
+#endif
+ ovf(event, data, regs);

if (*perf_event_fasync(event) && event->pending_kill) {
event->pending_wakeup = 1;
@@ -11281,6 +11297,9 @@ perf_event_alloc(struct perf_event_attr

event->state = PERF_EVENT_STATE_INACTIVE;

+ if (event->attr.sigtrap)
+ event->event_limit = ATOMIC_INIT(1);
+
if (task) {
event->attach_state = PERF_ATTACH_TASK;
/*
@@ -11556,6 +11575,9 @@ static int perf_copy_attr(struct perf_ev
if (attr->thread_only && !attr->inherit)
return -EINVAL;

+ if (attr->sigtrap && attr->inherit && !attr->thread_only)
+ return -EINVAL;
+
out:
return ret;
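
If something along these lines were merged, userspace usage could look
roughly like the sketch below. To be clear, attr.thread_only and
attr.sigtrap are only the bits proposed in the patches above, not an
existing UAPI, and the handler and helper names are made up:

#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

static void on_sigtrap(int sig, siginfo_t *info, void *uctx)
{
	/* Runs in the thread that hit the watchpoint. */
}

static int open_sigtrap_watchpoint(void *addr)
{
	struct perf_event_attr attr;
	struct sigaction sa = { .sa_sigaction = on_sigtrap,
				.sa_flags = SA_SIGINFO };

	sigaction(SIGTRAP, &sa, NULL);

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.size = sizeof(attr);
	attr.bp_type = HW_BREAKPOINT_RW;
	attr.bp_addr = (unsigned long)addr;
	attr.bp_len = HW_BREAKPOINT_LEN_8;
	attr.inherit = 1;	/* inherit into new children ... */
	attr.thread_only = 1;	/* ... but only CLONE_THREAD ones (proposed bit) */
	attr.sigtrap = 1;	/* SIGTRAP the thread that hits it (proposed bit) */

	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}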



2021-02-04 13:40:32

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Thu, Feb 4, 2021 at 2:10 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Feb 04, 2021 at 01:53:59PM +0100, Dmitry Vyukov wrote:
> > Humm... I was thinking of perf_event_open(pid == 0).
> > It does not make sense to send SIGTRAP in a remote process, because it
> > does not necessarily cooperate with us.
> >
> > But is there any problem with clone w/o CLONE_THREAD? Assuming the
> > current process has setup the signal handler, the child will have the
> > same handler and the same code/address space. So delivery of SIGTRAP
> > should work the same way in the child.
>
> Nothing should be doing CLONE_VM without CLONE_THREAD. Yes, it's
> possible, but if you do so, you get to keep the pieces IMO.
>
> Current libc either does a full clone (fork) or pthread_create,
> pthread_create does CLONE_THREAD.

I meant a different thing.
I meant that we could restrict synchronous SIGTRAP by (1) disallowing
it for perf_event_open(pid != 0) and (2) disabling it after exec.
What is the issue here with clone without CLONE_THREAD?

2021-02-04 13:51:40

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Feb 04, 2021 at 02:35:36PM +0100, Dmitry Vyukov wrote:
> On Thu, Feb 4, 2021 at 2:10 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Feb 04, 2021 at 01:53:59PM +0100, Dmitry Vyukov wrote:
> > > Humm... I was thinking of perf_event_open(pid == 0).
> > > It does not make sense to send SIGTRAP in a remote process, because it
> > > does not necessarily cooperate with us.
> > >
> > > But is there any problem with clone w/o CLONE_THREAD? Assuming the
> > > current process has setup the signal handler, the child will have the
> > > same handler and the same code/address space. So delivery of SIGTRAP
> > > should work the same way in the child.
> >
> > Nothing should be doing CLONE_VM without CLONE_THREAD. Yes, it's
> > possible, but if you do so, you get to keep the pieces IMO.
> >
> > Current libc either does a full clone (fork) or pthread_create,
> > pthread_create does CLONE_THREAD.
>
> I meant a different thing.
> I meant that we could restrict synchronous SIGTRAP for (1)
> perf_event_open(pid != 0) and (2) disable it after exec.

Ah, I figured a generic means to inherit across a process, but not a
process tree, might be useful.

I don't much like magical/implied constraints.

> What is the issue here for clone without CLONE_THREAD?

It's an abomination that's possible and an endless cause of trouble :/

2021-02-04 15:08:34

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Feb 04, 2021 at 02:35:36PM +0100, Dmitry Vyukov wrote:

> I meant that we could restrict synchronous SIGTRAP for (1)
> perf_event_open(pid != 0) and (2) disable it after exec.

Hmm, I think I finally get what you're after. And yes, multi-process or
fork()-based thingies are common and might well work too.

disable_on_exec isn't quite right though; it needs to be something that
kills the events entirely. I'll think about it.

2021-02-04 23:38:12

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Feb 04, 2021 at 09:10:11AM +0100, Dmitry Vyukov wrote:
> On Wed, Feb 3, 2021 at 2:37 PM Peter Zijlstra <[email protected]> wrote:

> > Letting perf send a signal to the monitored task is intrusive.. let me
> > think on that.
>
> I was thinking of something very similar to that bpf_send_signal that
> delays sending to exit from irq:
> https://elixir.bootlin.com/linux/latest/source/kernel/trace/bpf_trace.c#L1091

Oh, making code to do it isn't the problem. The problem stems from the
fact that perf is supposed to be observant only. The exception is when
you monitor yourself, in that case you can send signals to yourself,
because you know what you're doing (supposedly ;-).

But if you go send signals to the task you're monitoring, you're
actually changing their code-flow, you're an active participant instead
of an observer.

Also, they might not be able to handle the signal, in which case you're
not changing the program but terminating it entirely.

That's a big conceptual shift.

OTOH, we're using ptrace permission checks, and ptrace() can inject
signals just fine. But it's a fairly big departure from what perf set
out to be.

2021-02-05 00:12:40

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Feb 04, 2021 at 01:53:59PM +0100, Dmitry Vyukov wrote:
> Humm... I was thinking of perf_event_open(pid == 0).
> It does not make sense to send SIGTRAP in a remote process, because it
> does not necessarily cooperate with us.
>
> But is there any problem with clone w/o CLONE_THREAD? Assuming the
> current process has setup the signal handler, the child will have the
> same handler and the same code/address space. So delivery of SIGTRAP
> should work the same way in the child.

Nothing should be doing CLONE_VM without CLONE_THREAD. Yes, it's
possible, but if you do so, you get to keep the pieces IMO.

Current libc either does a full clone (fork) or pthread_create,
pthread_create does CLONE_THREAD.

2021-02-05 00:12:54

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Thu, Feb 4, 2021 at 1:09 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Feb 04, 2021 at 10:54:42AM +0100, Dmitry Vyukov wrote:
> > On Thu, Feb 4, 2021 at 10:39 AM Peter Zijlstra <[email protected]> wrote:
>
> > > OTOH, we're using ptrace permission checks, and ptrace() can inject
> > > signals just fine. But it's a fairly big departure from what perf set
> > > out to be.
> >
> > Oh, I see, I did not think about this.
> >
> > FWIW it's doable today by attaching a BPF program.
>
> Sorta. For one, I can't operate BPF to save my life. Secondly, BPF has
> some very dodgy recursion rules and it's trivial to loose BPF
> invocations because another BPF is already running.
>
> > Will it help if this mode is restricted to monitoring the current
> > process? Sending signals indeed usually requires cooperation, so doing
> > it for the current process looks like a reasonable restriction.
> > This may be not a fundamental restriction, but rather "we don't have
> > any use cases and are not sure about implications, so this is a
> > precaution measure, may be relaxed in future".
>
> Yeah, limiting it might help. I can trivially add attr::thread_only,
> that requires attr::inherit and will limit it to CLONE_THREAD (find
> below).
>
> What do we do then? The advantage of IOC_REFRESH is that it disables the
> event until it gets explicitly re-armed, avoiding recursion issues etc.
> Do you want those semantics? If so, we'd need to have IOC_REFRESH find
> the actual event for the current task, which should be doable I suppose.

Frankly, I don't know. I didn't use it in my prototype, nor do I fully
understand what it's doing. Does it make sense for breakpoints?
I see IOC_REFRESH has a check for !attr.inherit, so it will currently
fail for my use case. I would say we just leave it as is for now.


> And I need to dig into that fcntl() crud again, see if that's capable of
> doing a SIGTRAP and if it's possible to target that to the task raising
> it, instead of doing a process wide signal delivery.
>
> Lemme rummage about a bit.

Note: if we do this, I would also need an address and FAULT_FLAG_WRITE.
AFAIU the current code sends SIGTRAP w/o any arguments.

> ---
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -955,7 +955,7 @@ extern void __perf_event_task_sched_in(s
> struct task_struct *task);
> extern void __perf_event_task_sched_out(struct task_struct *prev,
> struct task_struct *next);
> -extern int perf_event_init_task(struct task_struct *child);
> +extern int perf_event_init_task(struct task_struct *child, unsigned long clone_flags);
> extern void perf_event_exit_task(struct task_struct *child);
> extern void perf_event_free_task(struct task_struct *task);
> extern void perf_event_delayed_put(struct task_struct *task);
> @@ -1446,7 +1446,8 @@ perf_event_task_sched_in(struct task_str
> static inline void
> perf_event_task_sched_out(struct task_struct *prev,
> struct task_struct *next) { }
> -static inline int perf_event_init_task(struct task_struct *child) { return 0; }
> +static inline int perf_event_init_task(struct task_struct *child,
> + unsigned long clone_flags) { return 0; }
> static inline void perf_event_exit_task(struct task_struct *child) { }
> static inline void perf_event_free_task(struct task_struct *task) { }
> static inline void perf_event_delayed_put(struct task_struct *task) { }
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -388,7 +388,8 @@ struct perf_event_attr {
> aux_output : 1, /* generate AUX records instead of events */
> cgroup : 1, /* include cgroup events */
> text_poke : 1, /* include text poke events */
> - __reserved_1 : 30;
> + thread_only : 1, /* only inherit on threads */
> + __reserved_1 : 29;
>
> union {
> __u32 wakeup_events; /* wakeup every n events */
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -12776,12 +12776,13 @@ static int
> inherit_task_group(struct perf_event *event, struct task_struct *parent,
> struct perf_event_context *parent_ctx,
> struct task_struct *child, int ctxn,
> - int *inherited_all)
> + unsigned long clone_flags, int *inherited_all)
> {
> int ret;
> struct perf_event_context *child_ctx;
>
> - if (!event->attr.inherit) {
> + if (!event->attr.inherit ||
> + (event->attr.thread_only && !(clone_flags & CLONE_THREAD))) {
> *inherited_all = 0;
> return 0;
> }
> @@ -12813,7 +12814,7 @@ inherit_task_group(struct perf_event *ev
> /*
> * Initialize the perf_event context in task_struct
> */
> -static int perf_event_init_context(struct task_struct *child, int ctxn)
> +static int perf_event_init_context(struct task_struct *child, int ctxn, unsigned long clone_flags)
> {
> struct perf_event_context *child_ctx, *parent_ctx;
> struct perf_event_context *cloned_ctx;
> @@ -12853,7 +12854,8 @@ static int perf_event_init_context(struc
> */
> perf_event_groups_for_each(event, &parent_ctx->pinned_groups) {
> ret = inherit_task_group(event, parent, parent_ctx,
> - child, ctxn, &inherited_all);
> + child, ctxn, clone_flags,
> + &inherited_all);
> if (ret)
> goto out_unlock;
> }
> @@ -12869,7 +12871,8 @@ static int perf_event_init_context(struc
>
> perf_event_groups_for_each(event, &parent_ctx->flexible_groups) {
> ret = inherit_task_group(event, parent, parent_ctx,
> - child, ctxn, &inherited_all);
> + child, ctxn, clone_flags,
> + &inherited_all);
> if (ret)
> goto out_unlock;
> }
> @@ -12911,7 +12914,7 @@ static int perf_event_init_context(struc
> /*
> * Initialize the perf_event context in task_struct
> */
> -int perf_event_init_task(struct task_struct *child)
> +int perf_event_init_task(struct task_struct *child, unsigned long clone_flags)
> {
> int ctxn, ret;
>
> @@ -12920,7 +12923,7 @@ int perf_event_init_task(struct task_str
> INIT_LIST_HEAD(&child->perf_event_list);
>
> for_each_task_context_nr(ctxn) {
> - ret = perf_event_init_context(child, ctxn);
> + ret = perf_event_init_context(child, ctxn, clone_flags);
> if (ret) {
> perf_event_free_task(child);
> return ret;
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2068,7 +2068,7 @@ static __latent_entropy struct task_stru
> if (retval)
> goto bad_fork_cleanup_policy;
>
> - retval = perf_event_init_task(p);
> + retval = perf_event_init_task(p, clone_flags);
> if (retval)
> goto bad_fork_cleanup_policy;
> retval = audit_alloc(p);


Humm... I was thinking of perf_event_open(pid == 0).
It does not make sense to send SIGTRAP to a remote process, because it
does not necessarily cooperate with us.

But is there any problem with clone w/o CLONE_THREAD? Assuming the
current process has set up the signal handler, the child will have the
same handler and the same code/address space. So delivery of SIGTRAP
should work the same way in the child.
I do see how it may cause problems for exec, though. We don't have the
same signal handler in the address space (I assume exec resets all
signal handlers). So in that case SIGTRAP will cause problems. Perhaps
this calls for something like:

disable_on_exec : 1, /* disable after exec */

which would be complementary to the current:

enable_on_exec : 1, /* next exec enables */

2021-02-05 00:13:07

by Peter Zijlstra

Subject: Re: Process-wide watchpoints

On Thu, Feb 04, 2021 at 01:09:13PM +0100, Peter Zijlstra wrote:

> And I need to dig into that fcntl() crud again, see if that's capable of
> doing a SIGTRAP and if it's possible to target that to the task raising
> it, instead of doing a process wide signal delivery.

Hmm, so it might be possible to extend SETOWN_EX to allow
{F_OWNER_TID,0} to mean any current thread.

And I think SETSIG can be used to set SIGTRAP, although I don't see it
setting the fields you wanted, not entirely sure that's fixable.

2021-02-05 00:14:11

by Dmitry Vyukov

Subject: Re: Process-wide watchpoints

On Thu, Feb 4, 2021 at 2:33 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Feb 04, 2021 at 01:53:59PM +0100, Dmitry Vyukov wrote:
> > On Thu, Feb 4, 2021 at 1:09 PM Peter Zijlstra <[email protected]> wrote:
>
> > > What do we do then? The advantage of IOC_REFRESH is that it disables the
> > > event until it gets explicitly re-armed, avoiding recursion issues etc.
> > > Do you want those semantics? If so, we'd need to have IOC_REFRESH find
> > > the actual event for the current task, which should be doable I suppose.
> >
> > Frankly, I don't know. I didn't use it in my prototype, nor I fully
> > understand what it's doing. Does it make sense for breakpoints?
> > I see IOC_REFRESH has a check for !attr.inherit, so it will fail for
> > my use case currently. I would say we just leave it as is for now.
>
> Well, the way it works is that currently you set event_limit > 0. Then
> each event will decrement, when we hit 0 we disable and raise a signal.
>
> REFRESH will increment event_limit and re-enable.
>
> This means you're guaranteed not to get another signal until you're
> ready for it. It allows leaving the signal handler context to handle the
> signal.
>
> I suppose you're looking for something like this, which goes in top of
> that thread_only thing.
>
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -389,7 +389,8 @@ struct perf_event_attr {
> cgroup : 1, /* include cgroup events */
> text_poke : 1, /* include text poke events */
> thread_only : 1, /* only inherit on threads */
> - __reserved_1 : 29;
> + sigtrap : 1, /* foo */
> + __reserved_1 : 28;
>
> union {
> __u32 wakeup_events; /* wakeup every n events */
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6273,6 +6273,13 @@ static void perf_pending_event_disable(s
>
> if (cpu == smp_processor_id()) {
> WRITE_ONCE(event->pending_disable, -1);
> +
> + if (event->attr.sigtrap) {
> + atomic_inc(&event->event_limit); /* rearm */
> + send_sig_info(SIGTRAP, SEND_SIG_PRIV, current);
> + return;
> + }
> +
> perf_event_disable_local(event);
> return;
> }
> @@ -8936,6 +8943,7 @@ static int __perf_event_overflow(struct
> int throttle, struct perf_sample_data *data,
> struct pt_regs *regs)
> {
> + perf_overflow_handler_t ovf;
> int events = atomic_read(&event->event_limit);
> int ret = 0;
>
> @@ -8961,7 +8969,15 @@ static int __perf_event_overflow(struct
> perf_event_disable_inatomic(event);
> }
>
> - READ_ONCE(event->overflow_handler)(event, data, regs);
> + ovf = READ_ONCE(event->overflow_handler);
> +#ifdef CONFIG_RETPOLINE
> + if (ovf == perf_event_output_forward) {
> + perf_event_output_forward(event, data, regs);
> + } else if (ovf == perf_event_output_backward) {
> + perf_event_output_backward(event, data, regs);
> + } else
> +#endif
> + ovf(event, data, regs);
>
> if (*perf_event_fasync(event) && event->pending_kill) {
> event->pending_wakeup = 1;
> @@ -11281,6 +11297,9 @@ perf_event_alloc(struct perf_event_attr
>
> event->state = PERF_EVENT_STATE_INACTIVE;
>
> + if (event->attr.sigtrap)
> + event->event_limit = ATOMIC_INIT(1);
> +
> if (task) {
> event->attach_state = PERF_ATTACH_TASK;
> /*
> @@ -11556,6 +11575,9 @@ static int perf_copy_attr(struct perf_ev
> if (attr->thread_only && !attr->inherit)
> return -EINVAL;
>
> + if (attr->sigtrap && attr->inherit && !attr->thread_only)
> + return -EINVAL;
> +
> out:
> return ret;


Thanks. Let me see if this will work for us and test.