2024-01-19 16:40:42

by Ben Gainey

[permalink] [raw]
Subject: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat

This change allows events to use PERF_SAMPLE READ with inherit so long
as both inherit_stat and PERF_SAMPLE_TID are set.

Currently it is not possible to use PERF_SAMPLE_READ with inherit. This
restriction assumes the user is interested in collecting aggregate
statistics as per `perf stat`. It prevents a user from collecting
per-thread samples using counter groups from a multi-threaded or
multi-process application, as with `perf record -e '{....}:S'`. Instead
users must use system-wide mode, or forgo the ability to sample counter
groups. System-wide mode is often problematic as it requires specific
permissions (no CAP_PERFMON / root access), or may lead to capture of
significant amounts of extra data from other processes running on the
system.

Perf already supports the ability to collect per-thread counts with
`inherit` via the `inherit_stat` flag. This patch changes
`perf_event_alloc` relaxing the restriction to combine `inherit` with
`PERF_SAMPLE_READ` so that the combination will be allowed so long as
`inherit_stat` and `PERF_SAMPLE_TID` are enabled.

In this configuration stream ids (such as may appear in the read_format
field of a PERF_RECORD_SAMPLE) are no longer globally unique, rather
the pair of (stream id, tid) uniquely identify each event. Tools that
rely on this, for example to calculate a delta between samples, would
need updating to take this into account. Previously valid event
configurations (system-wide, no-inherit and so on) where each stream id
is the identifier are unaffected.

This patch has been tested on aarch64 both my manual inspection of the
output of `perf script -D` and through a modified version of Arm's
commercial profiling tools and the numbers appear to line up as one
would expect, but some further validation across other architectures
and/or edge cases would be welcome.

This patch was developed and tested on top of v6.7.


Ben Gainey (1):
perf: Support PERF_SAMPLE_READ with inherit_stat

kernel/events/core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

--
2.43.0



2024-01-19 16:40:53

by Ben Gainey

[permalink] [raw]
Subject: [PATCH 1/1] perf: Support PERF_SAMPLE_READ with inherit_stat

This change allows events to use PERF_SAMPLE READ with inherit
so long as both inherit_stat and PERF_SAMPLE_TID are set.

In this configuration, and event will be inherited into any
child processes / threads, allowing convenient profiling of a
multi-process or multi-threaded application, whilst allowing
profiling tools to collect per-thread samples, in particular
of groups of counters.

Signed-off-by: Ben Gainey <[email protected]>
---
kernel/events/core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9efd0d7775e7c..4b603463d888f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11988,10 +11988,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
local64_set(&hwc->period_left, hwc->sample_period);

/*
- * We currently do not support PERF_SAMPLE_READ on inherited events.
+ * We do not support PERF_SAMPLE_READ on inherited events unless
+ * inherit_stat and PERF_SAMPLE_TID are also selected, which allows
+ * inherited events to collect per-thread samples.
* See perf_output_read().
*/
- if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ))
+ if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ)
+ && !(attr->inherit_stat && (attr->sample_type & PERF_SAMPLE_TID)))
goto err_ns;

if (!has_branch_stack(event))
--
2.43.0


2024-01-19 18:09:24

by Ben Gainey

[permalink] [raw]
Subject: Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat

On Fri, 2024-01-19 at 09:45 -0800, Andi Kleen wrote:
> Ben Gainey <[email protected]> writes:
>
> > In this configuration stream ids (such as may appear in the
> > read_format
> > field of a PERF_RECORD_SAMPLE) are no longer globally unique,
> > rather
> > the pair of (stream id, tid) uniquely identify each event. Tools
> > that
> > rely on this, for example to calculate a delta between samples,
> > would
> > need updating to take this into account. Previously valid event
> > configurations (system-wide, no-inherit and so on) where each
> > stream id
> > is the identifier are unaffected.
>
> So is this an ABI break? It might need an optin, if it breaks
> anything,
> which wouldn't surprise me. We do have a lot of different perf stream
> parsers around these days and we cannot break them.
>
> -Andi

I had considered that, but given currently this perf_event_attr
configuration is not allowed, I assumed that it would require existing
tools to add support which would in effect be an opt-in. Of course,
adding a new flag to be explicit would be trivial enough if required.

That said, the binary format for the mmap records / read() etc does not
change so using an unmodified tool to parse the data file will give bad
results. Therefore any workflow where "modified recording tool" can be
combined with "older / unmodified parsing tool" will break. Not sure of
the best way to handle this... presumably whenever a change is made to
the perf record format, any workflow that allows old parsers to read
new format data without version checks could fail? Admittedly this is a
"looks the same but isn't" change so harder for tools devs to spot. Any
suggestions?

For the perf tools, is there a means to record in perf.data a minimum
supported tool version / feature incompatibility flags?

Regards
Ben
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

2024-01-19 19:01:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat

Ben Gainey <[email protected]> writes:

> In this configuration stream ids (such as may appear in the read_format
> field of a PERF_RECORD_SAMPLE) are no longer globally unique, rather
> the pair of (stream id, tid) uniquely identify each event. Tools that
> rely on this, for example to calculate a delta between samples, would
> need updating to take this into account. Previously valid event
> configurations (system-wide, no-inherit and so on) where each stream id
> is the identifier are unaffected.

So is this an ABI break? It might need an optin, if it breaks anything,
which wouldn't surprise me. We do have a lot of different perf stream
parsers around these days and we cannot break them.

-Andi

2024-01-20 00:11:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat

> I had considered that, but given currently this perf_event_attr
> configuration is not allowed, I assumed that it would require existing
> tools to add support which would in effect be an opt-in. Of course,
> adding a new flag to be explicit would be trivial enough if required.

That's fair. Makes sense.

> That said, the binary format for the mmap records / read() etc does not
> change so using an unmodified tool to parse the data file will give bad
> results. Therefore any workflow where "modified recording tool" can be
> combined with "older / unmodified parsing tool" will break. Not sure of
> the best way to handle this... presumably whenever a change is made to
> the perf record format, any workflow that allows old parsers to read
> new format data without version checks could fail? Admittedly this is a
> "looks the same but isn't" change so harder for tools devs to spot. Any
> suggestions?

For perf itself we can find something. It does a couple of checks, like
reserved bits in the perf_event_attr. For the general case of other
parsers it's unclear. I suppose could increment the magic identifier
to PERFILE3

-Andi

2024-01-20 00:50:37

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat

Hello,

On Fri, Jan 19, 2024 at 8:39 AM Ben Gainey <[email protected]> wrote:
>
> This change allows events to use PERF_SAMPLE READ with inherit so long
> as both inherit_stat and PERF_SAMPLE_TID are set.
>
> Currently it is not possible to use PERF_SAMPLE_READ with inherit. This
> restriction assumes the user is interested in collecting aggregate
> statistics as per `perf stat`. It prevents a user from collecting
> per-thread samples using counter groups from a multi-threaded or
> multi-process application, as with `perf record -e '{....}:S'`. Instead
> users must use system-wide mode, or forgo the ability to sample counter
> groups. System-wide mode is often problematic as it requires specific
> permissions (no CAP_PERFMON / root access), or may lead to capture of
> significant amounts of extra data from other processes running on the
> system.
>
> Perf already supports the ability to collect per-thread counts with
> `inherit` via the `inherit_stat` flag. This patch changes
> `perf_event_alloc` relaxing the restriction to combine `inherit` with
> `PERF_SAMPLE_READ` so that the combination will be allowed so long as
> `inherit_stat` and `PERF_SAMPLE_TID` are enabled.

I'm not sure if it's correct. Maybe I misunderstand inherit_stat but
AFAIK it's just to use prev_task's events when next_task has the
compatible event context. So the event values it sees in samples
would depend on the timing or scheduler behavior.

Also event counts and time values PERF_SAMPLE_READ sees
include child event's so the values of the parent event can be
updated even if it's inactive. And the values will vary for the
next_task whether prev_task is the parent or not. I think it
would return consistent values only if it iterates all child events
and sums up the values like it does for read(2). But it cannot
do that in the NMI handler.

Frankly I don't understand how inherit_stat supports per-thread
counts properly. Also it doesn't seem to be used by default in
the perf tools. IIUC per-thread count is supported when you
don't set the inherit bit and open separate events for each
thread but I guess that's not what you want.

Anyway, I'm ok with the idea of using PERF_SAMPLE_READ to
improve per-thread profiling especially with event groups.
But I think it should not use inherit_stat and it needs a way to
not include child stats in the samples.

What do you think?

Thanks,
Namhyung

>
> In this configuration stream ids (such as may appear in the read_format
> field of a PERF_RECORD_SAMPLE) are no longer globally unique, rather
> the pair of (stream id, tid) uniquely identify each event. Tools that
> rely on this, for example to calculate a delta between samples, would
> need updating to take this into account. Previously valid event
> configurations (system-wide, no-inherit and so on) where each stream id
> is the identifier are unaffected.
>
> This patch has been tested on aarch64 both my manual inspection of the
> output of `perf script -D` and through a modified version of Arm's
> commercial profiling tools and the numbers appear to line up as one
> would expect, but some further validation across other architectures
> and/or edge cases would be welcome.
>
> This patch was developed and tested on top of v6.7.
>
>
> Ben Gainey (1):
> perf: Support PERF_SAMPLE_READ with inherit_stat
>
> kernel/events/core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> --
> 2.43.0
>

2024-01-20 16:15:28

by Ben Gainey

[permalink] [raw]
Subject: Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat

On Fri, 2024-01-19 at 16:49 -0800, Namhyung Kim wrote:
> Hello,
>
> On Fri, Jan 19, 2024 at 8:39 AM Ben Gainey <[email protected]>
> wrote:
> >
> > This change allows events to use PERF_SAMPLE READ with inherit so
> > long
> > as both inherit_stat and PERF_SAMPLE_TID are set.
> >
> > Currently it is not possible to use PERF_SAMPLE_READ with inherit.
> > This
> > restriction assumes the user is interested in collecting aggregate
> > statistics as per `perf stat`. It prevents a user from collecting
> > per-thread samples using counter groups from a multi-threaded or
> > multi-process application, as with `perf record -e '{....}:S'`.
> > Instead
> > users must use system-wide mode, or forgo the ability to sample
> > counter
> > groups. System-wide mode is often problematic as it requires
> > specific
> > permissions (no CAP_PERFMON / root access), or may lead to capture
> > of
> > significant amounts of extra data from other processes running on
> > the
> > system.
> >
> > Perf already supports the ability to collect per-thread counts with
> > `inherit` via the `inherit_stat` flag. This patch changes
> > `perf_event_alloc` relaxing the restriction to combine `inherit`
> > with
> > `PERF_SAMPLE_READ` so that the combination will be allowed so long
> > as
> > `inherit_stat` and `PERF_SAMPLE_TID` are enabled.
>
> I'm not sure if it's correct. Maybe I misunderstand inherit_stat but
> AFAIK it's just to use prev_task's events when next_task has the
> compatible event context. So the event values it sees in samples
> would depend on the timing or scheduler behavior.

I think you are referring to __perf_event_sync_stat as called from
perf_event_context_sched_out, but isn't the point that
perf_event_context_sched_out will just swap the tasks and RCU pointers
as an optmisation when the incoming context is the same as the outgoing
context (rather than stopping and restarting), and so when inherit_stat
is used, along with swaping the outgong and incoming task data in the
event, it also reads and then swaps the outgoing and incoming counter
values in the event so that the counter values that belong to the
outgoing event are correctly associated to that task.

Looking again at perf_event_context_sched_out though, one thing i note
is that the call to perf_event_sync_stat happens after the call to
perf_ctx_enable, which I think means two things:
* The pmu->read call in the sync will be operating on an actively
counting PMU, so the counts recorded in each event in the context may
be skewed relative to each other. This is not the case elsewhere so
that the events on the PMU are read in the disabled state.
* perc_ctx_enable (at least for arm_pmu) will not reload the the
incoming "remaining" period for any sampling events so its possible
that an overflow would happen sooner than expected in the incoming
context (though i don't think this will leave a corrupted count, just a
smaller than expected count for that sample).


>
> Also event counts and time values PERF_SAMPLE_READ sees
> include child event's so the values of the parent event can be
> updated even if it's inactive. And the values will vary for the
> next_task whether prev_task is the parent or not. I think it
> would return consistent values only if it iterates all child events
> and sums up the values like it does for read(2). But it cannot
> do that in the NMI handler.
>
> Frankly I don't understand how inherit_stat supports per-thread
> counts properly. Also it doesn't seem to be used by default in
> the perf tools.

Hmmm, ok rereading through core.c, i think the thing I have missed is
the interaction between perf_event_count and sync_child_event, which I
guess I missed when testing this because IIRC sync_child_event is only
called on migration, hotplug and task exit events.

I don't think it would be sensible to skip the update to child_count in
sync_child_event as this would break the behaviour of the `read()` on
an event's fd. I suppose the better thing to do would be to have
perf_output_read_group/_one avoid reading child_count for these kinds
of events. That would ensure the mmap sample is correct.


> IIUC per-thread count is supported when you
> don't set the inherit bit and open separate events for each
> thread but I guess that's not what you want.

You can take that approach... but I don't think it is pleasent :-).
If you have an application that spawns threads at runtime the tool must
somehow track them (perhaps by polling proc or using ptrace). Polling
is not ideal as it can miss things or introduce high overhead in the
tool. More importantly, creating a new event per thread can consume a
lot of file descriptors, particularly if you open per-thread-per-cpu
events to have them write to the same mmap.

FWIW I recently prototyped a version of this in Arm's profiler tools,
where we are also prototyping support for per-function metrics... On
something like a Gravaton 3, having 64 cores, where the full set of
published metrics for Neoverse-V1 contains something like 60 PMU
events, tracing a database application that creates a new thread per
connection, it spawned ~30 threads... the tool tried to create ~100k
perf events... hit ulimit and terminated. I realize this annecdote is
on the extreme end of things, but it is possible this would be a common
issue for large core-count server systems.

>
> Anyway, I'm ok with the idea of using PERF_SAMPLE_READ to
> improve per-thread profiling especially with event groups.
> But I think it should not use inherit_stat and it needs a way to
> not include child stats in the samples.
>
> What do you think?

I'm happy to find a different way to opt in to this instead of
`inherit_stat` if people prefer. Though I think if my understanding of
__perf_event_sync_stat described above is correct, then so long as i
fix the behaviour of perf_output_read_group/_one then I think the
sampled counts would be correct.


I can look at adding a new flag bit to opt in... this would probably
also eleviate some of Andi's concerns in the other part of this thread.
Open to suggestions otherwise...

Thanks
Ben


>
> Thanks,
> Namhyung
>
> >
> > In this configuration stream ids (such as may appear in the
> > read_format
> > field of a PERF_RECORD_SAMPLE) are no longer globally unique,
> > rather
> > the pair of (stream id, tid) uniquely identify each event. Tools
> > that
> > rely on this, for example to calculate a delta between samples,
> > would
> > need updating to take this into account. Previously valid event
> > configurations (system-wide, no-inherit and so on) where each
> > stream id
> > is the identifier are unaffected.
> >
> > This patch has been tested on aarch64 both my manual inspection of
> > the
> > output of `perf script -D` and through a modified version of Arm's
> > commercial profiling tools and the numbers appear to line up as one
> > would expect, but some further validation across other
> > architectures
> > and/or edge cases would be welcome.
> >
> > This patch was developed and tested on top of v6.7.
> >
> >
> > Ben Gainey (1):
> > perf: Support PERF_SAMPLE_READ with inherit_stat
> >
> > kernel/events/core.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > --
> > 2.43.0
> >

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.