This series allows events to use PERF_SAMPLE_READ with inherit so long
as PERF_SAMPLE_TID is also set.
Currently it is not possible to use PERF_SAMPLE_READ with inherit. This
restriction assumes the user is interested in collecting aggregate
statistics as per `perf stat`. It prevents a user from collecting
per-thread samples using counter groups from a multi-threaded or
multi-process application, as with `perf record -e '{....}:S'`. Instead,
users must either use system-wide mode, forgo the ability to sample
counter groups, or profile a single thread. System-wide mode is often
problematic: it requires specific permissions (CAP_PERFMON or root
access), and it may capture significant amounts of extra data from
other processes running on the system.
This series changes `perf_event_alloc`, relaxing the restriction against
combining `inherit` with `PERF_SAMPLE_READ` so that the combination is
allowed so long as `PERF_SAMPLE_TID` is also enabled. It modifies
sampling so that only the count associated with the active thread is
recorded into the buffer. It modifies the context switch handling so
that perf contexts are always switched out if they have this kind of
event so that the correct per-thread state is maintained. Finally, the
tools are updated to allow perf record to specify this combination and
to correctly decode the sample data.
In this configuration, sample values, as they may appear in the
read_format field of a PERF_RECORD_SAMPLE, are no longer global
counters. Instead each value reports the per-thread count for the
active thread.
Tools that expect the global total, for example when calculating a
delta between samples, would need updating to take this into account when
opting into this new behaviour. Previously valid event configurations
(system-wide, no-inherit and so on) are unaffected.
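For illustration, a minimal sketch of the perf_event_attr configuration
that this series enables (counter choice and sample_period are
arbitrary; error handling, enabling the group, and the sibling events
are omitted):

#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/*
 * Open a sampling group leader for @pid that is inherited into child
 * threads/processes and reports per-thread counts via read_format.
 */
static int open_inherited_sampling_leader(pid_t pid)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        /* PERF_SAMPLE_TID is required alongside PERF_SAMPLE_READ */
        attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_READ;
        attr.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
        attr.inherit = 1;       /* previously rejected with PERF_SAMPLE_READ */
        attr.disabled = 1;

        return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

Sibling counters would then be opened with group_fd set to the returned
descriptor; the PERF_SAMPLE_TID bit is what the kernel checks for before
accepting inherit together with PERF_SAMPLE_READ.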
Changes since v7:
- Rebase on v6.10-rc3
- Respond to Peter Zijlstra's feedback:
- Renamed nr_pending to nr_no_switch_fast and merged in nr_inherit_read,
which otherwise had overlapping use
- Updated some of the commit messages to provide better justification
of the use case, behavioural changes and so on
- Cleanup perf_event_count/_cumulative
- Make it explicit that the sampling event decides whether or not the
per-thread value is given in read_format for PERF_RECORD_SAMPLE and
PERF_RECORD_READ; updated tools to account for this.
Changes since v6:
- Rebase on v6.10-rc2
- Make additional "perf test" tests succeed / skip based on kernel
version as per feedback from Namhyung.
Changes since v5:
- Rebase on v6.9
- Cleanup feedback from Namhyung Kim
Changes since v4:
- Rebase on v6.9-rc1
- Removed the dependency on inherit_stat that was previously assumed
necessary as per feedback from Namhyung Kim.
- Fixed an incorrect use of zfree instead of free in the tools leading
to an abort on tool shutdown.
- Additional test coverage improvements added to perf test.
- Cleaned up the remaining bit of irrelevant change missed between v3
and v4.
Changes since v3:
- Cleaned up perf test data changes incorrectly included into this
series from elsewhere.
Changes since v2:
- Rebase on v6.8
- Respond to James Clark's feedback; fixed up some typos and moved some
repeated checks into a helper macro.
- Cleaned up checkpatch lints.
- Updated perf test; fixed evsel handling so that existing tests pass
and added new tests to cover the new behaviour.
Changes since v1:
- Rebase on v6.8-rc1
- Fixed value written into sample after child exits.
- Modified handling of switch-out so that contexts with these events
take the slow path, so that the per-event/per-thread PMU state is
correctly switched.
- Modified perf tools to support this mode of operation.
Ben Gainey (4):
perf: Rename perf_event_context.nr_pending to nr_no_switch_fast.
perf: Support PERF_SAMPLE_READ with inherit
tools/perf: Correctly calculate sample period for inherited
SAMPLE_READ values
tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events
include/linux/perf_event.h | 8 ++-
kernel/events/core.c | 69 +++++++++++++------
tools/lib/perf/evsel.c | 48 +++++++++++++
tools/lib/perf/include/internal/evsel.h | 63 ++++++++++++++++-
tools/perf/tests/attr/README | 2 +
.../tests/attr/test-record-group-sampling | 3 +-
.../tests/attr/test-record-group-sampling1 | 51 ++++++++++++++
.../tests/attr/test-record-group-sampling2 | 61 ++++++++++++++++
tools/perf/tests/attr/test-record-group2 | 1 +
...{test-record-group2 => test-record-group3} | 10 +--
tools/perf/util/evsel.c | 19 ++++-
tools/perf/util/evsel.h | 1 +
tools/perf/util/session.c | 25 ++++---
13 files changed, 321 insertions(+), 40 deletions(-)
create mode 100644 tools/perf/tests/attr/test-record-group-sampling1
create mode 100644 tools/perf/tests/attr/test-record-group-sampling2
copy tools/perf/tests/attr/{test-record-group2 => test-record-group3} (81%)
--
2.45.2
nr_pending counts the number of events in the context that have either
pending_sigtrap or pending_work set, but it is used to prevent taking
the fast path in perf_event_context_sched_out. Rename it to reflect
what it is used for, rather than what it counts. This allows the field
to be used to track other event properties that also require skipping
the fast path, without possible confusion over the name.
Signed-off-by: Ben Gainey <[email protected]>
---
include/linux/perf_event.h | 5 ++++-
kernel/events/core.c | 14 +++++++-------
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a5304ae8c654..c0c6c70bb9f1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -962,12 +962,15 @@ struct perf_event_context {
struct rcu_head rcu_head;
/*
+ * The count of events for which using the switch-out fast path
+ * should be avoided.
+ *
* Sum (event->pending_sigtrap + event->pending_work)
*
* The SIGTRAP is targeted at ctx->task, as such it won't do changing
* that until the signal is delivered.
*/
- local_t nr_pending;
+ local_t nr_no_switch_fast;
};
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f908f077935..c9cb22304d11 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2295,7 +2295,7 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
task_work_add(current, &event->pending_task, TWA_RESUME);
}
if (dec)
- local_dec(&event->ctx->nr_pending);
+ local_dec(&event->ctx->nr_no_switch_fast);
}
perf_event_set_state(event, state);
@@ -3531,9 +3531,9 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
perf_ctx_disable(ctx, false);
- /* PMIs are disabled; ctx->nr_pending is stable. */
- if (local_read(&ctx->nr_pending) ||
- local_read(&next_ctx->nr_pending)) {
+ /* PMIs are disabled; ctx->nr_no_switch_fast is stable. */
+ if (local_read(&ctx->nr_no_switch_fast) ||
+ local_read(&next_ctx->nr_no_switch_fast)) {
/*
* Must not swap out ctx when there's pending
* events that rely on the ctx->task relation.
@@ -6768,7 +6768,7 @@ static void __perf_pending_irq(struct perf_event *event)
if (event->pending_sigtrap) {
event->pending_sigtrap = 0;
perf_sigtrap(event);
- local_dec(&event->ctx->nr_pending);
+ local_dec(&event->ctx->nr_no_switch_fast);
}
if (event->pending_disable) {
event->pending_disable = 0;
@@ -6841,7 +6841,7 @@ static void perf_pending_task(struct callback_head *head)
if (event->pending_work) {
event->pending_work = 0;
perf_sigtrap(event);
- local_dec(&event->ctx->nr_pending);
+ local_dec(&event->ctx->nr_no_switch_fast);
}
if (rctx >= 0)
@@ -9711,7 +9711,7 @@ static int __perf_event_overflow(struct perf_event *event,
pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
if (!event->pending_sigtrap) {
event->pending_sigtrap = pending_id;
- local_inc(&event->ctx->nr_pending);
+ local_inc(&event->ctx->nr_no_switch_fast);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
--
2.45.2
This change allows events to use PERF_SAMPLE_READ with inherit
so long as PERF_SAMPLE_TID is also set. This enables sample-based
profiling of a group of counters over a hierarchy of processes or
threads. This is useful, for example, for collecting per-thread
counters/metrics, event-based sampling of multiple counters as a unit,
access to the enabled and running time when using multiplexing, and so
on.
Prior to this, users were restricted to collecting aggregate statistics
for a multi-threaded/multi-process application (e.g. with "perf stat"),
sampling individual threads, or profiling the entire system (which
requires root or CAP_PERFMON, and may produce much more data than is
required). Theoretically a tool could monitor thread/process creation
and open whatever events the user is interested in via perf_event_open
for each new thread or process, but this is racy, can lead to
file-descriptor exhaustion, and ultimately just replicates the behaviour
of inherit in userspace.
This configuration differs from inherit without PERF_SAMPLE_READ in
that the accumulated event count, and consequently any sample (such as
one triggered by overflow of sample_period), will be on a per-thread
rather than an aggregate basis.
The meaning of the read_format::value field of both PERF_RECORD_READ and
PERF_RECORD_SAMPLE is changed such that if the sampled event uses this
new configuration then the values reported will be per-thread rather
than the global aggregate value. This is a change from the existing
semantics of read_format (where PERF_SAMPLE_READ is used without
inherit), but it is necessary to expose the per-thread counter values,
and it avoids reinventing a separate "read_format_thread" field that
otherwise replicates the same behaviour. This change should not break
existing tools, since this configuration was not previously valid and
was rejected by the kernel. Tools that opt into this new mode will need
to account for this when calculating the counter delta for a given
sample. Tools that wish to have both the per-thread and aggregate value
can perform the global aggregation themselves from the per-thread
values.
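As a sketch of the bookkeeping this implies for a tool (the helper name
is hypothetical): the previously seen total must be tracked per
(event id, tid) pair rather than per event id alone:

#include <stdint.h>

/* Hypothetical: returns pointer to the stored previous total for this
 * (event id, tid) pair, zero-initialised on first use. */
extern uint64_t *lookup_prev_total(uint64_t id, uint32_t tid);

/* Sketch only: compute the period delta for one sample value in the
 * per-thread mode. */
static uint64_t sample_delta(uint64_t id, uint32_t tid, uint64_t value)
{
        uint64_t *prev = lookup_prev_total(id, tid);
        uint64_t delta = value - *prev;

        *prev = value;
        return delta;
}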
The change to read_format::value does not affect existing valid
perf_event_attr configurations, nor does it change the behaviour of
calls to "read" on an event descriptor. Both continue to report the
aggregate value for the entire thread/process hierarchy. The difference
between the results reported by "read" and PERF_RECORD_SAMPLE in this
new configuration is justified on the basis that it is not (easily)
possible for "read" to target a specific thread (the caller only has
the fd for the original parent event).
Signed-off-by: Ben Gainey <[email protected]>
---
include/linux/perf_event.h | 3 ++-
kernel/events/core.c | 55 ++++++++++++++++++++++++++++----------
2 files changed, 43 insertions(+), 15 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c0c6c70bb9f1..291c232bd3aa 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -965,7 +965,8 @@ struct perf_event_context {
* The count of events for which using the switch-out fast path
* should be avoided.
*
- * Sum (event->pending_sigtrap + event->pending_work)
+ * Sum (event->pending_sigtrap + event->pending_work
+ * + (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ)))
*
* The SIGTRAP is targeted at ctx->task, as such it won't do changing
* that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c9cb22304d11..3ad79ce8634b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1767,6 +1767,14 @@ perf_event_groups_next(struct perf_event *event, struct pmu *pmu)
event = rb_entry_safe(rb_next(&event->group_node), \
typeof(*event), group_node))
+/*
+ * Does the event attribute request inherit with PERF_SAMPLE_READ
+ */
+static inline bool has_inherit_and_sample_read(struct perf_event_attr *attr)
+{
+ return attr->inherit && (attr->sample_type & PERF_SAMPLE_READ);
+}
+
/*
* Add an event from the lists for its context.
* Must be called with ctx->mutex and ctx->lock held.
@@ -1797,6 +1805,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
ctx->nr_user++;
if (event->attr.inherit_stat)
ctx->nr_stat++;
+ if (has_inherit_and_sample_read(&event->attr))
+ local_inc(&ctx->nr_no_switch_fast);
if (event->state > PERF_EVENT_STATE_OFF)
perf_cgroup_event_enable(event, ctx);
@@ -2021,6 +2031,8 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
ctx->nr_user--;
if (event->attr.inherit_stat)
ctx->nr_stat--;
+ if (has_inherit_and_sample_read(&event->attr))
+ local_dec(&ctx->nr_no_switch_fast);
list_del_rcu(&event->event_entry);
@@ -3537,6 +3549,11 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
/*
* Must not swap out ctx when there's pending
* events that rely on the ctx->task relation.
+ *
+ * Likewise, when a context contains inherit +
+ * SAMPLE_READ events they should be switched
+ * out using the slow path so that they are
+ * treated as if they were distinct contexts.
*/
raw_spin_unlock(&next_ctx->lock);
rcu_read_unlock();
@@ -4552,8 +4569,11 @@ static void __perf_event_read(void *info)
raw_spin_unlock(&ctx->lock);
}
-static inline u64 perf_event_count(struct perf_event *event)
+static inline u64 perf_event_count(struct perf_event *event, bool self)
{
+ if (self)
+ return local64_read(&event->count);
+
return local64_read(&event->count) + atomic64_read(&event->child_count);
}
@@ -5486,7 +5506,7 @@ static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *
mutex_lock(&event->child_mutex);
(void)perf_event_read(event, false);
- total += perf_event_count(event);
+ total += perf_event_count(event, false);
*enabled += event->total_time_enabled +
atomic64_read(&event->child_total_time_enabled);
@@ -5495,7 +5515,7 @@ static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *
list_for_each_entry(child, &event->child_list, child_list) {
(void)perf_event_read(child, false);
- total += perf_event_count(child);
+ total += perf_event_count(child, false);
*enabled += child->total_time_enabled;
*running += child->total_time_running;
}
@@ -5577,14 +5597,14 @@ static int __perf_read_group_add(struct perf_event *leader,
/*
* Write {count,id} tuples for every sibling.
*/
- values[n++] += perf_event_count(leader);
+ values[n++] += perf_event_count(leader, false);
if (read_format & PERF_FORMAT_ID)
values[n++] = primary_event_id(leader);
if (read_format & PERF_FORMAT_LOST)
values[n++] = atomic64_read(&leader->lost_samples);
for_each_sibling_event(sub, leader) {
- values[n++] += perf_event_count(sub);
+ values[n++] += perf_event_count(sub, false);
if (read_format & PERF_FORMAT_ID)
values[n++] = primary_event_id(sub);
if (read_format & PERF_FORMAT_LOST)
@@ -6164,7 +6184,7 @@ void perf_event_update_userpage(struct perf_event *event)
++userpg->lock;
barrier();
userpg->index = perf_event_index(event);
- userpg->offset = perf_event_count(event);
+ userpg->offset = perf_event_count(event, false);
if (userpg->index)
userpg->offset -= local64_read(&event->hw.prev_count);
@@ -7224,7 +7244,7 @@ static void perf_output_read_one(struct perf_output_handle *handle,
u64 values[5];
int n = 0;
- values[n++] = perf_event_count(event);
+ values[n++] = perf_event_count(event, has_inherit_and_sample_read(&event->attr));
if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
values[n++] = enabled +
atomic64_read(&event->child_total_time_enabled);
@@ -7242,14 +7262,15 @@ static void perf_output_read_one(struct perf_output_handle *handle,
}
static void perf_output_read_group(struct perf_output_handle *handle,
- struct perf_event *event,
- u64 enabled, u64 running)
+ struct perf_event *event,
+ u64 enabled, u64 running)
{
struct perf_event *leader = event->group_leader, *sub;
u64 read_format = event->attr.read_format;
unsigned long flags;
u64 values[6];
int n = 0;
+ bool self = has_inherit_and_sample_read(&event->attr);
/*
* Disabling interrupts avoids all counter scheduling
@@ -7269,7 +7290,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
(leader->state == PERF_EVENT_STATE_ACTIVE))
leader->pmu->read(leader);
- values[n++] = perf_event_count(leader);
+ values[n++] = perf_event_count(leader, self);
if (read_format & PERF_FORMAT_ID)
values[n++] = primary_event_id(leader);
if (read_format & PERF_FORMAT_LOST)
@@ -7284,7 +7305,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
(sub->state == PERF_EVENT_STATE_ACTIVE))
sub->pmu->read(sub);
- values[n++] = perf_event_count(sub);
+ values[n++] = perf_event_count(sub, self);
if (read_format & PERF_FORMAT_ID)
values[n++] = primary_event_id(sub);
if (read_format & PERF_FORMAT_LOST)
@@ -7305,6 +7326,10 @@ static void perf_output_read_group(struct perf_output_handle *handle,
* The problem is that its both hard and excessively expensive to iterate the
* child list, not to mention that its impossible to IPI the children running
* on another CPU, from interrupt/NMI context.
+ *
+ * Instead the combination of PERF_SAMPLE_READ and inherit will track per-thread
+ * counts rather than attempting to accumulate some value across all children on
+ * all cores.
*/
static void perf_output_read(struct perf_output_handle *handle,
struct perf_event *event)
@@ -12034,10 +12059,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
local64_set(&hwc->period_left, hwc->sample_period);
/*
- * We currently do not support PERF_SAMPLE_READ on inherited events.
+ * We do not support PERF_SAMPLE_READ on inherited events unless
+ * PERF_SAMPLE_TID is also selected, which allows inherited events to
+ * collect per-thread samples.
* See perf_output_read().
*/
- if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ))
+ if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID))
goto err_ns;
if (!has_branch_stack(event))
@@ -13061,7 +13088,7 @@ static void sync_child_event(struct perf_event *child_event)
perf_event_read_event(child_event, task);
}
- child_val = perf_event_count(child_event);
+ child_val = perf_event_count(child_event, false);
/*
* Add back the child's count to the parent's count:
--
2.45.2
Sample period calculation in deliver_sample_value is updated to
calculate the per-thread period delta for events that are inherit +
PERF_SAMPLE_READ. When the sampling event has this configuration, the
read_format.id is used with the tid from the sample to look up the
storage of the previously accumulated counter total before calculating
the delta. All existing valid configurations where read_format.value
represents some global value continue to use just the read_format.id to
locate the storage of the previously accumulated total.
perf_sample_id is modified to support tracking per-thread
values, along with the existing global per-id values. In the
per-thread case, values are stored in a hash by tid within the
perf_sample_id, and are dynamically allocated, as the number of threads
is not known ahead of time.
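With both pieces in place, the delta calculation in deliver_sample_value
reduces to the following (as in the diff below), where the storage
pointer resolves to either the global sid->period or a per-tid hash
entry:

        storage = perf_sample_id__get_period_storage(sid, sample->tid, per_thread);
        if (storage) {
                sample->period = v->value - *storage;
                *storage = v->value;
        }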
Signed-off-by: Ben Gainey <[email protected]>
---
tools/lib/perf/evsel.c | 48 +++++++++++++++++++
tools/lib/perf/include/internal/evsel.h | 63 ++++++++++++++++++++++++-
tools/perf/util/session.c | 25 ++++++----
3 files changed, 126 insertions(+), 10 deletions(-)
diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
index c07160953224..abdae2f9498b 100644
--- a/tools/lib/perf/evsel.c
+++ b/tools/lib/perf/evsel.c
@@ -5,6 +5,7 @@
#include <perf/evsel.h>
#include <perf/cpumap.h>
#include <perf/threadmap.h>
+#include <linux/hash.h>
#include <linux/list.h>
#include <internal/evsel.h>
#include <linux/zalloc.h>
@@ -23,6 +24,7 @@ void perf_evsel__init(struct perf_evsel *evsel, struct perf_event_attr *attr,
int idx)
{
INIT_LIST_HEAD(&evsel->node);
+ INIT_LIST_HEAD(&evsel->per_stream_periods);
evsel->attr = *attr;
evsel->idx = idx;
evsel->leader = evsel;
@@ -531,10 +533,56 @@ int perf_evsel__alloc_id(struct perf_evsel *evsel, int ncpus, int nthreads)
void perf_evsel__free_id(struct perf_evsel *evsel)
{
+ struct perf_sample_id_period *pos, *n;
+
xyarray__delete(evsel->sample_id);
evsel->sample_id = NULL;
zfree(&evsel->id);
evsel->ids = 0;
+
+ perf_evsel_for_each_per_thread_period_safe(evsel, n, pos) {
+ list_del_init(&pos->node);
+ free(pos);
+ }
+}
+
+bool perf_evsel__attr_has_per_thread_sample_period(struct perf_evsel *evsel)
+{
+ return (evsel->attr.sample_type & PERF_SAMPLE_READ)
+ && (evsel->attr.sample_type & PERF_SAMPLE_TID)
+ && evsel->attr.inherit;
+}
+
+u64 *perf_sample_id__get_period_storage(struct perf_sample_id *sid, u32 tid, bool per_thread)
+{
+ struct hlist_head *head;
+ struct perf_sample_id_period *res;
+ int hash;
+
+ if (!per_thread)
+ return &sid->period;
+
+ hash = hash_32(tid, PERF_SAMPLE_ID__HLIST_BITS);
+ head = &sid->periods[hash];
+
+ hlist_for_each_entry(res, head, hnode)
+ if (res->tid == tid)
+ return &res->period;
+
+ if (sid->evsel == NULL)
+ return NULL;
+
+ res = zalloc(sizeof(struct perf_sample_id_period));
+ if (res == NULL)
+ return NULL;
+
+ INIT_LIST_HEAD(&res->node);
+ res->tid = tid;
+
+ list_add_tail(&res->node, &sid->evsel->per_stream_periods);
+ hlist_add_head(&res->hnode, &sid->periods[hash]);
+
+ return &res->period;
}
void perf_counts_values__scale(struct perf_counts_values *count,
diff --git a/tools/lib/perf/include/internal/evsel.h b/tools/lib/perf/include/internal/evsel.h
index 5cd220a61962..ea78defa77d0 100644
--- a/tools/lib/perf/include/internal/evsel.h
+++ b/tools/lib/perf/include/internal/evsel.h
@@ -11,6 +11,32 @@
struct perf_thread_map;
struct xyarray;
+/**
+ * The per-thread accumulated period storage node.
+ */
+struct perf_sample_id_period {
+ struct list_head node;
+ struct hlist_node hnode;
+ /* Holds total ID period value for PERF_SAMPLE_READ processing. */
+ u64 period;
+ /* The TID that the values belong to */
+ u32 tid;
+};
+
+/**
+ * perf_evsel_for_each_per_thread_period_safe - safely iterate through all the
+ * per_stream_periods
+ * @evsel: perf_evsel instance to iterate
+ * @item: struct perf_sample_id_period iterator
+ * @tmp: struct perf_sample_id_period temp iterator
+ */
+#define perf_evsel_for_each_per_thread_period_safe(evsel, tmp, item) \
+ list_for_each_entry_safe(item, tmp, &(evsel)->per_stream_periods, node)
+
+
+#define PERF_SAMPLE_ID__HLIST_BITS 4
+#define PERF_SAMPLE_ID__HLIST_SIZE (1 << PERF_SAMPLE_ID__HLIST_BITS)
+
/*
* Per fd, to map back from PERF_SAMPLE_ID to evsel, only used when there are
* more than one entry in the evlist.
@@ -34,8 +60,32 @@ struct perf_sample_id {
pid_t machine_pid;
struct perf_cpu vcpu;
- /* Holds total ID period value for PERF_SAMPLE_READ processing. */
- u64 period;
+ /*
+ * Per-thread, and global event counts are mutually exclusive:
+ * Whilst it is possible to combine events into a group with differing
+ * values of PERF_SAMPLE_READ, it is not valid to have inconsistent
+ * values for `inherit`. Therefore it is not possible to have a
+ * situation where a per-thread event is sampled as a global event;
+ * all !inherit groups are global, and all groups where the sampling
+ * event is inherit + PERF_SAMPLE_READ will be per-thread. Any event
+ * that is part of such a group that is inherit but not PERF_SAMPLE_READ
+ * will be read as per-thread. If such an event can also trigger a
+ * sample (such as with sample_period > 0) then it will not cause
+ * `read_format` to be included in its PERF_RECORD_SAMPLE, and
+ * therefore will not expose the per-thread group members as global.
+ */
+ union {
+ /*
+ * Holds total ID period value for PERF_SAMPLE_READ processing
+ * (when period is not per-thread).
+ */
+ u64 period;
+ /*
+ * Holds total ID period value for PERF_SAMPLE_READ processing
+ * (when period is per-thread).
+ */
+ struct hlist_head periods[PERF_SAMPLE_ID__HLIST_SIZE];
+ };
};
struct perf_evsel {
@@ -58,6 +108,10 @@ struct perf_evsel {
u32 ids;
struct perf_evsel *leader;
+ /* For events where the read_format value is per-thread rather than
+ * global, stores the per-thread cumulative period */
+ struct list_head per_stream_periods;
+
/* parse modifier helper */
int nr_members;
/*
@@ -88,4 +142,9 @@ int perf_evsel__apply_filter(struct perf_evsel *evsel, const char *filter);
int perf_evsel__alloc_id(struct perf_evsel *evsel, int ncpus, int nthreads);
void perf_evsel__free_id(struct perf_evsel *evsel);
+bool perf_evsel__attr_has_per_thread_sample_period(struct perf_evsel *evsel);
+
+u64 *perf_sample_id__get_period_storage(struct perf_sample_id *sid, u32 tid,
+ bool per_thread);
+
#endif /* __LIBPERF_INTERNAL_EVSEL_H */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a10343b9dcd4..3515024986d5 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1474,18 +1474,24 @@ static int deliver_sample_value(struct evlist *evlist,
union perf_event *event,
struct perf_sample *sample,
struct sample_read_value *v,
- struct machine *machine)
+ struct machine *machine,
+ bool per_thread)
{
struct perf_sample_id *sid = evlist__id2sid(evlist, v->id);
struct evsel *evsel;
+ u64 *storage = NULL;
if (sid) {
+ storage = perf_sample_id__get_period_storage(sid, sample->tid, per_thread);
+ }
+
+ if (storage) {
sample->id = v->id;
- sample->period = v->value - sid->period;
- sid->period = v->value;
+ sample->period = v->value - *storage;
+ *storage = v->value;
}
- if (!sid || sid->evsel == NULL) {
+ if (!storage || sid->evsel == NULL) {
++evlist->stats.nr_unknown_id;
return 0;
}
@@ -1506,14 +1512,15 @@ static int deliver_sample_group(struct evlist *evlist,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine,
- u64 read_format)
+ u64 read_format,
+ bool per_thread)
{
int ret = -EINVAL;
struct sample_read_value *v = sample->read.group.values;
sample_read_group__for_each(v, sample->read.group.nr, read_format) {
ret = deliver_sample_value(evlist, tool, event, sample, v,
- machine);
+ machine, per_thread);
if (ret)
break;
}
@@ -1528,6 +1535,7 @@ static int evlist__deliver_sample(struct evlist *evlist, struct perf_tool *tool,
/* We know evsel != NULL. */
u64 sample_type = evsel->core.attr.sample_type;
u64 read_format = evsel->core.attr.read_format;
+ bool per_thread = perf_evsel__attr_has_per_thread_sample_period(&evsel->core);
/* Standard sample delivery. */
if (!(sample_type & PERF_SAMPLE_READ))
@@ -1536,10 +1544,11 @@ static int evlist__deliver_sample(struct evlist *evlist, struct perf_tool *tool,
/* For PERF_SAMPLE_READ we have either single or group mode. */
if (read_format & PERF_FORMAT_GROUP)
return deliver_sample_group(evlist, tool, event, sample,
- machine, read_format);
+ machine, read_format, per_thread);
else
return deliver_sample_value(evlist, tool, event, sample,
- &sample->read.one, machine);
+ &sample->read.one, machine,
+ per_thread);
}
static int machines__deliver_event(struct machines *machines,
--
2.45.2
The "perf record" tool will now default to this new mode if the user
specifies a sampling group when not in system-wide mode, and when
"--no-inherit" is not specified.
This change updates evsel to allow the combination of inherit
and PERF_SAMPLE_READ.
A fallback is implemented for kernel versions where this feature is not
supported.
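For example (application name illustrative), the following now opens
the group with inherit=1 and per-thread sample values on kernels that
support it, while on older kernels perf record detects the missing
feature and falls back to inherit=0 as before:

  $ perf record -e '{cycles,cache-misses}:S' -- ./my_threaded_app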
Signed-off-by: Ben Gainey <[email protected]>
---
tools/perf/tests/attr/README | 2 +
.../tests/attr/test-record-group-sampling | 3 +-
.../tests/attr/test-record-group-sampling1 | 51 ++++++++++++++++
.../tests/attr/test-record-group-sampling2 | 61 +++++++++++++++++++
tools/perf/tests/attr/test-record-group2 | 1 +
...{test-record-group2 => test-record-group3} | 10 +--
tools/perf/util/evsel.c | 19 +++++-
tools/perf/util/evsel.h | 1 +
8 files changed, 141 insertions(+), 7 deletions(-)
create mode 100644 tools/perf/tests/attr/test-record-group-sampling1
create mode 100644 tools/perf/tests/attr/test-record-group-sampling2
copy tools/perf/tests/attr/{test-record-group2 => test-record-group3} (81%)
diff --git a/tools/perf/tests/attr/README b/tools/perf/tests/attr/README
index 4066fec7180a..67c4ca76b85d 100644
--- a/tools/perf/tests/attr/README
+++ b/tools/perf/tests/attr/README
@@ -51,6 +51,8 @@ Following tests are defined (with perf commands):
perf record --call-graph fp kill (test-record-graph-fp-aarch64)
perf record -e '{cycles,instructions}' kill (test-record-group1)
perf record -e '{cycles/period=1/,instructions/period=2/}:S' kill (test-record-group2)
+ perf record -e '{cycles,cache-misses}:S' kill (test-record-group-sampling1)
+ perf record -c 10000 -e '{cycles,cache-misses}:S' kill (test-record-group-sampling2)
perf record -D kill (test-record-no-delay)
perf record -i kill (test-record-no-inherit)
perf record -n kill (test-record-no-samples)
diff --git a/tools/perf/tests/attr/test-record-group-sampling b/tools/perf/tests/attr/test-record-group-sampling
index 97e7e64a38f0..c32ceb156226 100644
--- a/tools/perf/tests/attr/test-record-group-sampling
+++ b/tools/perf/tests/attr/test-record-group-sampling
@@ -2,6 +2,7 @@
command = record
args = --no-bpf-event -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
ret = 1
+kernel_until = 6.10
[event-1:base-record]
fd=1
@@ -18,7 +19,7 @@ group_fd=1
type=0
config=3
-# default | PERF_SAMPLE_READ
+# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
sample_type=343
# PERF_FORMAT_ID | PERF_FORMAT_GROUP | PERF_FORMAT_LOST
diff --git a/tools/perf/tests/attr/test-record-group-sampling1 b/tools/perf/tests/attr/test-record-group-sampling1
new file mode 100644
index 000000000000..db6404c1ef78
--- /dev/null
+++ b/tools/perf/tests/attr/test-record-group-sampling1
@@ -0,0 +1,51 @@
+[config]
+command = record
+args = --no-bpf-event -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
+ret = 1
+kernel_since = 6.10
+
+[event-1:base-record]
+fd=1
+group_fd=-1
+
+# cycles
+type=0
+config=0
+
+# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
+sample_type=343
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=1
+mmap=1
+comm=1
+enable_on_exec=1
+disabled=1
+
+# inherit is enabled for group sampling
+inherit=1
+
+[event-2:base-record]
+fd=2
+group_fd=1
+
+# cache-misses
+type=0
+config=3
+
+# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
+sample_type=343
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=0
+mmap=0
+comm=0
+enable_on_exec=0
+disabled=0
+freq=0
+
+# inherit is enabled for group sampling
+inherit=1
+
diff --git a/tools/perf/tests/attr/test-record-group-sampling2 b/tools/perf/tests/attr/test-record-group-sampling2
new file mode 100644
index 000000000000..32884df78c95
--- /dev/null
+++ b/tools/perf/tests/attr/test-record-group-sampling2
@@ -0,0 +1,61 @@
+[config]
+command = record
+args = --no-bpf-event -c 10000 -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
+ret = 1
+kernel_since = 6.10
+
+[event-1:base-record]
+fd=1
+group_fd=-1
+
+# cycles
+type=0
+config=0
+
+# default | PERF_SAMPLE_READ
+sample_type=87
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=1
+mmap=1
+comm=1
+enable_on_exec=1
+disabled=1
+
+# inherit is enabled for group sampling
+inherit=1
+
+# sampling disabled
+sample_freq=0
+sample_period=10000
+freq=0
+write_backward=0
+
+[event-2:base-record]
+fd=2
+group_fd=1
+
+# cache-misses
+type=0
+config=3
+
+# default | PERF_SAMPLE_READ
+sample_type=87
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=0
+mmap=0
+comm=0
+enable_on_exec=0
+disabled=0
+
+# inherit is enabled for group sampling
+inherit=1
+
+# sampling disabled
+sample_freq=0
+sample_period=0
+freq=0
+write_backward=0
diff --git a/tools/perf/tests/attr/test-record-group2 b/tools/perf/tests/attr/test-record-group2
index cebdaa8e64e4..8fe6c679618c 100644
--- a/tools/perf/tests/attr/test-record-group2
+++ b/tools/perf/tests/attr/test-record-group2
@@ -2,6 +2,7 @@
command = record
args = --no-bpf-event -e '{cycles/period=1234000/,instructions/period=6789000/}:S' kill >/dev/null 2>&1
ret = 1
+kernel_until = 6.10
[event-1:base-record]
fd=1
diff --git a/tools/perf/tests/attr/test-record-group2 b/tools/perf/tests/attr/test-record-group3
similarity index 81%
copy from tools/perf/tests/attr/test-record-group2
copy to tools/perf/tests/attr/test-record-group3
index cebdaa8e64e4..72e6a9ee2f60 100644
--- a/tools/perf/tests/attr/test-record-group2
+++ b/tools/perf/tests/attr/test-record-group3
@@ -2,6 +2,7 @@
command = record
args = --no-bpf-event -e '{cycles/period=1234000/,instructions/period=6789000/}:S' kill >/dev/null 2>&1
ret = 1
+kernel_since = 6.10
[event-1:base-record]
fd=1
@@ -9,8 +10,9 @@ group_fd=-1
config=0|1
sample_period=1234000
sample_type=87
-read_format=12|28
-inherit=0
+read_format=28|31
+disabled=1
+inherit=1
freq=0
[event-2:base-record]
@@ -19,9 +21,9 @@ group_fd=1
config=0|1
sample_period=6789000
sample_type=87
-read_format=12|28
+read_format=28|31
disabled=0
-inherit=0
+inherit=1
mmap=0
comm=0
freq=0
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 4f818ab6b662..66b782811997 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1162,7 +1162,15 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
*/
if (leader->core.nr_members > 1) {
attr->read_format |= PERF_FORMAT_GROUP;
- attr->inherit = 0;
+ }
+
+ /*
+ * Inherit + SAMPLE_READ requires PERF_SAMPLE_TID and PERF_FORMAT_ID
+ */
+ if (attr->inherit) {
+ evsel__set_sample_bit(evsel, TID);
+ evsel->core.attr.read_format |=
+ PERF_FORMAT_ID;
}
}
@@ -1838,6 +1846,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,
static void evsel__disable_missing_features(struct evsel *evsel)
{
+ if (perf_missing_features.inherit_sample_read)
+ evsel->core.attr.inherit = 0;
if (perf_missing_features.branch_counters)
evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_COUNTERS;
if (perf_missing_features.read_lost)
@@ -1893,7 +1903,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.branch_counters &&
+ if (!perf_missing_features.inherit_sample_read &&
+ evsel->core.attr.inherit && (evsel->core.attr.sample_type & PERF_SAMPLE_READ)) {
+ perf_missing_features.inherit_sample_read = true;
+ pr_debug2("Using PERF_SAMPLE_READ / :S modifier is not compatible with inherit, falling back to no-inherit.\n");
+ return true;
+ } else if (!perf_missing_features.branch_counters &&
(evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS)) {
perf_missing_features.branch_counters = true;
pr_debug2("switching off branch counters support\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 375a38e15cd9..911c2fd42c6d 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -192,6 +192,7 @@ struct perf_missing_features {
bool weight_struct;
bool read_lost;
bool branch_counters;
+ bool inherit_sample_read;
};
extern struct perf_missing_features perf_missing_features;
--
2.45.2