2023-10-04 18:41:27

by Liang, Kan

Subject: [PATCH V4 1/7] perf: Add branch stack counters

From: Kan Liang <[email protected]>

Currently, the additional information of a branch entry is stored in a
u64 space. With more and more information added, the space is running
out. For example, the information of occurrences of events will be added
for each branch.

Two places were suggested to append the counters.
https://lore.kernel.org/lkml/[email protected]/
One place is right after the flags of each branch entry. It changes the
existing struct perf_branch_entry. The later ARCH specific
implementation has to be really careful to consistently pick
the right struct.
The other place is right after the entire struct perf_branch_stack.
The disadvantage is that the pointer of the extra space has to be
recorded. The common interface perf_sample_save_brstack() has to be
updated.

The latter is much more straightforward, and should be easy to understand
and maintain. It is implemented in this patch.
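As a rough illustration of the chosen layout, the following sketch computes the size of the branch stack portion of a sample when the counters are appended after all the branch entries. The struct and function names are hypothetical stand-ins rather than kernel identifiers; the entry is assumed to be three u64 words (from, to, flags).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct perf_branch_entry: from, to, u64 of flags. */
struct branch_entry_sketch {
	uint64_t from;
	uint64_t to;
	uint64_t flags;
};

/*
 * Size of the branch stack part of a sample. The counters, when present,
 * add one u64 per branch entry after the whole array of entries.
 */
static inline size_t brstack_sample_size(uint64_t nr, int has_hw_idx,
					 int has_counters)
{
	size_t size = sizeof(uint64_t);			/* nr */

	if (has_hw_idx)
		size += sizeof(uint64_t);		/* hw_idx */
	size += nr * sizeof(struct branch_entry_sketch);	/* lbr[nr] */
	if (has_counters)
		size += nr * sizeof(uint64_t);		/* cntr[nr] */

	return size;
}
```

Because the extra words come after the entries, the existing entry layout is left untouched and only the total size grows.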

Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate
the event which is recorded in the branch info.

The "u64 counters" may store the occurrences of several events. The
information regarding the number of events/counters and the width of
each counter should be exposed via sysfs as a reference for the perf
tool. Define the branch_counter_nr and branch_counter_width ABI here.
The support will be implemented later in the Intel-specific patch.
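A tool-side decoder for the packed word might look like the sketch below. It assumes that counter i occupies bits [i * width, (i + 1) * width) of the u64, which matches how the Intel patch in this series packs the logs. The helper name is hypothetical, and nr and width stand for the values read from branch_counter_nr and branch_counter_width.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): extract the idx-th counter
 * from the packed "u64 counters" word of one branch entry.
 * Assumption: counter idx starts at bit idx * width.
 */
static inline uint64_t branch_counter_get(uint64_t counters, unsigned int idx,
					  unsigned int nr, unsigned int width)
{
	/* All packed fields must fit in a single u64. */
	assert(width > 0 && width < 64 && idx < nr && nr * width <= 64);

	return (counters >> (idx * width)) & ((1ULL << width) - 1);
}
```

With nr = 4 and width = 2, the word 0x1b would decode to 3, 2, 1 and 0 for counters 0 through 3.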

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Cc: Sandipan Das <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Athira Rajeev <[email protected]>
---

Changes since V3:
- Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS
Drop the two branch sample types in V2.
- Add the branch_counter_nr and branch_counter_width ABI

.../testing/sysfs-bus-event_source-devices-caps | 6 ++++++
arch/powerpc/perf/core-book3s.c | 2 +-
arch/x86/events/amd/core.c | 2 +-
arch/x86/events/core.c | 2 +-
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 4 ++--
include/linux/perf_event.h | 17 ++++++++++++++++-
include/uapi/linux/perf_event.h | 10 ++++++++++
kernel/events/core.c | 8 ++++++++
9 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
index 8757dcf41c08..451f0c620aa7 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
@@ -16,3 +16,9 @@ Description:
Example output in powerpc:
grep . /sys/bus/event_source/devices/cpu/caps/*
/sys/bus/event_source/devices/cpu/caps/pmu_name:POWER9
+
+ The "branch_counter_nr" in the supported platform exposes the
+ maximum number of counters which can be shown in the u64 counters
+ of PERF_SAMPLE_BRANCH_COUNTERS, while the "branch_counter_width"
+ exposes the width of each counter. Both of them can be used by
+ the perf tool to parse the logged counters in each branch.
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 8c1f7def596e..3c14596bbfaf 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2313,7 +2313,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
struct cpu_hw_events *cpuhw;
cpuhw = this_cpu_ptr(&cpu_hw_events);
power_pmu_bhrb_read(event, cpuhw);
- perf_sample_save_brstack(&data, event, &cpuhw->bhrb_stack);
+ perf_sample_save_brstack(&data, event, &cpuhw->bhrb_stack, NULL);
}

if (event->attr.sample_type & PERF_SAMPLE_DATA_SRC &&
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index e24976593a29..4ee6390b45c9 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -940,7 +940,7 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
continue;

if (has_branch_stack(event))
- perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
+ perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);

if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 40ad1425ffa2..40c9af124128 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1702,7 +1702,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
perf_sample_data_init(&data, 0, event->hw.last_period);

if (has_branch_stack(event))
- perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
+ perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);

if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a08f794a0e79..41a164764a84 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3047,7 +3047,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
perf_sample_data_init(&data, 0, event->hw.last_period);

if (has_branch_stack(event))
- perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
+ perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);

if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index bf97ab904d40..cb3f329f8fa4 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1755,7 +1755,7 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
setup_pebs_time(event, data, pebs->tsc);

if (has_branch_stack(event))
- perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
+ perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
}

static void adaptive_pebs_save_regs(struct pt_regs *regs,
@@ -1912,7 +1912,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,

if (has_branch_stack(event)) {
intel_pmu_store_pebs_lbrs(lbr);
- perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
+ perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e85cd1c0eaf3..9ad79f8107cb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1138,6 +1138,10 @@ static inline bool branch_sample_priv(const struct perf_event *event)
return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_PRIV_SAVE;
}

+static inline bool branch_sample_counters(const struct perf_event *event)
+{
+ return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS;
+}

struct perf_sample_data {
/*
@@ -1172,6 +1176,7 @@ struct perf_sample_data {
struct perf_callchain_entry *callchain;
struct perf_raw_record *raw;
struct perf_branch_stack *br_stack;
+ u64 *br_stack_cntr;
union perf_sample_weight weight;
union perf_mem_data_src data_src;
u64 txn;
@@ -1249,7 +1254,8 @@ static inline void perf_sample_save_raw_data(struct perf_sample_data *data,

static inline void perf_sample_save_brstack(struct perf_sample_data *data,
struct perf_event *event,
- struct perf_branch_stack *brs)
+ struct perf_branch_stack *brs,
+ u64 *brs_cntr)
{
int size = sizeof(u64); /* nr */

@@ -1257,7 +1263,16 @@ static inline void perf_sample_save_brstack(struct perf_sample_data *data,
size += sizeof(u64);
size += brs->nr * sizeof(struct perf_branch_entry);

+ /*
+ * The extension space for counters is appended after the
+ * struct perf_branch_stack. It is used to store the occurrences
+ * of events of each branch.
+ */
+ if (brs_cntr)
+ size += brs->nr * sizeof(u64);
+
data->br_stack = brs;
+ data->br_stack_cntr = brs_cntr;
data->dyn_size += size;
data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
}
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 39c6a250dd1b..4461f380425b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -204,6 +204,8 @@ enum perf_branch_sample_type_shift {

PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT = 18, /* save privilege mode */

+ PERF_SAMPLE_BRANCH_COUNTERS_SHIFT = 19, /* save occurrences of events on a branch */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
};

@@ -235,6 +237,8 @@ enum perf_branch_sample_type {

PERF_SAMPLE_BRANCH_PRIV_SAVE = 1U << PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT,

+ PERF_SAMPLE_BRANCH_COUNTERS = 1U << PERF_SAMPLE_BRANCH_COUNTERS_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};

@@ -982,6 +986,12 @@ enum perf_event_type {
* { u64 nr;
* { u64 hw_idx; } && PERF_SAMPLE_BRANCH_HW_INDEX
* { u64 from, to, flags } lbr[nr];
+ * #
+ * # The format of the counters is decided by the
+ * # "branch_counter_nr" and "branch_counter_width",
+ * # which are defined in the ABI.
+ * #
+ * { u64 counters; } cntr[nr] && PERF_SAMPLE_BRANCH_COUNTERS
* } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 41e28f64a4a9..56b08ffeed2f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7336,6 +7336,14 @@ void perf_output_sample(struct perf_output_handle *handle,
if (branch_sample_hw_index(event))
perf_output_put(handle, data->br_stack->hw_idx);
perf_output_copy(handle, data->br_stack->entries, size);
+ /*
+ * Add the extension space which is appended
+ * right after the struct perf_branch_stack.
+ */
+ if (data->br_stack_cntr) {
+ size = data->br_stack->nr * sizeof(u64);
+ perf_output_copy(handle, data->br_stack_cntr, size);
+ }
} else {
/*
* we always store at least the value of nr
--
2.35.1


2023-10-04 18:41:31

by Liang, Kan

Subject: [PATCH V4 3/7] perf: Add branch_sample_call_stack

From: Kan Liang <[email protected]>

Add a helper function to check the call stack sample type.

A later patch will invoke the function in several places.

Signed-off-by: Kan Liang <[email protected]>
---

No changes since V3

arch/x86/events/core.c | 2 +-
include/linux/perf_event.h | 5 +++++
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 40c9af124128..09050641ce5d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -601,7 +601,7 @@ int x86_pmu_hw_config(struct perf_event *event)
}
}

- if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+ if (branch_sample_call_stack(event))
event->attach_state |= PERF_ATTACH_TASK_DATA;

/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9ad79f8107cb..826d2d632184 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1143,6 +1143,11 @@ static inline bool branch_sample_counters(const struct perf_event *event)
return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS;
}

+static inline bool branch_sample_call_stack(const struct perf_event *event)
+{
+ return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK;
+}
+
struct perf_sample_data {
/*
* Fields set by perf_sample_data_init() unconditionally,
--
2.35.1

2023-10-04 18:41:31

by Liang, Kan

Subject: [PATCH V4 6/7] perf header: Support num and width of branch counters

From: Kan Liang <[email protected]>

To support the branch counters feature, the information of the maximum
number of supported counters and the width of the counters is exposed
in the sysfs caps folder. The perf tool can use the information to parse
the logged counters in each branch.

Store the information in the perf_env for later usage.
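The name/value matching can be sketched in isolation as follows. The struct and function names here are hypothetical, and the sketch uses strtoul() for the conversion where the patch itself uses atoi().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the two capabilities (not the perf_env struct). */
struct br_cntr_caps_sketch {
	unsigned int br_cntr_nr;
	unsigned int br_cntr_width;
};

/*
 * Update the sketch struct from one "name=value" capability pair that has
 * already been split. Unrelated capability names are ignored.
 */
static void br_cntr_caps_update(struct br_cntr_caps_sketch *caps,
				const char *name, const char *value)
{
	if (!strcmp(name, "branch_counter_nr"))
		caps->br_cntr_nr = (unsigned int)strtoul(value, NULL, 10);
	else if (!strcmp(name, "branch_counter_width"))
		caps->br_cntr_width = (unsigned int)strtoul(value, NULL, 10);
}
```

Both fields stay at zero when the recorded file carries neither capability, which a consumer could treat as "feature not available".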

Signed-off-by: Kan Liang <[email protected]>
---

New patch

tools/perf/util/env.h | 5 +++++
tools/perf/util/header.c | 18 +++++++++++++++---
2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index 4566c51f2fd9..48d7f8759a2a 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -46,6 +46,9 @@ struct hybrid_node {
struct pmu_caps {
int nr_caps;
unsigned int max_branches;
+ unsigned int br_cntr_nr;
+ unsigned int br_cntr_width;
+
char **caps;
char *pmu_name;
};
@@ -62,6 +65,8 @@ struct perf_env {
unsigned long long total_mem;
unsigned int msr_pmu_type;
unsigned int max_branches;
+ unsigned int br_cntr_nr;
+ unsigned int br_cntr_width;
int kernel_is_64_bit;

int nr_cmdline;
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index d812e1e371a7..9664062ba835 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -3256,7 +3256,9 @@ static int process_compressed(struct feat_fd *ff,
}

static int __process_pmu_caps(struct feat_fd *ff, int *nr_caps,
- char ***caps, unsigned int *max_branches)
+ char ***caps, unsigned int *max_branches,
+ unsigned int *br_cntr_nr,
+ unsigned int *br_cntr_width)
{
char *name, *value, *ptr;
u32 nr_pmu_caps, i;
@@ -3291,6 +3293,12 @@ static int __process_pmu_caps(struct feat_fd *ff, int *nr_caps,
if (!strcmp(name, "branches"))
*max_branches = atoi(value);

+ if (!strcmp(name, "branch_counter_nr"))
+ *br_cntr_nr = atoi(value);
+
+ if (!strcmp(name, "branch_counter_width"))
+ *br_cntr_width = atoi(value);
+
free(value);
free(name);
}
@@ -3315,7 +3323,9 @@ static int process_cpu_pmu_caps(struct feat_fd *ff,
{
int ret = __process_pmu_caps(ff, &ff->ph->env.nr_cpu_pmu_caps,
&ff->ph->env.cpu_pmu_caps,
- &ff->ph->env.max_branches);
+ &ff->ph->env.max_branches,
+ &ff->ph->env.br_cntr_nr,
+ &ff->ph->env.br_cntr_width);

if (!ret && !ff->ph->env.cpu_pmu_caps)
pr_debug("cpu pmu capabilities not available\n");
@@ -3344,7 +3354,9 @@ static int process_pmu_caps(struct feat_fd *ff, void *data __maybe_unused)
for (i = 0; i < nr_pmu; i++) {
ret = __process_pmu_caps(ff, &pmu_caps[i].nr_caps,
&pmu_caps[i].caps,
- &pmu_caps[i].max_branches);
+ &pmu_caps[i].max_branches,
+ &pmu_caps[i].br_cntr_nr,
+ &pmu_caps[i].br_cntr_width);
if (ret)
goto err;

--
2.35.1

2023-10-04 18:41:38

by Liang, Kan

Subject: [PATCH V4 2/7] perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag

From: Kan Liang <[email protected]>

Currently, branch_sample_type != 0 is used to check whether a branch
stack setup is required. But it doesn't check the sample type, so an
unnecessary branch stack setup may be done for a counting event. E.g.,
perf record -e "{branch-instructions,branch-misses}:S" -j any
Also, an event with only the new PERF_SAMPLE_BRANCH_COUNTERS branch
sample type may not require a branch stack setup either.

Add a new flag NEEDS_BRANCH_STACK to indicate whether the event requires
a branch stack setup. Replace the needs_branch_stack() by checking the
new flag.

The counting event check is implemented here. The later patch will take
the new PERF_SAMPLE_BRANCH_COUNTERS into account.
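The counting-event check can be modelled outside the kernel roughly as below. The struct, the macro and the function names are hypothetical; the sketch assumes that a sampling event is one with a non-zero sample_period and that the old check is simply branch_sample_type != 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the 0x40000 value of the new flag in this patch. */
#define SKETCH_NEEDS_BRANCH_STACK 0x40000u

/* Hypothetical, reduced view of the fields the decision depends on. */
struct event_sketch {
	uint64_t branch_sample_type;
	uint64_t sample_period;
	unsigned int hw_flags;
};

/* Assumption: a sampling event is one with a non-zero sample period. */
static bool sketch_is_sampling_event(const struct event_sketch *e)
{
	return e->sample_period != 0;
}

/* Set the flag only when a branch type is requested AND the event samples. */
static void sketch_hw_config(struct event_sketch *e)
{
	if (e->branch_sample_type != 0 && sketch_is_sampling_event(e))
		e->hw_flags |= SKETCH_NEEDS_BRANCH_STACK;
}
```

In the perf record example above, the follower branch-misses has no sample period of its own in the :S group, so it would not get the flag under this model.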

Signed-off-by: Kan Liang <[email protected]>
---

No changes since V3

arch/x86/events/intel/core.c | 14 +++++++++++---
arch/x86/events/perf_event_flags.h | 1 +
2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 41a164764a84..a99449c0d77c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2527,9 +2527,14 @@ static void intel_pmu_assign_event(struct perf_event *event, int idx)
perf_report_aux_output_id(event, idx);
}

+static __always_inline bool intel_pmu_needs_branch_stack(struct perf_event *event)
+{
+ return event->hw.flags & PERF_X86_EVENT_NEEDS_BRANCH_STACK;
+}
+
static void intel_pmu_del_event(struct perf_event *event)
{
- if (needs_branch_stack(event))
+ if (intel_pmu_needs_branch_stack(event))
intel_pmu_lbr_del(event);
if (event->attr.precise_ip)
intel_pmu_pebs_del(event);
@@ -2820,7 +2825,7 @@ static void intel_pmu_add_event(struct perf_event *event)
{
if (event->attr.precise_ip)
intel_pmu_pebs_add(event);
- if (needs_branch_stack(event))
+ if (intel_pmu_needs_branch_stack(event))
intel_pmu_lbr_add(event);
}

@@ -3897,7 +3902,10 @@ static int intel_pmu_hw_config(struct perf_event *event)
x86_pmu.pebs_aliases(event);
}

- if (needs_branch_stack(event)) {
+ if (needs_branch_stack(event) && is_sampling_event(event))
+ event->hw.flags |= PERF_X86_EVENT_NEEDS_BRANCH_STACK;
+
+ if (intel_pmu_needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
return ret;
diff --git a/arch/x86/events/perf_event_flags.h b/arch/x86/events/perf_event_flags.h
index 1dc19b9b4426..a1685981c520 100644
--- a/arch/x86/events/perf_event_flags.h
+++ b/arch/x86/events/perf_event_flags.h
@@ -20,3 +20,4 @@ PERF_ARCH(TOPDOWN, 0x04000) /* Count Topdown slots/metrics events */
PERF_ARCH(PEBS_STLAT, 0x08000) /* st+stlat data address sampling */
PERF_ARCH(AMD_BRS, 0x10000) /* AMD Branch Sampling */
PERF_ARCH(PEBS_LAT_HYBRID, 0x20000) /* ld and st lat for hybrid */
+PERF_ARCH(NEEDS_BRANCH_STACK, 0x40000) /* require branch stack setup */
--
2.35.1

2023-10-04 18:41:39

by Liang, Kan

Subject: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

From: Kan Liang <[email protected]>

The LBR event logging introduces a per-counter indication of precise
event occurrences in LBRs. It can provide a means to attribute exposed
retirement latency to combinations of events across a block of
instructions. It also provides a means of attributing Timed LBR
latencies to events.

The feature is first introduced on SRF/GRR. It is an enhancement of the
ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences
of events on the GP counters. The information is displayed by the order
of counters.

The design proposed in this patch requires that the events which are
logged must be in a group with the event that has LBR. If there is
more than one LBR group, only the event logging information from the
current (overflowed) group is stored for the perf tool. Otherwise the
perf tool cannot know which other groups are scheduled and when,
especially when multiplexing is triggered. The user can ensure it uses
the maximum number of counters that support LBR info (4 for now) by
making the group large enough.

The HW only logs events by the order of counters. The order may be
different from the order of enabling which the perf tool can understand.
When parsing the information of each branch entry, convert the counter
order to the enabled order, and store the enabled order in the extension
space.
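The conversion can be sketched as a pure function. As a simplification, the hardware logs are taken to start at bit 0 of the input word, whereas the patch reads them from bit 32 of LBR_INFO; the function and macro names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_LOG_WIDTH 2	/* bits per logged counter */
#define SKETCH_LOG_MASK  0x3ULL

/*
 * hw_logs:    logs packed by hardware counter index (counter 0 in bits 1:0).
 * enabled[]:  hardware counter index of each logged event, in the order the
 *             events were enabled in the group.
 * Returns the logs repacked so that position 0 is the first enabled event.
 */
static uint64_t sketch_reorder_logs(uint64_t hw_logs, const int *enabled,
				    int nr_enabled)
{
	uint64_t out = 0;

	for (int pos = 0; pos < nr_enabled; pos++) {
		uint64_t log = (hw_logs >> (enabled[pos] * SKETCH_LOG_WIDTH)) &
			       SKETCH_LOG_MASK;

		out |= log << (pos * SKETCH_LOG_WIDTH);
	}

	return out;
}
```

For example, if the first enabled event landed on counter 2 and the second on counter 0, the log of counter 2 ends up in the lowest two bits of the result.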

Unconditionally reset LBRs for an LBR event group when it's deleted. The
logged events' occurrences information is only valid for the current LBR
group. If another LBR group is scheduled later, the information from the
stale LBRs would otherwise be wrongly interpreted.

Add a sanity check in intel_pmu_hw_config(). Disable the feature if other
counter filters (inv, cmask, edge, in_tx) are set or LBR call stack mode
is enabled. (For the LBR call stack mode, we cannot simply flush the
LBR, since it will break the call stack. Also, there is no obvious usage
with the call stack mode for now.)

Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't require any branch
stack setup.
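The corresponding mask test can be written as a small predicate. The constants below are local mirrors for illustration: the privilege-level mask is assumed to be the user, kernel and hv bits (bits 0 to 2), and the counters bit is bit 19 as defined earlier in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the uapi values, for illustration only. */
#define SKETCH_BRANCH_PLM_ALL	0x7ULL		/* user | kernel | hv */
#define SKETCH_BRANCH_COUNTERS	(1ULL << 19)

/*
 * True when nothing beyond privilege-level bits and the counters bit is
 * requested, i.e. the event itself does not ask for any branch records.
 */
static bool sketch_only_counters_requested(uint64_t branch_sample_type)
{
	return (branch_sample_type &
		~(SKETCH_BRANCH_PLM_ALL | SKETCH_BRANCH_COUNTERS)) == 0;
}
```

An event for which this predicate holds is only marked as logged; the LBR itself is set up by another event in the group.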

Expose the maximum number of supported counters and the width of the
counters into the sysfs. The perf tool can use the information to parse
the logged counters in each branch.

Signed-off-by: Kan Liang <[email protected]>
---

Changes since V3
- Support the "branch_counter_nr" and "branch_counter_width"
- Support the PERF_SAMPLE_BRANCH_COUNTERS

arch/x86/events/intel/core.c | 91 +++++++++++++++++++++++++++--
arch/x86/events/intel/ds.c | 2 +-
arch/x86/events/intel/lbr.c | 94 +++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 12 ++++
arch/x86/events/perf_event_flags.h | 1 +
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/perf_event.h | 4 ++
7 files changed, 198 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a99449c0d77c..5557310d430a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2792,6 +2792,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event)

static void intel_pmu_enable_event(struct perf_event *event)
{
+ u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;

@@ -2800,8 +2801,10 @@ static void intel_pmu_enable_event(struct perf_event *event)

switch (idx) {
case 0 ... INTEL_PMC_IDX_FIXED - 1:
+ if (branch_sample_counters(event))
+ enable_mask |= ARCH_PERFMON_EVENTSEL_LBR_LOG;
intel_set_masks(event, idx);
- __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
+ __x86_pmu_enable_event(hwc, enable_mask);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
@@ -3052,7 +3055,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
perf_sample_data_init(&data, 0, event->hw.last_period);

if (has_branch_stack(event))
- perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);
+ intel_pmu_lbr_save_brstack(&data, cpuc, event);

if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
@@ -3617,6 +3620,13 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
if (cpuc->excl_cntrs)
return intel_get_excl_constraints(cpuc, event, idx, c2);

+ /* The LBR event logging may not be available for all counters. */
+ if (branch_sample_counters(event)) {
+ c2 = dyn_constraint(cpuc, c2, idx);
+ c2->idxmsk64 &= x86_pmu.lbr_events;
+ c2->weight = hweight64(c2->idxmsk64);
+ }
+
return c2;
}

@@ -3905,6 +3915,44 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (needs_branch_stack(event) && is_sampling_event(event))
event->hw.flags |= PERF_X86_EVENT_NEEDS_BRANCH_STACK;

+ if (branch_sample_counters(event)) {
+ struct perf_event *leader, *sibling;
+
+ if (!(x86_pmu.flags & PMU_FL_LBR_EVENT) ||
+ (event->attr.config & ~INTEL_ARCH_EVENT_MASK))
+ return -EINVAL;
+
+ /*
+ * The event logging is not supported in the call stack mode
+ * yet, since we cannot simply flush the LBR during e.g.,
+ * multiplexing. Also, there is no obvious usage with the call
+ * stack mode. Simply forbids it for now.
+ *
+ * If any events in the group enable the LBR event logging
+ * feature, the group is treated as a LBR event logging group,
+ * which requires the extra space to store the counters.
+ */
+ leader = event->group_leader;
+ if (branch_sample_call_stack(leader))
+ return -EINVAL;
+ leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
+
+ for_each_sibling_event(sibling, leader) {
+ if (branch_sample_call_stack(sibling))
+ return -EINVAL;
+ }
+
+ /*
+ * Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't
+ * require any branch stack setup.
+ * Clear the bit to avoid unnecessary branch stack setup.
+ */
+ if (0 == (event->attr.branch_sample_type &
+ ~(PERF_SAMPLE_BRANCH_PLM_ALL |
+ PERF_SAMPLE_BRANCH_COUNTERS)))
+ event->hw.flags &= ~PERF_X86_EVENT_NEEDS_BRANCH_STACK;
+ }
+
if (intel_pmu_needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
@@ -4383,8 +4431,13 @@ cmt_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
*/
if (event->attr.precise_ip == 3) {
/* Force instruction:ppp on PMC0, 1 and Fixed counter 0 */
- if (constraint_match(&fixed0_constraint, event->hw.config))
- return &fixed0_counter0_1_constraint;
+ if (constraint_match(&fixed0_constraint, event->hw.config)) {
+ /* The fixed counter 0 doesn't support LBR event logging. */
+ if (branch_sample_counters(event))
+ return &counter0_1_constraint;
+ else
+ return &fixed0_counter0_1_constraint;
+ }

switch (c->idxmsk64 & 0x3ull) {
case 0x1:
@@ -4563,7 +4616,7 @@ int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
goto err;
}

- if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA)) {
+ if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA | PMU_FL_LBR_EVENT)) {
size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);

cpuc->constraint_list = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
@@ -5535,8 +5588,30 @@ static ssize_t branches_show(struct device *cdev,

static DEVICE_ATTR_RO(branches);

+static ssize_t branch_counter_nr_show(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "%d\n", fls(x86_pmu.lbr_events));
+}
+
+static DEVICE_ATTR_RO(branch_counter_nr);
+
+static ssize_t branch_counter_width_show(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "2\n");
+}
+
+static DEVICE_ATTR_RO(branch_counter_width);
+
+
+
static struct attribute *lbr_attrs[] = {
&dev_attr_branches.attr,
+ &dev_attr_branch_counter_nr.attr,
+ &dev_attr_branch_counter_width.attr,
NULL
};

@@ -5590,7 +5665,11 @@ mem_is_visible(struct kobject *kobj, struct attribute *attr, int i)
static umode_t
lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
- return x86_pmu.lbr_nr ? attr->mode : 0;
+ /* branches */
+ if (i == 0)
+ return x86_pmu.lbr_nr ? attr->mode : 0;
+
+ return (x86_pmu.flags & PMU_FL_LBR_EVENT) ? attr->mode : 0;
}

static umode_t
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index cb3f329f8fa4..d49d661ec0a7 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1912,7 +1912,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,

if (has_branch_stack(event)) {
intel_pmu_store_pebs_lbrs(lbr);
- perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
+ intel_pmu_lbr_save_brstack(data, cpuc, event);
}
}

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index c3b0d15a9841..1e80a551a4c2 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -676,6 +676,21 @@ void intel_pmu_lbr_del(struct perf_event *event)
WARN_ON_ONCE(cpuc->lbr_users < 0);
WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
perf_sched_cb_dec(event->pmu);
+
+ /*
+ * The logged occurrences information is only valid for the
+ * current LBR group. If another LBR group is scheduled in
+ * later, the information from the stale LBRs will be wrongly
+ * interpreted. Reset the LBRs here.
+ * For the context switch, the LBR will be unconditionally
+ * flushed when a new task is scheduled in. If both the new task
+ * and the old task are monitored by a LBR event group, the
+ * reset here is redundant. But the extra reset doesn't impact
+ * the functionality. It's hard to distinguish the above case.
+ * Keep the unconditional reset for a LBR event group for now.
+ */
+ if (is_branch_counters_group(event))
+ intel_pmu_lbr_reset();
}

static inline bool vlbr_exclude_host(void)
@@ -866,6 +881,18 @@ static __always_inline u16 get_lbr_cycles(u64 info)
return cycles;
}

+static __always_inline void get_lbr_events(struct cpu_hw_events *cpuc,
+ int i, u64 info)
+{
+ /*
+ * The later code will decide what content can be disclosed
+ * to the perf tool. It's harmless to unconditionally update
+ * the cpuc->lbr_events.
+ * Please see intel_pmu_lbr_event_reorder()
+ */
+ cpuc->lbr_events[i] = info & LBR_INFO_EVENTS;
+}
+
static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
struct lbr_entry *entries)
{
@@ -898,11 +925,70 @@ static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
e->abort = !!(info & LBR_INFO_ABORT);
e->cycles = get_lbr_cycles(info);
e->type = get_lbr_br_type(info);
+
+ get_lbr_events(cpuc, i, info);
}

cpuc->lbr_stack.nr = i;
}

+#define ARCH_LBR_EVENT_LOG_WIDTH 2
+#define ARCH_LBR_EVENT_LOG_MASK 0x3
+
+static __always_inline void intel_pmu_update_lbr_event(u64 *lbr_events, int idx, int pos)
+{
+ u64 logs = *lbr_events >> (LBR_INFO_EVENTS_OFFSET +
+ idx * ARCH_LBR_EVENT_LOG_WIDTH);
+
+ logs &= ARCH_LBR_EVENT_LOG_MASK;
+ *lbr_events |= logs << (pos * ARCH_LBR_EVENT_LOG_WIDTH);
+}
+
+/*
+ * The enabled order may be different from the counter order.
+ * Update the lbr_events with the enabled order.
+ */
+static void intel_pmu_lbr_event_reorder(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ int i, j, pos = 0, enabled[X86_PMC_IDX_MAX];
+ struct perf_event *leader, *sibling;
+
+ leader = event->group_leader;
+ if (branch_sample_counters(leader))
+ enabled[pos++] = leader->hw.idx;
+
+ for_each_sibling_event(sibling, leader) {
+ if (!branch_sample_counters(sibling))
+ continue;
+ enabled[pos++] = sibling->hw.idx;
+ }
+
+ if (!pos)
+ return;
+
+ for (i = 0; i < cpuc->lbr_stack.nr; i++) {
+ for (j = 0; j < pos; j++)
+ intel_pmu_update_lbr_event(&cpuc->lbr_events[i], enabled[j], j);
+
+ /* Clear the original counter order */
+ cpuc->lbr_events[i] &= ~LBR_INFO_EVENTS;
+ }
+}
+
+void intel_pmu_lbr_save_brstack(struct perf_sample_data *data,
+ struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ if (is_branch_counters_group(event)) {
+ intel_pmu_lbr_event_reorder(cpuc, event);
+ perf_sample_save_brstack(data, event, &cpuc->lbr_stack, cpuc->lbr_events);
+ return;
+ }
+
+ perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
+}
+
static void intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc)
{
intel_pmu_store_lbr(cpuc, NULL);
@@ -1173,8 +1259,10 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
for (i = 0; i < cpuc->lbr_stack.nr; ) {
if (!cpuc->lbr_entries[i].from) {
j = i;
- while (++j < cpuc->lbr_stack.nr)
+ while (++j < cpuc->lbr_stack.nr) {
cpuc->lbr_entries[j-1] = cpuc->lbr_entries[j];
+ cpuc->lbr_events[j-1] = cpuc->lbr_events[j];
+ }
cpuc->lbr_stack.nr--;
if (!cpuc->lbr_entries[i].from)
continue;
@@ -1525,8 +1613,12 @@ void __init intel_pmu_arch_lbr_init(void)
x86_pmu.lbr_mispred = ecx.split.lbr_mispred;
x86_pmu.lbr_timed_lbr = ecx.split.lbr_timed_lbr;
x86_pmu.lbr_br_type = ecx.split.lbr_br_type;
+ x86_pmu.lbr_events = ecx.split.lbr_events;
x86_pmu.lbr_nr = lbr_nr;

+ if (!!x86_pmu.lbr_events)
+ x86_pmu.flags |= PMU_FL_LBR_EVENT;
+
if (x86_pmu.lbr_mispred)
static_branch_enable(&x86_lbr_mispred);
if (x86_pmu.lbr_timed_lbr)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 53dd5d495ba6..4f0722a1be76 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -110,6 +110,11 @@ static inline bool is_topdown_event(struct perf_event *event)
return is_metric_event(event) || is_slots_event(event);
}

+static inline bool is_branch_counters_group(struct perf_event *event)
+{
+ return event->group_leader->hw.flags & PERF_X86_EVENT_BRANCH_COUNTERS;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
@@ -283,6 +288,7 @@ struct cpu_hw_events {
int lbr_pebs_users;
struct perf_branch_stack lbr_stack;
struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
+ u64 lbr_events[MAX_LBR_ENTRIES]; /* branch stack extra */
union {
struct er_account *lbr_sel;
struct er_account *lbr_ctl;
@@ -888,6 +894,7 @@ struct x86_pmu {
unsigned int lbr_mispred:1;
unsigned int lbr_timed_lbr:1;
unsigned int lbr_br_type:1;
+ unsigned int lbr_events:4;

void (*lbr_reset)(void);
void (*lbr_read)(struct cpu_hw_events *cpuc);
@@ -1012,6 +1019,7 @@ do { \
#define PMU_FL_INSTR_LATENCY 0x80 /* Support Instruction Latency in PEBS Memory Info Record */
#define PMU_FL_MEM_LOADS_AUX 0x100 /* Require an auxiliary event for the complete memory info */
#define PMU_FL_RETIRE_LATENCY 0x200 /* Support Retire Latency in PEBS */
+#define PMU_FL_LBR_EVENT 0x400 /* Support LBR event logging */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
@@ -1552,6 +1560,10 @@ void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);

void intel_ds_init(void);

+void intel_pmu_lbr_save_brstack(struct perf_sample_data *data,
+ struct cpu_hw_events *cpuc,
+ struct perf_event *event);
+
void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
struct perf_event_pmu_context *next_epc);

diff --git a/arch/x86/events/perf_event_flags.h b/arch/x86/events/perf_event_flags.h
index a1685981c520..6c977c19f2cd 100644
--- a/arch/x86/events/perf_event_flags.h
+++ b/arch/x86/events/perf_event_flags.h
@@ -21,3 +21,4 @@ PERF_ARCH(PEBS_STLAT, 0x08000) /* st+stlat data address sampling */
PERF_ARCH(AMD_BRS, 0x10000) /* AMD Branch Sampling */
PERF_ARCH(PEBS_LAT_HYBRID, 0x20000) /* ld and st lat for hybrid */
PERF_ARCH(NEEDS_BRANCH_STACK, 0x40000) /* require branch stack setup */
+PERF_ARCH(BRANCH_COUNTERS, 0x80000) /* logs the counters in the extra space of each branch */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..7306b70f21ac 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -236,6 +236,8 @@
#define LBR_INFO_CYCLES 0xffff
#define LBR_INFO_BR_TYPE_OFFSET 56
#define LBR_INFO_BR_TYPE (0xfull << LBR_INFO_BR_TYPE_OFFSET)
+#define LBR_INFO_EVENTS_OFFSET 32
+#define LBR_INFO_EVENTS (0xffull << LBR_INFO_EVENTS_OFFSET)

#define MSR_ARCH_LBR_CTL 0x000014ce
#define ARCH_LBR_CTL_LBREN BIT(0)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 85a9fd5a3ec3..7677605a39ef 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -31,6 +31,7 @@
#define ARCH_PERFMON_EVENTSEL_ENABLE (1ULL << 22)
#define ARCH_PERFMON_EVENTSEL_INV (1ULL << 23)
#define ARCH_PERFMON_EVENTSEL_CMASK 0xFF000000ULL
+#define ARCH_PERFMON_EVENTSEL_LBR_LOG (1ULL << 35)

#define INTEL_FIXED_BITS_MASK 0xFULL
#define INTEL_FIXED_BITS_STRIDE 4
@@ -216,6 +217,9 @@ union cpuid28_ecx {
unsigned int lbr_timed_lbr:1;
/* Branch Type Field Supported */
unsigned int lbr_br_type:1;
+ unsigned int reserved:13;
+ /* Event Logging Supported */
+ unsigned int lbr_events:4;
} split;
unsigned int full;
};
--
2.35.1

2023-10-04 18:42:32

by Liang, Kan

Subject: [PATCH V4 7/7] perf tools: Add branch counter knob

From: Kan Liang <[email protected]>

Add a new branch filter, "counter", for the branch counter option. It is
used to mark the events which should be logged in the branch. If it is
applied with the -j option, the counters of all the events should be
logged in the branch. If a legacy kernel doesn't support the new
branch sample type, the branch counter filter is switched off.

The stored counter values in each branch are displayed right after the
regular branch stack information via perf report -D.

Usage examples:

perf record -e "{branch-instructions,branch-misses}:S" -j any,counter

Only the first event, branch-instructions, collects the LBR. Both
branch-instructions and branch-misses are marked as logged events.
Their occurrence counts can be found in the branch stack
extension space of each branch.

perf record -e "{cpu/branch-instructions,branch_type=any/,
cpu/branch-misses,branch_type=counter/}"

Only the first event, branch-instructions, collects the LBR. Only the
branch-misses event is marked as a logged event.

Signed-off-by: Kan Liang <[email protected]>
---

Changes since V3:
- Rename the filter name to "counter".
- Dump the "branch_counter_nr" and "branch_counter_width" in perf report
- Support PERF_SAMPLE_BRANCH_COUNTERS

tools/perf/Documentation/perf-record.txt | 4 +++
tools/perf/util/evsel.c | 31 ++++++++++++++++++++++-
tools/perf/util/evsel.h | 1 +
tools/perf/util/parse-branch-options.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 1 +
tools/perf/util/session.c | 15 +++++++++--
7 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index d5217be012d7..b6afe7cc948d 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -442,6 +442,10 @@ following filters are defined:
4th-Gen Xeon+ server), the save branch type is unconditionally enabled
when the taken branch stack sampling is enabled.
- priv: save privilege state during sampling in case binary is not available later
+ - counter: save occurrences of the event since the last branch entry. Currently, the
+ feature is only supported by a newer CPU, e.g., Intel Sierra Forest and
+ later platforms. An error out is expected if it's used on the unsupported
+ kernel or CPUs.

+
The option requires at least one branch type among any, any_call, any_ret, ind_call, cond.
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index a8a5ff87cc1f..58a9b8c82790 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1831,6 +1831,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,

static void evsel__disable_missing_features(struct evsel *evsel)
{
+ if (perf_missing_features.branch_counters)
+ evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_COUNTERS;
if (perf_missing_features.read_lost)
evsel->core.attr.read_format &= ~PERF_FORMAT_LOST;
if (perf_missing_features.weight_struct) {
@@ -1884,7 +1886,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.read_lost &&
+ if (!perf_missing_features.branch_counters &&
+ (evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS)) {
+ perf_missing_features.branch_counters = true;
+ pr_debug2("switching off branch counters support\n");
+ return true;
+ } else if (!perf_missing_features.read_lost &&
(evsel->core.attr.read_format & PERF_FORMAT_LOST)) {
perf_missing_features.read_lost = true;
pr_debug2("switching off PERF_FORMAT_LOST support\n");
@@ -2344,6 +2351,18 @@ u64 evsel__bitfield_swap_branch_flags(u64 value)
return new_val;
}

+static inline bool evsel__has_branch_counters(const struct evsel *evsel)
+{
+ struct evsel *cur, *leader = evsel__leader(evsel);
+
+ evlist__for_each_entry(evsel->evlist, cur) {
+ if ((leader == evsel__leader(cur)) &&
+ (cur->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS))
+ return true;
+ }
+ return false;
+}
+
int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
struct perf_sample *data)
{
@@ -2577,6 +2596,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,

OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
+
+ if (evsel__has_branch_counters(evsel)) {
+ OVERFLOW_CHECK_u64(array);
+
+ data->branch_stack_cntr = (u64 *)array;
+ sz = data->branch_stack->nr * sizeof(u64);
+
+ OVERFLOW_CHECK(array, sz, max_size);
+ array = (void *)array + sz;
+ }
}

if (type & PERF_SAMPLE_REGS_USER) {
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 848534ec74fa..85f24c986392 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -191,6 +191,7 @@ struct perf_missing_features {
bool code_page_size;
bool weight_struct;
bool read_lost;
+ bool branch_counters;
};

extern struct perf_missing_features perf_missing_features;
diff --git a/tools/perf/util/parse-branch-options.c b/tools/perf/util/parse-branch-options.c
index fd67d204d720..f7f7aff3d85a 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -36,6 +36,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("stack", PERF_SAMPLE_BRANCH_CALL_STACK),
BRANCH_OPT("hw_index", PERF_SAMPLE_BRANCH_HW_INDEX),
BRANCH_OPT("priv", PERF_SAMPLE_BRANCH_PRIV_SAVE),
+ BRANCH_OPT("counter", PERF_SAMPLE_BRANCH_COUNTERS),
BRANCH_END
};

diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 2247991451f3..8f04d3b7f3ec 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -55,6 +55,7 @@ static void __p_branch_sample_type(char *buf, size_t size, u64 value)
bit_name(COND), bit_name(CALL_STACK), bit_name(IND_JUMP),
bit_name(CALL), bit_name(NO_FLAGS), bit_name(NO_CYCLES),
bit_name(TYPE_SAVE), bit_name(HW_INDEX), bit_name(PRIV_SAVE),
+ bit_name(COUNTERS),
{ .name = NULL, }
};
#undef bit_name
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index c92ad0f51ecd..70b2c3135555 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -113,6 +113,7 @@ struct perf_sample {
void *raw_data;
struct ip_callchain *callchain;
struct branch_stack *branch_stack;
+ u64 *branch_stack_cntr;
struct regs_dump user_regs;
struct regs_dump intr_regs;
struct stack_dump user_stack;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 1e9aa8ed15b6..4a094ab0362b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1150,9 +1150,13 @@ static void callchain__printf(struct evsel *evsel,
i, callchain->ips[i]);
}

-static void branch_stack__printf(struct perf_sample *sample, bool callstack)
+static void branch_stack__printf(struct perf_sample *sample,
+ struct evsel *evsel)
{
struct branch_entry *entries = perf_sample__branch_entries(sample);
+ bool callstack = evsel__has_branch_callstack(evsel);
+ u64 *branch_stack_cntr = sample->branch_stack_cntr;
+ struct perf_env *env = evsel__env(evsel);
uint64_t i;

if (!callstack) {
@@ -1194,6 +1198,13 @@ static void branch_stack__printf(struct perf_sample *sample, bool callstack)
}
}
}
+
+ if (branch_stack_cntr) {
+ printf("... branch stack counters: nr:%" PRIu64 " (counter width: %u max counter nr:%u)\n",
+ sample->branch_stack->nr, env->br_cntr_width, env->br_cntr_nr);
+ for (i = 0; i < sample->branch_stack->nr; i++)
+ printf("..... %2"PRIu64": %016" PRIx64 "\n", i, branch_stack_cntr[i]);
+ }
}

static void regs_dump__printf(u64 mask, u64 *regs, const char *arch)
@@ -1355,7 +1366,7 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
callchain__printf(evsel, sample);

if (evsel__has_br_stack(evsel))
- branch_stack__printf(sample, evsel__has_branch_callstack(evsel));
+ branch_stack__printf(sample, evsel);

if (sample_type & PERF_SAMPLE_REGS_USER)
regs_user__printf(sample, arch);
--
2.35.1
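
For readers following the evsel__parse_sample() hunk above, here is a minimal userspace sketch of the same walk over the branch stack part of a sample: nr, an optional hw_idx, nr entries of three u64 each (from, to, flags), then nr u64 counters when branch counters are present. The names brstack_view and brstack_parse() are hypothetical and not part of perf; has_hw_idx and has_counters are flags the caller supplies, standing in for the attribute checks in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the branch stack region of one sample. */
struct brstack_view {
	uint64_t nr;              /* number of branch entries */
	const uint64_t *entries;  /* nr * 3 words: from, to, flags */
	const uint64_t *counters; /* nr words, or NULL if not present */
	size_t consumed;          /* words consumed from the buffer */
};

/* Returns 0 on success, -1 if the buffer is too short for the layout. */
static inline int brstack_parse(const uint64_t *buf, size_t len_words,
				int has_hw_idx, int has_counters,
				struct brstack_view *out)
{
	size_t pos = 0;

	assert(buf && out);

	if (len_words < 1)
		return -1;
	out->nr = buf[pos++];

	if (has_hw_idx) {
		if (pos >= len_words)
			return -1;
		pos++; /* skip hw_idx */
	}

	/* Division avoids overflow when nr is bogus. */
	if (out->nr > (len_words - pos) / 3)
		return -1;
	out->entries = buf + pos;
	pos += out->nr * 3;

	out->counters = NULL;
	if (has_counters) {
		if (out->nr > len_words - pos)
			return -1;
		out->counters = buf + pos;
		pos += out->nr;
	}

	out->consumed = pos;
	return 0;
}
```

The bounds checks play the role of OVERFLOW_CHECK() in the patch: the counters pointer is only handed out after the whole nr * sizeof(u64) region is known to fit.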

2023-10-04 18:44:20

by Liang, Kan

Subject: [PATCH V4 5/7] tools headers UAPI: Sync include/uapi/linux/perf_event.h header with the kernel

From: Kan Liang <[email protected]>

Sync the new sample type for the branch counters feature.

Signed-off-by: Kan Liang <[email protected]>
---

Changes since V3:
- Support PERF_SAMPLE_BRANCH_COUNTERS

tools/include/uapi/linux/perf_event.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 39c6a250dd1b..4461f380425b 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -204,6 +204,8 @@ enum perf_branch_sample_type_shift {

PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT = 18, /* save privilege mode */

+ PERF_SAMPLE_BRANCH_COUNTERS_SHIFT = 19, /* save occurrences of events on a branch */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
};

@@ -235,6 +237,8 @@ enum perf_branch_sample_type {

PERF_SAMPLE_BRANCH_PRIV_SAVE = 1U << PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT,

+ PERF_SAMPLE_BRANCH_COUNTERS = 1U << PERF_SAMPLE_BRANCH_COUNTERS_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};

@@ -982,6 +986,12 @@ enum perf_event_type {
* { u64 nr;
* { u64 hw_idx; } && PERF_SAMPLE_BRANCH_HW_INDEX
* { u64 from, to, flags } lbr[nr];
+ * #
+ * # The format of the counters is decided by the
+ * # "branch_counter_nr" and "branch_counter_width",
+ * # which are defined in the ABI.
+ * #
+ * { u64 counters; } cntr[nr] && PERF_SAMPLE_BRANCH_COUNTERS
* } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
--
2.35.1

2023-10-16 17:48:40

by Liang, Kan

Subject: Re: [PATCH V4 1/7] perf: Add branch stack counters

Hi Peter,

Could you please share your comments for this series?

Thanks,
Kan

On 2023-10-04 2:40 p.m., [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Currently, the additional information of a branch entry is stored in a
> u64 space. With more and more information added, the space is running
> out. For example, the information of occurrences of events will be added
> for each branch.
>
> Two places were suggested to append the counters.
> https://lore.kernel.org/lkml/[email protected]/
> One place is right after the flags of each branch entry. It changes the
> existing struct perf_branch_entry. The later ARCH specific
> implementation has to be really careful to consistently pick
> the right struct.
> The other place is right after the entire struct perf_branch_stack.
> The disadvantage is that the pointer of the extra space has to be
> recorded. The common interface perf_sample_save_brstack() has to be
> updated.
>
> The latter is much straightforward, and should be easily understood and
> maintained. It is implemented in the patch.
>
> Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate
> the event which is recorded in the branch info.
>
> The "u64 counters" may store the occurrences of several events. The
> information regarding the number of events/counters and the width of
> each counter should be exposed via sysfs as a reference for the perf
> tool. Define the branch_counter_nr and branch_counter_width ABI here.
> The support will be implemented later in the Intel-specific patch.
>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> Cc: Sandipan Das <[email protected]>
> Cc: Ravi Bangoria <[email protected]>
> Cc: Athira Rajeev <[email protected]>
> ---
>
> Changes since V3:
> - Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS
> Drop the two branch sample type in V2.
> - Add the branch_counter_nr and branch_counter_width ABI
>
> .../testing/sysfs-bus-event_source-devices-caps | 6 ++++++
> arch/powerpc/perf/core-book3s.c | 2 +-
> arch/x86/events/amd/core.c | 2 +-
> arch/x86/events/core.c | 2 +-
> arch/x86/events/intel/core.c | 2 +-
> arch/x86/events/intel/ds.c | 4 ++--
> include/linux/perf_event.h | 17 ++++++++++++++++-
> include/uapi/linux/perf_event.h | 10 ++++++++++
> kernel/events/core.c | 8 ++++++++
> 9 files changed, 46 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
> index 8757dcf41c08..451f0c620aa7 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
> @@ -16,3 +16,9 @@ Description:
> Example output in powerpc:
> grep . /sys/bus/event_source/devices/cpu/caps/*
> /sys/bus/event_source/devices/cpu/caps/pmu_name:POWER9
> +
> + The "branch_counter_nr" in the supported platform exposes the
> + maximum number of counters which can be shown in the u64 counters
> + of PERF_SAMPLE_BRANCH_COUNTERS, while the "branch_counter_width"
> + exposes the width of each counter. Both of them can be used by
> + the perf tool to parse the logged counters in each branch.
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 8c1f7def596e..3c14596bbfaf 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2313,7 +2313,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
> struct cpu_hw_events *cpuhw;
> cpuhw = this_cpu_ptr(&cpu_hw_events);
> power_pmu_bhrb_read(event, cpuhw);
> - perf_sample_save_brstack(&data, event, &cpuhw->bhrb_stack);
> + perf_sample_save_brstack(&data, event, &cpuhw->bhrb_stack, NULL);
> }
>
> if (event->attr.sample_type & PERF_SAMPLE_DATA_SRC &&
> diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> index e24976593a29..4ee6390b45c9 100644
> --- a/arch/x86/events/amd/core.c
> +++ b/arch/x86/events/amd/core.c
> @@ -940,7 +940,7 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> continue;
>
> if (has_branch_stack(event))
> - perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
> + perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);
>
> if (perf_event_overflow(event, &data, regs))
> x86_pmu_stop(event, 0);
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 40ad1425ffa2..40c9af124128 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1702,7 +1702,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
> perf_sample_data_init(&data, 0, event->hw.last_period);
>
> if (has_branch_stack(event))
> - perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
> + perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);
>
> if (perf_event_overflow(event, &data, regs))
> x86_pmu_stop(event, 0);
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a08f794a0e79..41a164764a84 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3047,7 +3047,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
> perf_sample_data_init(&data, 0, event->hw.last_period);
>
> if (has_branch_stack(event))
> - perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
> + perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);
>
> if (perf_event_overflow(event, &data, regs))
> x86_pmu_stop(event, 0);
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index bf97ab904d40..cb3f329f8fa4 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1755,7 +1755,7 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
> setup_pebs_time(event, data, pebs->tsc);
>
> if (has_branch_stack(event))
> - perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
> + perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
> }
>
> static void adaptive_pebs_save_regs(struct pt_regs *regs,
> @@ -1912,7 +1912,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
>
> if (has_branch_stack(event)) {
> intel_pmu_store_pebs_lbrs(lbr);
> - perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
> + perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
> }
> }
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e85cd1c0eaf3..9ad79f8107cb 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1138,6 +1138,10 @@ static inline bool branch_sample_priv(const struct perf_event *event)
> return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_PRIV_SAVE;
> }
>
> +static inline bool branch_sample_counters(const struct perf_event *event)
> +{
> + return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS;
> +}
>
> struct perf_sample_data {
> /*
> @@ -1172,6 +1176,7 @@ struct perf_sample_data {
> struct perf_callchain_entry *callchain;
> struct perf_raw_record *raw;
> struct perf_branch_stack *br_stack;
> + u64 *br_stack_cntr;
> union perf_sample_weight weight;
> union perf_mem_data_src data_src;
> u64 txn;
> @@ -1249,7 +1254,8 @@ static inline void perf_sample_save_raw_data(struct perf_sample_data *data,
>
> static inline void perf_sample_save_brstack(struct perf_sample_data *data,
> struct perf_event *event,
> - struct perf_branch_stack *brs)
> + struct perf_branch_stack *brs,
> + u64 *brs_cntr)
> {
> int size = sizeof(u64); /* nr */
>
> @@ -1257,7 +1263,16 @@ static inline void perf_sample_save_brstack(struct perf_sample_data *data,
> size += sizeof(u64);
> size += brs->nr * sizeof(struct perf_branch_entry);
>
> + /*
> + * The extension space for counters is appended after the
> + * struct perf_branch_stack. It is used to store the occurrences
> + * of events of each branch.
> + */
> + if (brs_cntr)
> + size += brs->nr * sizeof(u64);
> +
> data->br_stack = brs;
> + data->br_stack_cntr = brs_cntr;
> data->dyn_size += size;
> data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
> }
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 39c6a250dd1b..4461f380425b 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -204,6 +204,8 @@ enum perf_branch_sample_type_shift {
>
> PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT = 18, /* save privilege mode */
>
> + PERF_SAMPLE_BRANCH_COUNTERS_SHIFT = 19, /* save occurrences of events on a branch */
> +
> PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
> };
>
> @@ -235,6 +237,8 @@ enum perf_branch_sample_type {
>
> PERF_SAMPLE_BRANCH_PRIV_SAVE = 1U << PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT,
>
> + PERF_SAMPLE_BRANCH_COUNTERS = 1U << PERF_SAMPLE_BRANCH_COUNTERS_SHIFT,
> +
> PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
> };
>
> @@ -982,6 +986,12 @@ enum perf_event_type {
> * { u64 nr;
> * { u64 hw_idx; } && PERF_SAMPLE_BRANCH_HW_INDEX
> * { u64 from, to, flags } lbr[nr];
> + * #
> + * # The format of the counters is decided by the
> + * # "branch_counter_nr" and "branch_counter_width",
> + * # which are defined in the ABI.
> + * #
> + * { u64 counters; } cntr[nr] && PERF_SAMPLE_BRANCH_COUNTERS
> * } && PERF_SAMPLE_BRANCH_STACK
> *
> * { u64 abi; # enum perf_sample_regs_abi
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 41e28f64a4a9..56b08ffeed2f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7336,6 +7336,14 @@ void perf_output_sample(struct perf_output_handle *handle,
> if (branch_sample_hw_index(event))
> perf_output_put(handle, data->br_stack->hw_idx);
> perf_output_copy(handle, data->br_stack->entries, size);
> + /*
> + * Add the extension space which is appended
> + * right after the struct perf_branch_stack.
> + */
> + if (data->br_stack_cntr) {
> + size = data->br_stack->nr * sizeof(u64);
> + perf_output_copy(handle, data->br_stack_cntr, size);
> + }
> } else {
> /*
> * we always store at least the value of nr

2023-10-19 09:26:50

by Peter Zijlstra

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:

> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index c3b0d15a9841..1e80a551a4c2 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -676,6 +676,21 @@ void intel_pmu_lbr_del(struct perf_event *event)
> WARN_ON_ONCE(cpuc->lbr_users < 0);
> WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
> perf_sched_cb_dec(event->pmu);
> +
> + /*
> + * The logged occurrences information is only valid for the
> + * current LBR group. If another LBR group is scheduled in
> + * later, the information from the stale LBRs will be wrongly
> + * interpreted. Reset the LBRs here.
> + * For the context switch, the LBR will be unconditionally
> + * flushed when a new task is scheduled in. If both the new task
> + * and the old task are monitored by a LBR event group. The
> + * reset here is redundant. But the extra reset doesn't impact
> + * the functionality. It's hard to distinguish the above case.
> + * Keep the unconditionally reset for a LBR event group for now.
> + */

I found this really hard to read, also should this not rely on
!cpuc->lbr_users ?

As is, you'll reset the lbr for every event in the group.

> + if (is_branch_counters_group(event))
> + intel_pmu_lbr_reset();
> }

2023-10-19 09:29:10

by Peter Zijlstra

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:

> +
> static struct attribute *lbr_attrs[] = {
> &dev_attr_branches.attr,
> + &dev_attr_branch_counter_nr.attr,
> + &dev_attr_branch_counter_width.attr,
> NULL
> };
>
> @@ -5590,7 +5665,11 @@ mem_is_visible(struct kobject *kobj, struct attribute *attr, int i)
> static umode_t
> lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
> {
> - return x86_pmu.lbr_nr ? attr->mode : 0;
> + /* branches */
> + if (i == 0)
> + return x86_pmu.lbr_nr ? attr->mode : 0;
> +
> + return (x86_pmu.flags & PMU_FL_LBR_EVENT) ? attr->mode : 0;
> }

As in the patch this is fairly readable, but I just checked and in the
code lbr_attrs and lbr_is_visible() are rather far away from one another
which makes the whole i thing hard to interpret.

Should we re-organize that a little?

2023-10-19 10:53:14

by Peter Zijlstra

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:

> +#define ARCH_LBR_EVENT_LOG_WIDTH 2
> +#define ARCH_LBR_EVENT_LOG_MASK 0x3

event log ?


> +static __always_inline void intel_pmu_update_lbr_event(u64 *lbr_events, int idx, int pos)
> +{
> + u64 logs = *lbr_events >> (LBR_INFO_EVENTS_OFFSET +
> + idx * ARCH_LBR_EVENT_LOG_WIDTH);
> +
> + logs &= ARCH_LBR_EVENT_LOG_MASK;
> + *lbr_events |= logs << (pos * ARCH_LBR_EVENT_LOG_WIDTH);
> +}
> +
> +/*
> + * The enabled order may be different from the counter order.
> + * Update the lbr_events with the enabled order.
> + */
> +static void intel_pmu_lbr_event_reorder(struct cpu_hw_events *cpuc,
> + struct perf_event *event)
> +{
> + int i, j, pos = 0, enabled[X86_PMC_IDX_MAX];
> + struct perf_event *leader, *sibling;
> +
> + leader = event->group_leader;
> + if (branch_sample_counters(leader))
> + enabled[pos++] = leader->hw.idx;
> +
> + for_each_sibling_event(sibling, leader) {
> + if (!branch_sample_counters(sibling))
> + continue;
> + enabled[pos++] = sibling->hw.idx;
> + }

Ok, so far so good: enabled[x] = y, is a mapping of hardware index (y)
to group order (x).

Although I would perhaps name that order[] instead of enabled[].

> +
> + if (!pos)
> + return;

How would we ever get here if this is the case?

> +
> + for (i = 0; i < cpuc->lbr_stack.nr; i++) {
> + for (j = 0; j < pos; j++)
> + intel_pmu_update_lbr_event(&cpuc->lbr_events[i], enabled[j], j);

But this confuses me... per that function it:

- extracts counter value for enabled[j] and,
- or's it into the same variable at j

But what if j is already taken by something else?

That is, suppose enabled[] = {3,2,1,0}, and lbr_events = 11 10 01 00

Then: for (j) intel_pmu_update_lbr_event(&lbr_event, enabled[j], j);

0: 3->0, 11 10 01 00 -> 11 10 01 11
1: 2->1, 11 10 01 11 -> 11 10 11 11
2: 1->2, 11 10 11 11 -> 11 11 11 11



> +
> + /* Clear the original counter order */
> + cpuc->lbr_events[i] &= ~LBR_INFO_EVENTS;
> + }
> +}

Would not something like:

src = lbr_events[i];
dst = 0;
for (j = 0; j < pos; j++) {
cnt = (src >> enabled[j]*2) & 3;
dst |= cnt << j*2;
}
lbr_events[i] = dst;

be *FAR* clearer, and actually work?
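
The trace and the src/dst suggestion above can be reproduced with a small userspace sketch. It assumes 2-bit counters and, as in the trace, that the fields being read and the fields being written occupy the same bit positions starting at bit 0; the quoted helper reads at LBR_INFO_EVENTS_OFFSET and writes at bit 0, which this sketch does not model. reorder_in_place() and reorder_src_dst() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

#define LBR_COUNTER_BITS 2
#define LBR_COUNTER_MASK ((1ULL << LBR_COUNTER_BITS) - 1)

/* OR each extracted field back into the word it is read from. */
static inline uint64_t reorder_in_place(uint64_t events, const int *order, int nr)
{
	assert(order && nr >= 0 && nr <= 64 / LBR_COUNTER_BITS);

	for (int j = 0; j < nr; j++) {
		uint64_t cnt = (events >> (order[j] * LBR_COUNTER_BITS)) &
			       LBR_COUNTER_MASK;

		events |= cnt << (j * LBR_COUNTER_BITS);
	}
	return events;
}

/* Read from an untouched source and build the result separately. */
static inline uint64_t reorder_src_dst(uint64_t src, const int *order, int nr)
{
	uint64_t dst = 0;

	assert(order && nr >= 0 && nr <= 64 / LBR_COUNTER_BITS);

	for (int j = 0; j < nr; j++) {
		uint64_t cnt = (src >> (order[j] * LBR_COUNTER_BITS)) &
			       LBR_COUNTER_MASK;

		dst |= cnt << (j * LBR_COUNTER_BITS);
	}
	return dst;
}
```

For order[] = {3,2,1,0} and the word 11 10 01 00 (0xe4), the in-place variant ends at 11 11 11 11 (0xff), matching the trace, while the src/dst variant yields 00 01 10 11 (0x1b), the fully reversed order.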

2023-10-19 11:01:24

by Peter Zijlstra

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:

> +static ssize_t branch_counter_width_show(struct device *cdev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return snprintf(buf, PAGE_SIZE, "2\n");
> +}

> +#define ARCH_LBR_EVENT_LOG_WIDTH 2

I'm assuming this is the same '2' ? And having it hard-coded in two
locations is awesome..

> +#define ARCH_LBR_EVENT_LOG_MASK 0x3

Should probably be ((1<<2)-1)

As per that other email, the naming is confusing, should this not be:

ARCH_LBR_EVENT_COUNTER_BITS

or, since it's all local to lbr.c something shorter still, like:

LBR_COUNTER_BITS

hmm?
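
A minimal sketch of what that could look like, with the width defined once and both the mask and the string shown to userspace derived from it. This is a userspace illustration only: LBR_COUNTER_MASK and branch_counter_width_format() are hypothetical names, and the latter merely stands in for the sysfs show callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define LBR_COUNTER_BITS 2
#define LBR_COUNTER_MASK ((1ULL << LBR_COUNTER_BITS) - 1)

/* Format the counter width from the single definition above. */
static inline int branch_counter_width_format(char *buf, size_t len)
{
	assert(buf && len > 0);

	return snprintf(buf, len, "%d\n", LBR_COUNTER_BITS);
}
```

Changing LBR_COUNTER_BITS then updates the mask and the reported width together, so the two cannot drift apart.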

2023-10-19 11:09:46

by Peter Zijlstra

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:

> @@ -3905,6 +3915,44 @@ static int intel_pmu_hw_config(struct perf_event *event)
> if (needs_branch_stack(event) && is_sampling_event(event))
> event->hw.flags |= PERF_X86_EVENT_NEEDS_BRANCH_STACK;
>
> + if (branch_sample_counters(event)) {
> + struct perf_event *leader, *sibling;
> +
> + if (!(x86_pmu.flags & PMU_FL_LBR_EVENT) ||
> + (event->attr.config & ~INTEL_ARCH_EVENT_MASK))
> + return -EINVAL;
> +
> + /*
> + * The event logging is not supported in the call stack mode
> + * yet, since we cannot simply flush the LBR during e.g.,
> + * multiplexing. Also, there is no obvious usage with the call
> + * stack mode. Simply forbids it for now.
> + *
> + * If any events in the group enable the LBR event logging
> + * feature, the group is treated as a LBR event logging group,
> + * which requires the extra space to store the counters.
> + */
> + leader = event->group_leader;
> + if (branch_sample_call_stack(leader))
> + return -EINVAL;
> + leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;

(superfluous whitespace before operator)

> +
> + for_each_sibling_event(sibling, leader) {
> + if (branch_sample_call_stack(sibling))
> + return -EINVAL;
> + }
> +
> + /*
> + * Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't
> + * require any branch stack setup.
> + * Clear the bit to avoid unnecessary branch stack setup.
> + */
> + if (0 == (event->attr.branch_sample_type &
> + ~(PERF_SAMPLE_BRANCH_PLM_ALL |
> + PERF_SAMPLE_BRANCH_COUNTERS)))
> + event->hw.flags &= ~PERF_X86_EVENT_NEEDS_BRANCH_STACK;
> + }

Does this / should this check the number of group members vs supported
number of lbr counters?
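
To make the question concrete, here is a userspace sketch of such a check. It is not kernel code and does not claim the series does or should do exactly this: struct mock_event and check_branch_counter_group() are hypothetical, and max_counters stands for whatever limit the hardware reports (for instance the value exposed as branch_counter_nr).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-event state the check needs. */
struct mock_event {
	bool branch_counters; /* event asked for PERF_SAMPLE_BRANCH_COUNTERS */
};

/*
 * Count the group members that want their occurrences logged and
 * reject the group if they exceed the supported number of counters.
 */
static inline int check_branch_counter_group(const struct mock_event *group,
					     size_t nr_events,
					     unsigned int max_counters)
{
	unsigned int used = 0;

	assert(group || nr_events == 0);

	for (size_t i = 0; i < nr_events; i++) {
		if (group[i].branch_counters)
			used++;
	}

	return used > max_counters ? -EINVAL : 0;
}
```

Only events that request logging are counted, so a group can still contain extra members that do not take up a slot in the counters word.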

2023-10-19 11:12:24

by Peter Zijlstra

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:
> +static __always_inline void get_lbr_events(struct cpu_hw_events *cpuc,
> + int i, u64 info)
> +{
> + /*
> + * The later code will decide what content can be disclosed
> + * to the perf tool. It's no harmful to unconditionally update
> + * the cpuc->lbr_events.
> + * Pleae see intel_pmu_lbr_event_reorder()
> + */
> + cpuc->lbr_events[i] = info & LBR_INFO_EVENTS;
> +}

You could be forcing an extra cachemiss here. A long time ago I had
hacks to profile perf with perf, but perhaps PT can be abused for that
now?

2023-10-19 13:58:39

by Liang, Kan

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging



On 2023-10-19 5:23 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:
>
>> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
>> index c3b0d15a9841..1e80a551a4c2 100644
>> --- a/arch/x86/events/intel/lbr.c
>> +++ b/arch/x86/events/intel/lbr.c
>> @@ -676,6 +676,21 @@ void intel_pmu_lbr_del(struct perf_event *event)
>> WARN_ON_ONCE(cpuc->lbr_users < 0);
>> WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
>> perf_sched_cb_dec(event->pmu);
>> +
>> + /*
>> + * The logged occurrences information is only valid for the
>> + * current LBR group. If another LBR group is scheduled in
>> + * later, the information from the stale LBRs will be wrongly
>> + * interpreted. Reset the LBRs here.
>> + * For the context switch, the LBR will be unconditionally
>> + * flushed when a new task is scheduled in. If both the new task
>> + * and the old task are monitored by a LBR event group. The
>> + * reset here is redundant. But the extra reset doesn't impact
>> + * the functionality. It's hard to distinguish the above case.
>> + * Keep the unconditionally reset for a LBR event group for now.
>> + */
>
> I found this really hard to read, also should this not rely on
> !cpuc->lbr_users ?
>

It's possible that the last LBR user is not in the branch_counters
group, e.g., a branch_counters group + several normal LBR events.
In this case, is_branch_counters_group(event) returns false for the
last LBR user, so the LBR will not be reset.

> As is, you'll reset the lbr for every event in the group.
>
>> + if (is_branch_counters_group(event))
>> + intel_pmu_lbr_reset();
>> }

Right, I forgot to change it after I modified the flag. :(

Here I think we should only clear the LBRs once for a branch_counters
group, e.g., in the leader event.

+ if (is_branch_counters_group(event) && event == event->group_leader)
+ intel_pmu_lbr_reset();

The only problem is that the leader event may not be an LBR event. But I
guess it should be OK to require in hw_config() that the leader event of
a branch_counters group be an LBR event.

Thanks,
Kan

2023-10-19 13:58:54

by Liang, Kan

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging



On 2023-10-19 5:26 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:
>
>> +
>> static struct attribute *lbr_attrs[] = {
>> &dev_attr_branches.attr,
>> + &dev_attr_branch_counter_nr.attr,
>> + &dev_attr_branch_counter_width.attr,
>> NULL
>> };
>>
>> @@ -5590,7 +5665,11 @@ mem_is_visible(struct kobject *kobj, struct attribute *attr, int i)
>> static umode_t
>> lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
>> {
>> - return x86_pmu.lbr_nr ? attr->mode : 0;
>> + /* branches */
>> + if (i == 0)
>> + return x86_pmu.lbr_nr ? attr->mode : 0;
>> +
>> + return (x86_pmu.flags & PMU_FL_LBR_EVENT) ? attr->mode : 0;
>> }
>
> As in the patch this is fairly readable, but I just checked and in the
> code lbr_attrs and lbr_is_visible() are rather far away from one another
> which makes the whole i thing hard to interpret.
>
> Should we re-organize that a little?

Sure, I will implement a separate patch to re-organize it.

It seems there are only two attribute groups which have both .attrs and
.is_visible: group_default and group_caps_lbr. I will re-organize both
of them.

Thanks,
Kan

2023-10-19 14:27:16

by Liang, Kan

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging



On 2023-10-19 6:52 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:
>
>> +#define ARCH_LBR_EVENT_LOG_WIDTH 2
>> +#define ARCH_LBR_EVENT_LOG_MASK 0x3
>
> event log ?

That's the name in the Intel spec. I will change it to the name used in
Linux and add a comment to map the name event log to the name branch
counter.

>
>
>> +static __always_inline void intel_pmu_update_lbr_event(u64 *lbr_events, int idx, int pos)
>> +{
>> + u64 logs = *lbr_events >> (LBR_INFO_EVENTS_OFFSET +
>> + idx * ARCH_LBR_EVENT_LOG_WIDTH);
>> +
>> + logs &= ARCH_LBR_EVENT_LOG_MASK;
>> + *lbr_events |= logs << (pos * ARCH_LBR_EVENT_LOG_WIDTH);
>> +}
>> +
>> +/*
>> + * The enabled order may be different from the counter order.
>> + * Update the lbr_events with the enabled order.
>> + */
>> +static void intel_pmu_lbr_event_reorder(struct cpu_hw_events *cpuc,
>> + struct perf_event *event)
>> +{
>> + int i, j, pos = 0, enabled[X86_PMC_IDX_MAX];
>> + struct perf_event *leader, *sibling;
>> +
>> + leader = event->group_leader;
>> + if (branch_sample_counters(leader))
>> + enabled[pos++] = leader->hw.idx;
>> +
>> + for_each_sibling_event(sibling, leader) {
>> + if (!branch_sample_counters(sibling))
>> + continue;
>> + enabled[pos++] = sibling->hw.idx;
>> + }
>
> Ok, so far so good: enabled[x] = y, is a mapping of hardware index (y)
> to group order (x).
>
> Although I would perhaps name that order[] instead of enabled[].

Sure

>
>> +
>> + if (!pos)
>> + return;
>
> How would we ever get here if this is the case?

It should be a bug. I will use a WARN_ON_ONCE() to replace it.

>
>> +
>> + for (i = 0; i < cpuc->lbr_stack.nr; i++) {
>> + for (j = 0; j < pos; j++)
>> + intel_pmu_update_lbr_event(&cpuc->lbr_events[i], enabled[j], j);
>
> But this confuses me... per that function it:
>
> - extracts counter value for enabled[j] and,
> - or's it into the same variable at j
>
> But what if j is already taken by something else?
>
> That is, suppose enabled[] = {3,2,1,0}, and lbr_events = 11 10 01 00
>
> Then: for (j) intel_pmu_update_lbr_event(&lbr_event, enabled[j], j);
>
> 0: 3->0, 11 10 01 00 -> 11 10 01 11
> 1: 2->1, 11 10 01 11 -> 11 10 11 11
> 2: 1->2, 11 10 11 11 -> 11 11 11 11
>
>
>
>> +
>> + /* Clear the original counter order */
>> + cpuc->lbr_events[i] &= ~LBR_INFO_EVENTS;
>> + }
>> +}
>
> Would not something like:
>
> src = lbr_events[i];
> dst = 0;
> for (j = 0; j < pos; j++) {
> cnt = (src >> enabled[j]*2) & 3;
> dst |= cnt << j*2
> }
> lbr_events[i] = dst;
>
> be *FAR* clearer, and actually work?

The original LBR event data is saved at offset 32 of the LBR_INFO
register. In get_lbr_events(), the data was simply copied to offset 32
of cpuc->lbr_events.

intel_pmu_update_lbr_event() reorders the values and saves them
starting from offset 0.

I agree it's hard to read since it combines the src and dst into the
same variable.

I will use the suggested code and also update the get_lbr_events().

cpuc->lbr_events[i] = (info >> 32) & LBR_INFO_EVENTS;
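
A minimal user-space sketch of the separate src/dst approach, for
illustration only. The helper names lbr_info_counters() and
lbr_counters_reorder(), LBR_COUNTER_MASK and MAX_COUNTERS are assumptions;
LBR_INFO_EVENTS is taken as defined in msr-index.h in this thread, i.e. a
mask that is already shifted to bit 32, so the sketch applies the mask
before shifting down.

```c
#include <assert.h>
#include <stdint.h>

#define LBR_INFO_EVENTS_OFFSET	32
#define LBR_INFO_EVENTS		(0xffull << LBR_INFO_EVENTS_OFFSET)
#define LBR_COUNTER_BITS	2
#define LBR_COUNTER_MASK	((1u << LBR_COUNTER_BITS) - 1)
#define MAX_COUNTERS		4	/* assumption: 8 bits / 2 bits per counter */

/* Extract the raw counter field from LBR_INFO: mask first, then shift. */
static uint64_t lbr_info_counters(uint64_t info)
{
	return (info & LBR_INFO_EVENTS) >> LBR_INFO_EVENTS_OFFSET;
}

/*
 * order[j] is the hardware counter index of the j-th enabled event.
 * The counter for order[j] is read from src and written to slot j of dst,
 * so the source is never modified while it is still being read.
 */
static uint64_t lbr_counters_reorder(uint64_t src, const int *order, int nr)
{
	uint64_t dst = 0;
	int j;

	assert(nr >= 0 && nr <= MAX_COUNTERS);

	for (j = 0; j < nr; j++) {
		uint64_t cnt = (src >> (order[j] * LBR_COUNTER_BITS)) &
			       LBR_COUNTER_MASK;

		dst |= cnt << (j * LBR_COUNTER_BITS);
	}

	return dst;
}
```

With order[] = {3,2,1,0} and the counters 11 10 01 00 (0xe4) from the example
above, this yields 00 01 10 11 (0x1b): each slot holds the counter of the
event enabled in that position, and nothing is OR'ed on top of stale bits.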

Thanks,
Kan

2023-10-19 14:29:14

by Liang, Kan

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging



On 2023-10-19 7:00 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:
>
>> +static ssize_t branch_counter_width_show(struct device *cdev,
>> + struct device_attribute *attr,
>> + char *buf)
>> +{
>> + return snprintf(buf, PAGE_SIZE, "2\n");
>> +}
>
>> +#define ARCH_LBR_EVENT_LOG_WIDTH 2
>
> I'm assuming this is the same '2' ? And having it hard-coded in two
> locations is awesome..
>
>> +#define ARCH_LBR_EVENT_LOG_MASK 0x3
>
> Should probably be ((1<<2)-1)
>
> As per that other email, the naming is confusing, should this not be:
>
> ARCH_LBR_EVENT_COUNTER_BITS
>
> or, since it's all local to lbr.c something shorter still, like:
>
> LBR_COUNTER_BITS
>
> hmm?

Sure, I will use the name LBR_COUNTER_BITS.
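
Something along these lines would keep the '2' in one place. This is only a
user-space sketch: branch_counter_width_format() is a hypothetical stand-in
for the sysfs show function, and plain snprintf() is used instead of the
kernel helpers.

```c
#include <assert.h>
#include <stdio.h>

/* Single definition of the per-counter width; the mask is derived from it. */
#define LBR_COUNTER_BITS	2
#define LBR_COUNTER_MASK	((1 << LBR_COUNTER_BITS) - 1)

static_assert(LBR_COUNTER_MASK == 0x3, "mask must cover LBR_COUNTER_BITS bits");

/*
 * Formats the counter width the way the branch_counter_width attribute
 * reports it, but from the macro rather than a second hard-coded "2".
 * Returns the number of characters written, like snprintf().
 */
static int branch_counter_width_format(char *buf, size_t size)
{
	return snprintf(buf, size, "%d\n", LBR_COUNTER_BITS);
}
```

Changing LBR_COUNTER_BITS then updates the mask and the reported width
together.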

Thanks,
Kan

2023-10-19 14:33:50

by Liang, Kan

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging



On 2023-10-19 7:09 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:
>
>> @@ -3905,6 +3915,44 @@ static int intel_pmu_hw_config(struct perf_event *event)
>> if (needs_branch_stack(event) && is_sampling_event(event))
>> event->hw.flags |= PERF_X86_EVENT_NEEDS_BRANCH_STACK;
>>
>> + if (branch_sample_counters(event)) {
>> + struct perf_event *leader, *sibling;
>> +
>> + if (!(x86_pmu.flags & PMU_FL_LBR_EVENT) ||
>> + (event->attr.config & ~INTEL_ARCH_EVENT_MASK))
>> + return -EINVAL;
>> +
>> + /*
>> + * The event logging is not supported in the call stack mode
>> + * yet, since we cannot simply flush the LBR during e.g.,
>> + * multiplexing. Also, there is no obvious usage with the call
>> + * stack mode. Simply forbids it for now.
>> + *
>> + * If any events in the group enable the LBR event logging
>> + * feature, the group is treated as a LBR event logging group,
>> + * which requires the extra space to store the counters.
>> + */
>> + leader = event->group_leader;
>> + if (branch_sample_call_stack(leader))
>> + return -EINVAL;
>> + leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
>
> (superfluous whitespace before operator)
>
>> +
>> + for_each_sibling_event(sibling, leader) {
>> + if (branch_sample_call_stack(sibling))
>> + return -EINVAL;
>> + }
>> +
>> + /*
>> + * Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't
>> + * require any branch stack setup.
>> + * Clear the bit to avoid unnecessary branch stack setup.
>> + */
>> + if (0 == (event->attr.branch_sample_type &
>> + ~(PERF_SAMPLE_BRANCH_PLM_ALL |
>> + PERF_SAMPLE_BRANCH_COUNTERS)))
>> + event->hw.flags &= ~PERF_X86_EVENT_NEEDS_BRANCH_STACK;
>> + }
>
> Does this / should this check the number of group members vs supported
> number of lbr counters?

Sure, I will add the check here for the numbers, so perf can error out
earlier.
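
A rough user-space sketch of such a check, for illustration only. The names
model_group_event and model_check_branch_counters() are hypothetical, and
counting only the events that request branch counters (rather than every
group member) is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced model of one event in a group. */
struct model_group_event {
	bool branch_counters;	/* event asked for PERF_SAMPLE_BRANCH_COUNTERS */
};

/*
 * Returns 0 if the group fits into the available LBR counters,
 * -EINVAL if more events ask for branch counters than are supported.
 */
static int model_check_branch_counters(const struct model_group_event *group,
					int nr_events, int nr_supported)
{
	int i, num = 0;

	assert(nr_events >= 0);

	for (i = 0; i < nr_events; i++) {
		if (group[i].branch_counters)
			num++;
	}

	return num > nr_supported ? -EINVAL : 0;
}
```

Doing this at hw_config() time lets the error surface when the event is
created instead of when the group is scheduled.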

Thanks,
Kan

2023-10-19 18:18:51

by Peter Zijlstra

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging

On Thu, Oct 19, 2023 at 10:26:01AM -0400, Liang, Kan wrote:

> The original LBR event data is saved at offset 32 of LBR_INFO register.
> In get_lbr_events(), the data was simply copied to the offset 32 of
> cpuc->lbr_events.

Urgh, missed that. Clearly reading is a skill :-)

>
> The intel_pmu_update_lbr_event() reorders the value and saves it
> starting from the offset 0.
>
> I agree it's hard to read since it combines the src and dst into the
> same variable.
>
> I will use the suggested code and also update the get_lbr_events().

Thanks!

2023-10-20 12:46:32

by Liang, Kan

Subject: Re: [PATCH V4 4/7] perf/x86/intel: Support LBR event logging



On 2023-10-19 7:12 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, [email protected] wrote:
>> +static __always_inline void get_lbr_events(struct cpu_hw_events *cpuc,
>> + int i, u64 info)
>> +{
>> + /*
>> + * The later code will decide what content can be disclosed
>> + * to the perf tool. It does no harm to unconditionally update
>> + * the cpuc->lbr_events.
>> + * Please see intel_pmu_lbr_event_reorder()
>> + */
>> + cpuc->lbr_events[i] = info & LBR_INFO_EVENTS;
>> +}
>
> You could be forcing an extra cachemiss here.

This is to temporarily store the branch counter information. Maybe we
can leverage the reserved field of cpuc->lbr_entries[i] to avoid the
cachemiss.

e->reserved = info & LBR_INFO_COUNTERS;

I tried to add something like a static_assert to check the size of the
reserved field in case the field is shrunk later. But the reserved field
is a bit field, and I have no idea how to get the exact size of a bit
field other than defining a macro. Is something like the below OK? Any
suggestions are appreciated.


diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 1e80a551a4c2..62675593e39a 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1582,6 +1582,8 @@ static bool is_arch_lbr_xsave_available(void)
return true;
}

+static_assert((64 - PERF_BRANCH_ENTRY_INFO_BITS_MAX) >
+	      LBR_INFO_COUNTERS_MAX_NUM * 2);
+
void __init intel_pmu_arch_lbr_init(void)
{
struct pmu *pmu = x86_get_pmu(smp_processor_id());
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f220c3598d03..e9ff8eba5efd 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -238,6 +238,7 @@
#define LBR_INFO_BR_TYPE (0xfull << LBR_INFO_BR_TYPE_OFFSET)
#define LBR_INFO_EVENTS_OFFSET 32
#define LBR_INFO_EVENTS (0xffull << LBR_INFO_EVENTS_OFFSET)
+#define LBR_INFO_COUNTERS_MAX_NUM 4

#define MSR_ARCH_LBR_CTL 0x000014ce
#define ARCH_LBR_CTL_LBREN BIT(0)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4461f380425b..3a64499b0f5d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1437,6 +1437,9 @@ struct perf_branch_entry {
reserved:31;
};

+/* Size of used info bits in struct perf_branch_entry */
+#define PERF_BRANCH_ENTRY_INFO_BITS_MAX 33
+
union perf_sample_weight {
__u64 full;
#if defined(__LITTLE_ENDIAN_BITFIELD)
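
To show the macro-based size check in isolation, here is a self-contained
user-space sketch. struct model_branch_entry and model_stash_counters() are
hypothetical; the struct lumps all used info bits into one field instead of
reproducing the real perf_branch_entry layout, and it relies on 64-bit bit
fields, which GCC and Clang accept.

```c
#include <assert.h>
#include <stdint.h>

#define PERF_BRANCH_ENTRY_INFO_BITS_MAX	33
#define LBR_INFO_COUNTERS_MAX_NUM	4
#define LBR_COUNTER_BITS		2
#define LBR_COUNTERS_MASK \
	((1ull << (LBR_INFO_COUNTERS_MAX_NUM * LBR_COUNTER_BITS)) - 1)

/* Hypothetical model: the used info bits collapsed into a single field. */
struct model_branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t info:PERF_BRANCH_ENTRY_INFO_BITS_MAX,
		 reserved:64 - PERF_BRANCH_ENTRY_INFO_BITS_MAX;
};

/* sizeof cannot be applied to a bit field, so the width comes from the macro. */
static_assert((64 - PERF_BRANCH_ENTRY_INFO_BITS_MAX) >
	      LBR_INFO_COUNTERS_MAX_NUM * LBR_COUNTER_BITS,
	      "reserved field too small to hold the branch counters");
static_assert(sizeof(struct model_branch_entry) == 3 * sizeof(uint64_t),
	      "info and reserved must share one 64-bit word");

/* Stash the counter bits in the reserved field and return what was stored. */
static uint64_t model_stash_counters(struct model_branch_entry *e,
				     uint64_t counters)
{
	e->reserved = counters & LBR_COUNTERS_MASK;
	return e->reserved;
}
```

If the used bits ever grow so that fewer than 9 reserved bits remain, the
first assertion fails at build time, provided the macro is updated together
with the struct.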



> A long time ago I had
> hacks to profile perf with perf, but perhaps PT can be abused for that
> now?

As I understand it, PT can only give trace information and may not
tell whether there is a cachemiss or the like.
I will take a deeper look and see if PT can help in this case.

Thanks,
Kan