2024-01-23 11:34:52

by Ben Gainey

[permalink] [raw]
Subject: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

I've been working on an approach to supporting per-function metrics for
aarch64 cores, which requires some changes to the arm_pmuv3 driver, and
I'm wondering if this approach would make sense as a generic feature
that could be used to enable the same on other architectures?

The basic idea is as follows:

 * Periodically sample one or more counters as needed for the chosen
   set of metrics.
 * Record a sample count for each symbol so as to identify hot
   functions.
 * Accumulate counter totals for each of the counters in each of the
   metrics *but* only do this where the previous sample's symbol
   matches the current sample's symbol.

Discarding the counter deltas for any given sample is important to
ensure that counters are correctly attributed to a single function;
without this step the resulting metrics trend towards some average
value across the top 'n' functions. It is acknowledged that this
heuristic can fail, for example if samples land either side of a
nested call, so a sufficiently small sample window over which the
counters are collected must be used to reduce the risk of
misattribution.
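As a rough sketch of the accumulation step (symbol and counter names
here are purely illustrative, not the actual tool's data model):

```python
from collections import defaultdict

def accumulate(samples):
    """Post-process a stream of (symbol, counter_deltas) samples.

    A sample count is always recorded per symbol (for hot-function
    ranking), but counter deltas are accumulated only when the previous
    sample landed in the same symbol, so that each accumulated delta is
    attributable to a single function."""
    hits = defaultdict(int)                          # per-symbol sample counts
    totals = defaultdict(lambda: defaultdict(int))   # per-symbol counter totals
    prev_symbol = None
    for symbol, deltas in samples:
        hits[symbol] += 1
        if symbol == prev_symbol:                    # whole window in one function
            for counter, value in deltas.items():
                totals[symbol][counter] += value
        prev_symbol = symbol
    return hits, totals
```

A sample whose predecessor landed in a different function still counts
towards that function's hotness, but contributes no counter data.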

So far, this can be achieved without any further modifications to perf
tools or the kernel. However, as noted, it requires the counter
collection window to be sufficiently small; in testing on
Neoverse-N1/-V1, a window of about 300 cycles gives good results. Using
the cycle counter with a sample_period of 300 is possible, but such an
approach generates a significant amount of capture data and introduces
a lot of overhead and probe effect. Whilst the kernel will throttle
such a configuration, which helps to mitigate a large portion of the
bandwidth and capture overhead, throttling cannot be controlled on a
per-event basis or by non-root users, and because it is controlled as
a percentage of time its effects vary from machine to machine.

For this to work efficiently, it is useful to provide a means to
decouple the sample window (time over which events are counted) from
the sample period (time between interesting samples). This patchset
modifies the Arm PMU driver to support alternating between two
sample_period values, providing a simple and inexpensive way for tools
to separate out the sample period and the sample window. It is expected
to be used with the cycle counter event, alternating between a long and
short period and subsequently discarding the counter data for samples
with the long period. The combined long and short period gives the
overall sampling period, and the short sample period gives the sample
window. The symbol taken from the sample at the end of the long period
can be used by tools to ensure correct attribution as described
previously. The cycle counter is recommended as it provides fair
temporal distribution of samples, as required for the per-symbol
sample count mentioned previously, and because the PMU can be
programmed to overflow after a sufficiently short window; this may not
be possible with a software timer (for example). This patch does not
restrict use to the cycle counter; there could be other novel uses
based on different events.
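The long/short split and the discard step described above can be
sketched as follows (an illustrative model only, assuming samples
alternate long, short, long, short, ...):

```python
def strobe_periods(total_period, window):
    """Split an overall sampling period into the (long, short) pair used
    for strobing: the short period is the counting window, the long
    period covers the remainder."""
    assert 0 < window < total_period
    return total_period - window, window


def keep_short_samples(samples):
    """Samples alternate long, short, long, short, ...  Counter deltas
    from long-period samples are discarded; the symbol at the end of
    each long period acts as the 'previous symbol' check for the short
    sample that follows it."""
    kept = []
    for i in range(1, len(samples), 2):          # short-period samples
        prev_symbol, _ = samples[i - 1]          # end of preceding long period
        symbol, deltas = samples[i]
        if symbol == prev_symbol:                # whole window in one function
            kept.append((symbol, deltas))
    return kept
```

With the cover letter's numbers, `strobe_periods(1_000_000, 300)` gives
the long period 999700 and the short period 300.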

To test this I have developed a simple `perf script`-based Python
script. For a limited set of Arm PMU events it will post-process a
`perf record` capture and generate a table of metrics. Alongside this I
have developed a benchmark application that rotates through a sequence
of different classes of behaviour that can be detected by the Arm PMU
(e.g. mispredicts, cache misses, different instruction mixes). The path
through the benchmark can be rotated after each iteration so as to
ensure the results don't land on some lucky harmonic with the sample
period. The script can be used with and without the strobing patch,
allowing comparison of the results. Testing was on Juno (A53+A57),
N1SDP, and Graviton 2 and 3. In addition, this approach has been
applied to a few of Arm's tools projects and has correctly identified
improvements and regressions.

Headline results from testing indicate that a ~300-cycle sample window
gives good results with or without the strobing patch. When the
strobing patch is used, the resulting `perf.data` files are typically
25-50x smaller than without, and take ~25x less time for the python
script to post-process. Without strobing, the test application's
runtime was 20x slower when sampling every 300 cycles as compared to
every 1000000 cycles. With strobing enabled such that the long period
was 999700 cycles and the short period was 300 cycles, the test
application's runtime was only 1.2x slower than when sampling every
1000000 cycles. Notably, without the patch, L1D cache miss rates are
significantly higher than with the patch, which we attribute to the
increased impact on the cache of trapping into the kernel every 300
cycles. These results are given with `perf_cpu_time_max_percent=25`.
When tested with `perf_cpu_time_max_percent=100` the size and time
comparisons are more significant. Disabling throttling did not lead to
obvious improvements in the collected metrics, suggesting that the
sampling approach is sufficient to collect representative metrics.

Cursory testing on a Xeon(R) W-2145 sampling every 300 cycles (without
the patch) suggests this approach would work for some counters.
Calculating branch miss rates for example appears to be correct,
likewise UOPS_EXECUTED.THREAD seems to give something like a sensible
cycles-per-uop value. On the other hand the fixed-function instructions
counter does not appear to sample correctly (it seems to report either
very small or very large numbers). No idea what's going on there, so
any insight is welcome...

This patch modifies the arm_pmu and introduces an additional field in
config2 to configure the behaviour. If we think there is broad
applicability, would it make sense to move into perf_event_attr flags
or field, and possibly pull up into core? If we don't think so, then
some feedback around the core header and arm_pmu modifications would
be appreciated.

A copy of the post-processing script is available at
https://github.com/ARM-software/gator/blob/prototypes/strobing/prototypes/strobing-patches/test-script/generate-function-metrics.py
note that the script itself has a dependency on
https://lore.kernel.org/linux-perf-users/[email protected]/


Ben Gainey (2):
arm_pmu: Allow the PMU to alternate between two sample_period values.
arm_pmuv3: Add config bits for sample period strobing

drivers/perf/arm_pmu.c | 74 +++++++++++++++++++++++++++++-------
drivers/perf/arm_pmuv3.c | 25 ++++++++++++
include/linux/perf/arm_pmu.h | 1 +
include/linux/perf_event.h | 10 ++++-
4 files changed, 95 insertions(+), 15 deletions(-)

--
2.43.0



2024-01-23 11:35:26

by Ben Gainey

[permalink] [raw]
Subject: [RFC PATCH 2/2] arm_pmuv3: Add config bits for sample period strobing

Expose the ability to alternate between sample periods to tools. A
`strobe_period` field is defined in config2 to hold the alternate
sample period; a non-zero value enables strobing.

Signed-off-by: Ben Gainey <[email protected]>
---
drivers/perf/arm_pmuv3.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 23fa6c5da82c4..66b0219111bb8 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -318,6 +318,9 @@ static const struct attribute_group armv8_pmuv3_events_attr_group = {
#define ATTR_CFG_FLD_threshold_CFG config1 /* PMEVTYPER.TH */
#define ATTR_CFG_FLD_threshold_LO 5
#define ATTR_CFG_FLD_threshold_HI 16
+#define ATTR_CFG_FLD_strobe_period_CFG config2
+#define ATTR_CFG_FLD_strobe_period_LO 0
+#define ATTR_CFG_FLD_strobe_period_HI 31

GEN_PMU_FORMAT_ATTR(event);
GEN_PMU_FORMAT_ATTR(long);
@@ -325,6 +328,7 @@ GEN_PMU_FORMAT_ATTR(rdpmc);
GEN_PMU_FORMAT_ATTR(threshold_count);
GEN_PMU_FORMAT_ATTR(threshold_compare);
GEN_PMU_FORMAT_ATTR(threshold);
+GEN_PMU_FORMAT_ATTR(strobe_period);

static int sysctl_perf_user_access __read_mostly;

@@ -352,6 +356,16 @@ static u8 armv8pmu_event_threshold_control(struct perf_event_attr *attr)
return (th_compare << 1) | th_count;
}

+static inline u32 armv8pmu_event_strobe_period(struct perf_event *event)
+{
+ return ATTR_CFG_GET_FLD(&event->attr, strobe_period);
+}
+
+static inline bool armv8pmu_event_want_strobe(struct perf_event *event)
+{
+ return armv8pmu_event_strobe_period(event) != 0;
+}
+
static struct attribute *armv8_pmuv3_format_attrs[] = {
&format_attr_event.attr,
&format_attr_long.attr,
@@ -359,6 +373,7 @@ static struct attribute *armv8_pmuv3_format_attrs[] = {
&format_attr_threshold.attr,
&format_attr_threshold_compare.attr,
&format_attr_threshold_count.attr,
+ &format_attr_strobe_period.attr,
NULL,
};

@@ -1125,6 +1140,16 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
if (armv8pmu_event_is_64bit(event))
event->hw.flags |= ARMPMU_EVT_64BIT;

+ /*
+ * Support alternating between two sample periods
+ */
+ if (armv8pmu_event_want_strobe(event)) {
+ u32 strobe_period = armv8pmu_event_strobe_period(event);
+ armpmu_set_strobe_period(&(event->hw), strobe_period);
+ } else {
+ armpmu_set_strobe_period(&(event->hw), 0);
+ }
+
/*
* User events must be allocated into a single counter, and so
* must not be chained.
--
2.43.0


2024-01-23 11:42:16

by Ben Gainey

[permalink] [raw]
Subject: [RFC PATCH 1/2] arm_pmu: Allow the PMU to alternate between two sample_period values.

The arm PMU does not provide any mechanism for decoupling the period over
which counters are counted from the period between samples. This is
problematic for building a tool to measure per-function metrics derived
from a sampled counter group. Ideally such a tool wants a very small
sample window in order to correctly attribute the metrics to a given
function, but prefers a larger sample period that provides representative
coverage without excessive probe effect, triggering throttling, or
generating excessive amounts of data.

By alternating between a long and short sample_period and subsequently
discarding the long samples, tools may decouple the period between
samples that the tool cares about from the window of time over which
interesting counts are collected.

It is expected that typically tools would use this feature with the
cycles or instructions events as an approximation for time, but no
restrictions are applied to which events this can be applied to.

Signed-off-by: Ben Gainey <[email protected]>
---
drivers/perf/arm_pmu.c | 74 +++++++++++++++++++++++++++++-------
include/linux/perf/arm_pmu.h | 1 +
include/linux/perf_event.h | 10 ++++-
3 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 8458fe2cebb4f..58e40dbabfc3f 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -99,6 +99,17 @@ static const struct pmu_irq_ops percpu_pmunmi_ops = {
.free_pmuirq = armpmu_free_percpu_pmunmi
};

+static inline bool armpmu_is_strobe_enabled(struct hw_perf_event *hwc)
+{
+ return hwc->strobe_period != 0;
+}
+
+void armpmu_set_strobe_period(struct hw_perf_event *hwc, u32 period)
+{
+ hwc->strobe_period = period;
+ hwc->strobe_active = false;
+}
+
static DEFINE_PER_CPU(struct arm_pmu *, cpu_armpmu);
static DEFINE_PER_CPU(int, cpu_irq);
static DEFINE_PER_CPU(const struct pmu_irq_ops *, cpu_irq_ops);
@@ -202,22 +213,45 @@ int armpmu_event_set_period(struct perf_event *event)
struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
struct hw_perf_event *hwc = &event->hw;
s64 left = local64_read(&hwc->period_left);
- s64 period = hwc->sample_period;
- u64 max_period;
+ s64 period_active = hwc->sample_period;
+ u64 max_period = arm_pmu_event_max_period(event);
int ret = 0;

- max_period = arm_pmu_event_max_period(event);
- if (unlikely(left <= -period)) {
- left = period;
- local64_set(&hwc->period_left, left);
- hwc->last_period = period;
- ret = 1;
- }
+ if (likely(!armpmu_is_strobe_enabled(hwc))) {
+ if (unlikely(left <= -period_active)) {
+ left = period_active;
+ local64_set(&hwc->period_left, left);
+ hwc->last_period = period_active;
+ ret = 1;
+ }
+
+ if (unlikely(left <= 0)) {
+ left += period_active;
+ local64_set(&hwc->period_left, left);
+ hwc->last_period = period_active;
+ ret = 1;
+ }
+ } else if (unlikely(left <= 0)) {
+ s64 new_period;
+ bool new_active;
+
+ /*
+ * When strobing is enabled, do not attempt to adjust the
+ * period based on the previous overflow, instead just
+ * alternate between the two periods
+ */
+ if (hwc->strobe_active) {
+ new_period = period_active;
+ new_active = false;
+ } else {
+ new_period = hwc->strobe_period;
+ new_active = true;
+ }

- if (unlikely(left <= 0)) {
- left += period;
+ left = new_period;
local64_set(&hwc->period_left, left);
- hwc->last_period = period;
+ hwc->last_period = new_period;
+ hwc->strobe_active = new_active;
ret = 1;
}

@@ -448,6 +482,9 @@ __hw_perf_event_init(struct perf_event *event)
int mapping, ret;

hwc->flags = 0;
+ hwc->strobe_active = false;
+ hwc->strobe_period = 0;
+
mapping = armpmu->map_event(event);

if (mapping < 0) {
@@ -456,6 +493,15 @@ __hw_perf_event_init(struct perf_event *event)
return mapping;
}

+ if (armpmu_is_strobe_enabled(hwc)) {
+ if (event->attr.freq)
+ return -EINVAL;
+ if (hwc->strobe_period == 0)
+ return -EINVAL;
+ if (hwc->strobe_period >= event->attr.sample_period)
+ return -EINVAL;
+ }
+
/*
* We don't assign an index until we actually place the event onto
* hardware. Use -1 to signify that we haven't decided where to put it
@@ -488,8 +534,8 @@ __hw_perf_event_init(struct perf_event *event)
* is far less likely to overtake the previous one unless
* you have some serious IRQ latency issues.
*/
- hwc->sample_period = arm_pmu_event_max_period(event) >> 1;
- hwc->last_period = hwc->sample_period;
+ hwc->sample_period = arm_pmu_event_max_period(event) >> 1;
+ hwc->last_period = hwc->sample_period;
local64_set(&hwc->period_left, hwc->sample_period);
}

diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index b3b34f6670cfb..3ee74382e7a93 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -175,6 +175,7 @@ void armpmu_free(struct arm_pmu *pmu);
int armpmu_register(struct arm_pmu *pmu);
int armpmu_request_irq(int irq, int cpu);
void armpmu_free_irq(int irq, int cpu);
+void armpmu_set_strobe_period(struct hw_perf_event *hwc, u32 period);

#define ARMV8_PMU_PDEV_NAME "armv8-pmu"

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a9..7ef3f39fe6171 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -157,7 +157,15 @@ struct hw_perf_event {
union {
struct { /* hardware */
u64 config;
- u64 last_tag;
+ union {
+ /* for s390 and x86 */
+ u64 last_tag;
+ /* for arm_pmu */
+ struct {
+ u32 strobe_period;
+ bool strobe_active;
+ };
+ };
unsigned long config_base;
unsigned long event_base;
int event_base_rdpmc;
--
2.43.0


2024-02-14 09:43:14

by Ben Gainey

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] perf: A mechanism for efficient support for per-function metrics

Hi all

Appreciate everyone is busy, but if you have some time I'd appreciate
some comments, particularly around whether this make sense to approach
as an Arm-only PMU feature or has wider applicability?

Thanks
Ben




On Tue, 2024-01-23 at 11:34 +0000, Ben Gainey wrote:
> [cover letter quoted in full; trimmed]

2024-02-14 09:55:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

Ben Gainey <[email protected]> writes:

> I've been working on an approach to supporting per-function metrics for
> aarch64 cores, which requires some changes to the arm_pmuv3 driver, and
> I'm wondering if this approach would make sense as a generic feature
> that could be used to enable the same on other architectures?
>
> The basic idea is as follows:
>
> * Periodically sample one or more counters as needed for the chosen
> set of metrics.
> * Record a sample count for each symbol so as to identify hot
> functions.
> * Accumulate counter totals for each of the counters in each of the
> metrics *but* only do this where the previous sample's symbol
> matches the current sample's symbol.

It sounds very similar to what perf script -F +metric already does
(or did, if it weren't currently broken). It would be a straightforward
extension here to add this "same as previous" check.

Of course the feature is somewhat dubious in that it will have a very
strong systematic bias against short functions and even long functions
in some alternating execution patterns. I assume you did some
experiments to characterize this. It would be important
to emphasize this in any documentation.

> For this to work efficiently, it is useful to provide a means to
> decouple the sample window (time over which events are counted) from
> the sample period (time between interesting samples). This patcheset
> modifies the Arm PMU driver to support alternating between two
> sample_period values, providing a simple and inexpensive way for tools
> to separate out the sample period and the sample window. It is expected
> to be used with the cycle counter event, alternating between a long and
> short period and subsequently discarding the counter data for samples
> with the long period. The combined long and short period gives the
> overall sampling period, and the short sample period gives the sample
> window. The symbol taken from the sample at the end of the long period
> can be used by tools to ensure correct attribution as described
> previously. The cycle counter is recommended as it provides fair
> temporal distribution of samples as would be required for the
> per-symbol sample count mentioned previously, and because the PMU can
> be programmed to overflow after a sufficiently short window; this may
> not be possible with software timer (for example). This patch does not
> restrict to only the cycle counter, it is possible there could be other
> novel uses based on different events.

I don't see anything ARM-specific about the technique, so if it's done
it should be done generically, IMHO.


> Cursory testing on a Xeon(R) W-2145 sampling every 300 cycles (without
> the patch) suggests this approach would work for some counters.
> Calculating branch miss rates for example appears to be correct,
> likewise UOPS_EXECUTED.THREAD seems to give something like a sensible
> cycles-per-uop value. On the other hand the fixed function instructions
> counter does not appear to sample correctly (it seems to report either
> very small or very large numbers). No idea whats going on there, so any
> insight welcome...

If you use precise samples with 3p there is a restriction on the periods
that is enforced by the kernel. Non precise or single/double p should
support arbitrary, except that any p is always period + 1.

One drawback of the technique on x86 is that it won't allow
multi-record PEBS (collecting samples without interrupts), so the
overhead might be intrinsically higher.

-Andi

2024-02-14 19:15:01

by Ben Gainey

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

Hi Andi

Thanks for commenting...

On Wed, 2024-02-14 at 01:55 -0800, Andi Kleen wrote:
> Ben Gainey <[email protected]> writes:
>
> > I've been working on an approach to supporting per-function metrics
> > for
> > aarch64 cores, which requires some changes to the arm_pmuv3 driver,
> > and
> > I'm wondering if this approach would make sense as a generic
> > feature
> > that could be used to enable the same on other architectures?
> >
> > The basic idea is as follows:
> >
> >  * Periodically sample one or more counters as needed for the
> > chosen
> >    set of metrics.
> >  * Record a sample count for each symbol so as to identify hot
> >    functions.
> >  * Accumulate counter totals for each of the counters in each of
> > the
> >    metrics *but* only do this where the previous sample's symbol
> >    matches the current sample's symbol.
>
> It sounds very similar to what perf script -F +metric already does
> (or did if it wasn't broken currently). It would be a straight
> forward
> extension here to add this "same as previous" check.


Nice, I wasn't aware of this feature. I'll have a play...


>
> Of course the feature is somewhat dubious in that it will have a very
> strong systematic bias against short functions and even long
> functions
> in some alternating execution patterns. I assume you did some
> experiments to characterize this. It would be important
> to emphasize this in any documentation.

The way I have been thinking about this is that for each sample you
always maintain a periodic sample count so that the relative ranking of
functions is maintained, and that the "same as previous" check is a way
to enhance the attributability of the PMU data for any given sample.

But it is absolutely correct to say that this will bias the
availability of PMU data in the way you have described. The bias
depends on sample window size, workload characteristics and so on.

It should be possible to provide a per-metric "valid sample" count that
can be used to judge the "quality" of the metrics for each symbol,
which may allow the user to make some adjustments to the recording
parameters (modify the sample period or sample window size, for
example).
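As a sketch, such a quality count could fall out of the tool's
post-processing like this (names illustrative, not the actual script):

```python
def sample_quality(symbol_stream):
    """Per-symbol sample quality: 'valid' samples are those whose
    preceding sample landed in the same symbol, i.e. those whose
    counter deltas survive the same-as-previous check.  Returns
    {symbol: (total, valid, valid/total)}."""
    total, valid, prev = {}, {}, None
    for symbol in symbol_stream:
        total[symbol] = total.get(symbol, 0) + 1
        if symbol == prev:
            valid[symbol] = valid.get(symbol, 0) + 1
        prev = symbol
    return {s: (t, valid.get(s, 0), valid.get(s, 0) / t)
            for s, t in total.items()}
```

A low ratio for a hot symbol would then flag its metrics as
poorly-attributed, prompting a smaller window or different period.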

I'll have a think about the best way to convey this in the docs. I have
a few ideas for ways to further improve the attributability / identify
cases where short functions are missed, but they'd not affect the
implementation of anything in the kernel, just perhaps the tool's
post-processing.


>
> > For this to work efficiently, it is useful to provide a means to
> > decouple the sample window (time over which events are counted)
> > from
> > the sample period (time between interesting samples). This
> > patcheset
> > modifies the Arm PMU driver to support alternating between two
> > sample_period values, providing a simple and inexpensive way for
> > tools
> > to separate out the sample period and the sample window. It is
> > expected
> > to be used with the cycle counter event, alternating between a long
> > and
> > short period and subsequently discarding the counter data for
> > samples
> > with the long period. The combined long and short period gives the
> > overall sampling period, and the short sample period gives the
> > sample
> > window. The symbol taken from the sample at the end of the long
> > period
> > can be used by tools to ensure correct attribution as described
> > previously. The cycle counter is recommended as it provides fair
> > temporal distribution of samples as would be required for the
> > per-symbol sample count mentioned previously, and because the PMU
> > can
> > be programmed to overflow after a sufficiently short window; this
> > may
> > not be possible with software timer (for example). This patch does
> > not
> > restrict to only the cycle counter, it is possible there could be
> > other
> > novel uses based on different events.
>
> I don't see anything ARM specific with the technique, so if it's done
> it should be done generically IMHO


Great. When I was originally thinking about the implementation of the
event strobing feature, the plan was:

* Add a `strobe_sample` flag bit to opt into the feature.
  - This will be mutually exclusive with `freq`.
* Add a `strobe_period` field to hold the alternate sample period (for
  the sample window).
* Have all PMU drivers check and reject the `strobe_sample` flag by
  default; the swizzling of the period will be done in the PMU driver
  itself where it makes sense to support this feature for a given PMU.
  - Do you think this is sensible, or would it be better handled in
    core?

Any obvious issues with this approach?

>
>
> > Cursory testing on a Xeon(R) W-2145 sampling every 300 cycles
> > (without
> > the patch) suggests this approach would work for some counters.
> > Calculating branch miss rates for example appears to be correct,
> > likewise UOPS_EXECUTED.THREAD seems to give something like a
> > sensible
> > cycles-per-uop value. On the other hand the fixed function
> > instructions
> > counter does not appear to sample correctly (it seems to report
> > either
> > very small or very large numbers). No idea whats going on there, so
> > any
> > insight welcome...
>
> If you use precise samples with 3p there is a restriction on the
> periods
> that is enforced by the kernel. Non precise or single/double p should
> support arbitrary, except that any p is always period + 1.

Is there some default value for precise? When testing I didn't set any
specific value for the p modifier.


>
> One drawback of the technique on x86 is that it won't allow multi
> record
> pebs (collecting samples without interrupts), so the overhead might
> be intrinsically higher.
>
> -Andi


Sure, I think this kind of detail is why I was thinking the strobing
feature should be handled by the PMU driver rather than by core, since
there may be other considerations / better ways to collect the metric
samples.

Regards
Ben

2024-02-15 13:00:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

On Wed, Feb 14, 2024 at 07:13:50PM +0000, Ben Gainey wrote:
>
> Nice, I wasn't aware of this feature. I'll have a play...

You have to use an old perf version for now, still need to fix it.
>
>
> >
> > Of course the feature is somewhat dubious in that it will have a very
> > strong systematic bias against short functions and even long
> > functions
> > in some alternating execution patterns. I assume you did some
> > experiments to characterize this. It would be important
> > to emphasize this in any documentation.
>
> The way I have been thinking about this is that for each sample you
> always maintain a periodic sample count so that the relative ranking of
> functions is maintained, and that the "same as previous" check is a way
> to enhance the attributability of the PMU data for any given sample.
>
> But it is absolutely correct to say that this will bias the
> availability of PMU data in the way you have described. The bias
> depends on sample window size, workload characteristics and so on.

I would be more comfortable with it if you added some randomization
on the window sizes. That would limit bias and worst case sampling
error.
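
Something as simple as the following would already help; this is
entirely illustrative (xorshift64 is just a stand-in PRNG and the
+/- 25% jitter range is an arbitrary example, not a recommendation):

```c
#include <stdint.h>

/* Minimal xorshift PRNG, standing in for whatever randomness source
 * the kernel side would actually use. State must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	return *state = x;
}

/* Return the base window size jittered uniformly within
 * [0.75 * base, 1.25 * base), to break up systematic alignment
 * between the sample window and the workload's execution pattern. */
static uint64_t jittered_window(uint64_t base_window, uint64_t *prng)
{
	uint64_t span = base_window / 2;       /* total jitter range */
	uint64_t lo = base_window - span / 2;  /* 75% of base */

	return lo + (span ? xorshift64(prng) % span : 0);
}
```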

> It should be possible to provide a per metric "valid sample" count that
> can be used to judge the "quality" of the metrics for each symbol,
> which may allow the user to make some adjustments to the recording
> parameters (modify sample period, or sample window size, for example).

Even that would be misleading because it assumes that the IP stayed
in the same function between the two samples. But you could have
something like

F1 sample
F2
F1 sample

and if you're unlucky this could happen systematically. The
randomization would fight it somewhat, but even there you might
be very very unlucky.

The only sure way to judge it really is to run branch trace in parallel and see
if it is correct.

Also there is of course the problem that on a modern core the
reordering window might well be larger than your sample window, so any notion
of things happening inside a short window is quite fuzzy.

> >
> > I don't see anything ARM specific with the technique, so if it's done
> > it should be done generically IMHO
>
>
> Great. When I was originally thinking about the implementation of the
> event strobing feature I was thinking:
>
>  * Add `strobe_sample` flag bit to opt into the feature
>    - This will be mutually exclusive with `freq`.
>  * Add `strobe_period` field to hold the alternate sample period (for
>    the sample window).
>  * Have all PMU drivers check and reject the `strobe_sample` flag by
>    default; the swizzling of the period will be done in the PMU driver
>    itself if it makes sense to support this feature for a given PMU.
>    - Do you think this is sensible, or would be better handled in core?

I would have a common function in core that is called from the PMU
drivers, similar to how the adaptive period is done today.

> > > the patch) suggests this approach would work for some counters.
> > > Calculating branch miss rates for example appears to be correct,
> > > likewise UOPS_EXECUTED.THREAD seems to give something like a
> > > sensible
> > > cycles-per-uop value. On the other hand the fixed function
> > > instructions
> > > counter does not appear to sample correctly (it seems to report
> > > either
> > > very small or very large numbers). No idea what's going on there, so
> > > any
> > > insight welcome...
> >
> > If you use precise samples with 3p there is a restriction on the
> > periods
> > that is enforced by the kernel. Non precise or single/double p should
> > support arbitrary, except that any p is always period + 1.
>
> Is there some default value for precise? When testing I didn't set any
> specific value for the p modifier.

In some cases the perf tool tries to use the highest, e.g. if you don't
specify anything.

If you used :p (and not :P) it should have taken the number you
specified. Single :p is normally not useful because it samples
the -1 IP, normal use is either two or three p. Three p is more
precise but also has some restrictions on the sampling period
and the counters that can be used.

-Andi

2024-03-10 13:00:36

by Ben Gainey

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

On Wed, 2024-02-14 at 23:08 -0800, Andi Kleen wrote:
> On Wed, Feb 14, 2024 at 07:13:50PM +0000, Ben Gainey wrote:
> >
> > Nice, I wasn't aware of this feature. I'll have a play...
>
> You have to use an old perf version for now, still need to fix it.
> >
> > >
> > > Of course the feature is somewhat dubious in that it will have a
> > > very strong systematic bias against short functions and even long
> > > functions in some alternating execution patterns. I assume you did
> > > some experiments to characterize this. It would be important
> > > to emphasize this in any documentation.
> >
> > The way I have been thinking about this is that for each sample you
> > always maintain a periodic sample count so that the relative ranking
> > of functions is maintained, and that the "same as previous" check is
> > a way to enhance the attributability of the PMU data for any given
> > sample.
> >
> > But it is absolutely correct to say that this will bias the
> > availability of PMU data in the way you have described. The bias
> > depends on sample window size, workload characteristics and so on.
>
> I would be more comfortable with it if you added some randomization
> on the window sizes. That would limit bias and worst case sampling
> error.

Agreed. I think for the Arm PMU maybe there is not the same "precise"
behaviour, but it's certainly something that should be possible to
implement.


>
> > It should be possible to provide a per metric "valid sample" count
> > that can be used to judge the "quality" of the metrics for each
> > symbol, which may allow the user to make some adjustments to the
> > recording parameters (modify sample period, or sample window size,
> > for example).
>
> Even that would be misleading because it assumes that the IP stayed
> in the same function between the two samples. But you could have
> something like
>
> F1  sample
> F2
> F1  sample
>
> and if you're unlucky this could happen systematically. The
> randomization would fight it somewhat, but even there you might
> be very very unlucky.
>
> The only sure way to judge it really is to run branch trace in
> parallel and see if it is correct.

True. I've been looking at using the function return PMU counter to
identify this case and filtering accordingly. Needs a bit more work but
seems promising.


>
> Also there is of course the problem that on a modern core the
> reordering window might well be larger than your sample window, so
> any notion of things happening inside a short window is quite fuzzy.


True. In the absence of an alternative hardware mechanism though, I'm
aiming for representative metrics rather than precise metrics. That is
to say, metrics attributable to functions with a sufficient degree of
accuracy to be meaningful/actionable. In this context the fuzziness is
acceptable.
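
To make that concrete, the userspace post-processing I have in mind is
roughly the following: every sample bumps a periodic hit count for its
symbol (preserving ranking), but counter deltas and the "valid sample"
count only accumulate when consecutive samples land in the same symbol.
A sketch, with all names illustrative and the fixed-size table purely
for brevity:

```c
#include <stdint.h>
#include <string.h>

#define MAX_SYMS 64

/* Per-symbol accumulation state for one counter/metric. */
struct sym_stats {
	const char *name;
	uint64_t hits;          /* periodic sample count (ranking)     */
	uint64_t valid_windows; /* windows attributed to this symbol   */
	uint64_t counter_total; /* counter deltas from valid windows   */
};

struct accumulator {
	struct sym_stats syms[MAX_SYMS];
	int nsyms;
	const char *prev_sym; /* symbol of the previous sample */
};

static struct sym_stats *find_sym(struct accumulator *a, const char *name)
{
	for (int i = 0; i < a->nsyms; i++)
		if (strcmp(a->syms[i].name, name) == 0)
			return &a->syms[i];

	struct sym_stats *s = &a->syms[a->nsyms++];
	s->name = name;
	s->hits = s->valid_windows = s->counter_total = 0;
	return s;
}

static void accumulate_sample(struct accumulator *a, const char *sym,
			      uint64_t counter_delta)
{
	struct sym_stats *s = find_sym(a, sym);

	s->hits++; /* ranking is maintained regardless */

	/* Only attribute the delta when the previous sample landed in
	 * the same symbol, i.e. the window plausibly stayed in one
	 * function; otherwise the delta is discarded. */
	if (a->prev_sym && strcmp(a->prev_sym, sym) == 0) {
		s->valid_windows++;
		s->counter_total += counter_delta;
	}
	a->prev_sym = sym;
}
```

The valid_windows/hits ratio is then the per-symbol "quality" signal
mentioned above; as you point out it cannot catch the F1/F2/F1 case
without additional filtering.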


>
> > >
> > > I don't see anything ARM specific with the technique, so if it's
> > > done it should be done generically IMHO
> >
> >
> > Great. When I was originally thinking about the implementation of
> > the event strobing feature I was thinking:
> >
> >  * Add `strobe_sample` flag bit to opt into the feature
> >    - This will be mutually exclusive with `freq`.
> >  * Add `strobe_period` field to hold the alternate sample period
> >    (for the sample window).
> >  * Have all PMU drivers check and reject the `strobe_sample` flag
> >    by default; the swizzling of the period will be done in the PMU
> >    driver itself if it makes sense to support this feature for a
> >    given PMU.
> >    - Do you think this is sensible, or would be better handled in
> >      core?
>
> I would have a common function in core that is called from the PMU
> drivers, similar to how the adaptive period is done today.

Ack, thanks.

Regards
Ben