2022-08-31 18:15:14

by Ian Rogers

Subject: [PATCH v2 4/7] perf topology: Add core_wide

It is possible to optimize metrics when all SMT threads (CPUs) on a
core are measuring events in system wide mode. For example, the TMA
metrics define CORE_CLKS for Sandybridge as:

if SMT is disabled:
  CPU_CLK_UNHALTED.THREAD
if SMT is enabled and recording on all SMT threads:
  CPU_CLK_UNHALTED.THREAD_ANY / 2
if SMT is enabled and not recording on all SMT threads:
  (CPU_CLK_UNHALTED.THREAD / 2) *
  (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK)

That is, two more events are necessary when not gathering counts on
all SMT threads. To distinguish recording on all SMT threads of a core
from system wide recording (all CPUs), call the new property core
wide. Add a core wide test that determines the property from the user
requested CPUs, the topology and system wide mode. System wide mode is
required as other processes running on an SMT thread will change the
counts.

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/util/cputopo.c | 46 +++++++++++++++++++++++++++++++++++++++
tools/perf/util/cputopo.h |  3 +++
tools/perf/util/smt.c     | 14 ++++++++++++
tools/perf/util/smt.h     |  7 ++++++
4 files changed, 70 insertions(+)
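
(An aside, not part of the patch: a minimal sketch of how a caller
might consume the new helper, assuming a struct target with the usual
system_wide and cpu_list fields - the surrounding names are
illustrative only.)

	struct cpu_topology *topology = cpu_topology__new();
	/* Remember whether counts will be core wide so that metric
	 * expressions can later pick the cheaper THREAD_ANY encoding. */
	bool is_core_wide = core_wide(target->system_wide,
				      target->cpu_list, topology);

	cpu_topology__delete(topology);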

diff --git a/tools/perf/util/cputopo.c b/tools/perf/util/cputopo.c
index 511002e52714..1a3ff6449158 100644
--- a/tools/perf/util/cputopo.c
+++ b/tools/perf/util/cputopo.c
@@ -172,6 +172,52 @@ bool cpu_topology__smt_on(const struct cpu_topology *topology)
 	return false;
 }

+bool cpu_topology__core_wide(const struct cpu_topology *topology,
+			     const char *user_requested_cpu_list)
+{
+	struct perf_cpu_map *user_requested_cpus;
+
+	/*
+	 * If user_requested_cpu_list is empty then all CPUs are recorded and so
+	 * core_wide is true.
+	 */
+	if (!user_requested_cpu_list)
+		return true;
+
+	user_requested_cpus = perf_cpu_map__new(user_requested_cpu_list);
+	/* Check that each core has either all or none of its SMT threads in the user requested CPUs. */
+	for (u32 i = 0; i < topology->core_cpus_lists; i++) {
+		const char *core_cpu_list = topology->core_cpus_list[i];
+		struct perf_cpu_map *core_cpus = perf_cpu_map__new(core_cpu_list);
+		struct perf_cpu cpu;
+		int idx;
+		bool has_first, first = true;
+
+		perf_cpu_map__for_each_cpu(cpu, idx, core_cpus) {
+			if (first) {
+				has_first = perf_cpu_map__has(user_requested_cpus, cpu);
+				first = false;
+			} else {
+				/*
+				 * If the first core CPU is user requested then
+				 * all subsequent CPUs in the core must be user
+				 * requested too. If the first CPU isn't user
+				 * requested then none of the others must be
+				 * too.
+				 */
+				if (perf_cpu_map__has(user_requested_cpus, cpu) != has_first) {
+					perf_cpu_map__put(core_cpus);
+					perf_cpu_map__put(user_requested_cpus);
+					return false;
+				}
+			}
+		}
+		perf_cpu_map__put(core_cpus);
+	}
+	perf_cpu_map__put(user_requested_cpus);
+	return true;
+}
+
static bool has_die_topology(void)
{
char filename[MAXPATHLEN];
diff --git a/tools/perf/util/cputopo.h b/tools/perf/util/cputopo.h
index 469db775a13c..969e5920a00e 100644
--- a/tools/perf/util/cputopo.h
+++ b/tools/perf/util/cputopo.h
@@ -60,6 +60,9 @@ struct cpu_topology *cpu_topology__new(void);
void cpu_topology__delete(struct cpu_topology *tp);
/* Determine from the core list whether SMT was enabled. */
bool cpu_topology__smt_on(const struct cpu_topology *topology);
+/* Are the sets of SMT siblings all enabled or all disabled in user_requested_cpus? */
+bool cpu_topology__core_wide(const struct cpu_topology *topology,
+			     const char *user_requested_cpu_list);

struct numa_topology *numa_topology__new(void);
void numa_topology__delete(struct numa_topology *tp);
diff --git a/tools/perf/util/smt.c b/tools/perf/util/smt.c
index ce90c4ee4138..994e9e418227 100644
--- a/tools/perf/util/smt.c
+++ b/tools/perf/util/smt.c
@@ -21,3 +21,17 @@ bool smt_on(const struct cpu_topology *topology)
 	cached = true;
 	return cached_result;
 }
+
+bool core_wide(bool system_wide, const char *user_requested_cpu_list,
+	       const struct cpu_topology *topology)
+{
+	/* If not everything running on a core is being recorded then we can't use core_wide. */
+	if (!system_wide)
+		return false;
+
+	/* Cheap case that SMT is disabled and therefore we're inherently core_wide. */
+	if (!smt_on(topology))
+		return true;
+
+	return cpu_topology__core_wide(topology, user_requested_cpu_list);
+}
diff --git a/tools/perf/util/smt.h b/tools/perf/util/smt.h
index e26999c6b8d4..ae9095f2c38c 100644
--- a/tools/perf/util/smt.h
+++ b/tools/perf/util/smt.h
@@ -7,4 +7,11 @@ struct cpu_topology;
/* Returns true if SMT (aka hyperthreading) is enabled. */
bool smt_on(const struct cpu_topology *topology);

+/*
+ * Returns true when system wide and all SMT threads for a core are in the
+ * user_requested_cpus map.
+ */
+bool core_wide(bool system_wide, const char *user_requested_cpu_list,
+	       const struct cpu_topology *topology);
+
#endif /* __SMT_H */
--
2.37.2.672.g94769d06f0-goog


2022-09-02 07:16:47

by Namhyung Kim

Subject: Re: [PATCH v2 4/7] perf topology: Add core_wide

Hi Ian,

On Wed, Aug 31, 2022 at 10:50 AM Ian Rogers <[email protected]> wrote:
>
> It is possible to optimize metrics when all SMT threads (CPUs) on a
> core are measuring events in system wide mode. For example, the TMA
> metrics define CORE_CLKS for Sandybridge as:
>
> if SMT is disabled:
>   CPU_CLK_UNHALTED.THREAD
> if SMT is enabled and recording on all SMT threads:
>   CPU_CLK_UNHALTED.THREAD_ANY / 2
> if SMT is enabled and not recording on all SMT threads:
>   (CPU_CLK_UNHALTED.THREAD / 2) *
>   (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK)
>
> That is, two more events are necessary when not gathering counts on
> all SMT threads. To distinguish recording on all SMT threads of a core
> from system wide recording (all CPUs), call the new property core
> wide. Add a core wide test that determines the property from the user
> requested CPUs, the topology and system wide mode. System wide mode is
> required as other processes running on an SMT thread will change the
> counts.
>
> Signed-off-by: Ian Rogers <[email protected]>
> ---
[SNIP]
> diff --git a/tools/perf/util/smt.c b/tools/perf/util/smt.c
> index ce90c4ee4138..994e9e418227 100644
> --- a/tools/perf/util/smt.c
> +++ b/tools/perf/util/smt.c
> @@ -21,3 +21,17 @@ bool smt_on(const struct cpu_topology *topology)
> cached = true;
> return cached_result;
> }
> +
> +bool core_wide(bool system_wide, const char *user_requested_cpu_list,
> + const struct cpu_topology *topology)
> +{
> + /* If not everything running on a core is being recorded then we can't use core_wide. */
> + if (!system_wide)
> + return false;

I'm not sure if it's correct. Wouldn't it be OK if it runs on all
threads in a core even if system wide is off? I thought that's what
the code below checks.

In fact I thought of the opposite logic, like:

if (system_wide)
        return true;

But it seems the code allows having a cpu_list in system wide mode.
Then it also needs to check that the user requested cpus are NULL.
(IMHO system_wide should be cleared when it has a cpu list...)

if (system_wide && !user_requested_cpu_list)
        return true;

And if we have a target pointer, we could add this too.

if (!target__has_cpu(target))
        return false;
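
Putting those pieces together, the variant I had in mind would be
roughly like this (just a sketch; the target argument and the check
ordering are assumptions):

bool core_wide(bool system_wide, const char *user_requested_cpu_list,
	       struct target *target,
	       const struct cpu_topology *topology)
{
	/* Counting every CPU: every core is complete. */
	if (system_wide && !user_requested_cpu_list)
		return true;

	/* No user requested CPUs means per-thread mode, not core wide. */
	if (!target__has_cpu(target))
		return false;

	/* Otherwise check whether the requested CPUs cover complete cores. */
	return cpu_topology__core_wide(topology, user_requested_cpu_list);
}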

Thanks,
Namhyung


> +
> + /* Cheap case that SMT is disabled and therefore we're inherently core_wide. */
> + if (!smt_on(topology))
> + return true;
> +
> + return cpu_topology__core_wide(topology, user_requested_cpu_list);
> +}
> diff --git a/tools/perf/util/smt.h b/tools/perf/util/smt.h
> index e26999c6b8d4..ae9095f2c38c 100644
> --- a/tools/perf/util/smt.h
> +++ b/tools/perf/util/smt.h
> @@ -7,4 +7,11 @@ struct cpu_topology;
> /* Returns true if SMT (aka hyperthreading) is enabled. */
> bool smt_on(const struct cpu_topology *topology);
>
> +/*
> + * Returns true when system wide and all SMT threads for a core are in the
> + * user_requested_cpus map.
> + */
> +bool core_wide(bool system_wide, const char *user_requested_cpu_list,
> + const struct cpu_topology *topology);
> +
> #endif /* __SMT_H */
> --
> 2.37.2.672.g94769d06f0-goog
>

2022-09-06 19:03:47

by Ian Rogers

Subject: Re: [PATCH v2 4/7] perf topology: Add core_wide

On Thu, Sep 1, 2022 at 11:29 PM Namhyung Kim <[email protected]> wrote:
>
> Hi Ian,
>
> On Wed, Aug 31, 2022 at 10:50 AM Ian Rogers <[email protected]> wrote:
> >
> > It is possible to optimize metrics when all SMT threads (CPUs) on a
> > core are measuring events in system wide mode. For example, the TMA
> > metrics define CORE_CLKS for Sandybridge as:
> >
> > if SMT is disabled:
> >   CPU_CLK_UNHALTED.THREAD
> > if SMT is enabled and recording on all SMT threads:
> >   CPU_CLK_UNHALTED.THREAD_ANY / 2
> > if SMT is enabled and not recording on all SMT threads:
> >   (CPU_CLK_UNHALTED.THREAD / 2) *
> >   (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK)
> >
> > That is, two more events are necessary when not gathering counts on
> > all SMT threads. To distinguish recording on all SMT threads of a core
> > from system wide recording (all CPUs), call the new property core
> > wide. Add a core wide test that determines the property from the user
> > requested CPUs, the topology and system wide mode. System wide mode is
> > required as other processes running on an SMT thread will change the
> > counts.
> >
> > Signed-off-by: Ian Rogers <[email protected]>
> > ---
> [SNIP]
> > diff --git a/tools/perf/util/smt.c b/tools/perf/util/smt.c
> > index ce90c4ee4138..994e9e418227 100644
> > --- a/tools/perf/util/smt.c
> > +++ b/tools/perf/util/smt.c
> > @@ -21,3 +21,17 @@ bool smt_on(const struct cpu_topology *topology)
> > cached = true;
> > return cached_result;
> > }
> > +
> > +bool core_wide(bool system_wide, const char *user_requested_cpu_list,
> > + const struct cpu_topology *topology)
> > +{
> > + /* If not everything running on a core is being recorded then we can't use core_wide. */
> > + if (!system_wide)
> > + return false;
>
> I'm not sure if it's correct. Wouldn't it be OK if it runs on all
> threads in a core even if system wide is off? I thought that's what
> the code below checks.
>
> In fact I thought of the opposite logic, like:
>
> if (system_wide)
>         return true;
>
> But it seems the code allows having a cpu_list in system wide mode.
> Then it also needs to check that the user requested cpus are NULL.
> (IMHO system_wide should be cleared when it has a cpu list...)
>
> if (system_wide && !user_requested_cpu_list)
>         return true;
>
> And if we have a target pointer, we could add this too.
>
> if (!target__has_cpu(target))
>         return false;
>
> Thanks,
> Namhyung

Thanks Namhyung,

It may or may not be correct without the system wide flag, and so I
want to be conservative and say we need the flag. Let me provide more
detail with an example:

I'm going to assume we want to measure how many slots there were for
an SMT thread. In the "Count Domain" of TMA metrics, before Icelake
slots is a core wide metric, meaning that you are measuring >1
hyperthread. As we are measuring >1 hyperthread and we don't control
the scheduling of the other hyperthread, the only valid mode is system
wide on all CPUs, or all CPUs on a core (aka core wide in this patch
set). However, the slots count domain information also says "Slots -
ICL onwards or EBS_Mode before ICL", meaning the count domain is valid
for a single SMT thread. If I look at SkylakeX, currently we have two
metrics:

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json?h=perf/core#n167

{
    "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
    "MetricExpr": "4 * CPU_CLK_UNHALTED.THREAD",
    "MetricGroup": "TmaL1",
    "MetricName": "SLOTS"
},
{
    "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
    "MetricExpr": "4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )",
    "MetricGroup": "TmaL1_SMT",
    "MetricName": "SLOTS_SMT"
},

The SLOTS metric computes a valid value if hyperthreading is
disabled. In other metrics you will see
(CPU_CLK_UNHALTED.THREAD_ANY / 2), where THREAD_ANY and THREAD are
the same event encoding. The reason for the divide by 2 is to adjust
the count down so that when we add the other hyperthread's count we
get the core wide value in the aggregation - the perf event on the
2nd hyperthread isn't necessary, but that's a separate problem.
SLOTS_SMT is funny, so let's dig into it:
- The 4 is the pipeline width.
- CPU_CLK_UNHALTED.THREAD is defined as "Counts the number of core
  cycles while the thread is not in a halt state."
- The divide by 2 is because of the 2 hyperthreads.
- The 1 is because we assume that half the slots are going to be
  given to our hyperthread.
- CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK
  gives a ratio for how many cycles our SMT thread was the only one
  running, and so we got those slots from the other hyperthread.
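
As a toy example with made-up counts: if CPU_CLK_UNHALTED.THREAD is
8,000,000, CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE is 1,000,000 and
CPU_CLK_UNHALTED.REF_XCLK is 4,000,000, then SLOTS_SMT is
4 * (8,000,000 / 2) * (1 + 1,000,000/4,000,000) = 4 * 4,000,000 * 1.25
= 20,000,000.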

With #core_wide we can have a single SLOTS metric (newlines and //
comments added for clarity):

// If we're #core_wide with SMT, and we know the work on both
// hyperthreads is being measured for all processes, then use a single
// counter.
(4 * CPU_CLK_UNHALTED.THREAD_ANY / 2
 if #core_wide else
 // We're not core wide, so we need thread counters to adjust the slots
 // metric based on the time this SMT thread is active.
 4 * ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 +
 CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK )
) if #smt_on
// With no hyperthreading we can just use the core wide counter.
else 4 * CPU_CLK_UNHALTED.THREAD

As this metric is a foundation for many others, there is a great deal
of use for it, and the current pre-Icelake computation is generally
only valid if you happen to have specified a core wide setup on the
command line - which is error prone.
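
For example, assuming a hypothetical topology where CPUs 0 and 4 are
the two SMT siblings of core 0, "perf stat -a -C 0,4" would be core
wide while "perf stat -a -C 0" would not, and only the latter needs
the extra ONE_THREAD_ACTIVE and REF_XCLK events.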

Thanks,
Ian



> > +
> > + /* Cheap case that SMT is disabled and therefore we're inherently core_wide. */
> > + if (!smt_on(topology))
> > + return true;
> > +
> > + return cpu_topology__core_wide(topology, user_requested_cpu_list);
> > +}
> > diff --git a/tools/perf/util/smt.h b/tools/perf/util/smt.h
> > index e26999c6b8d4..ae9095f2c38c 100644
> > --- a/tools/perf/util/smt.h
> > +++ b/tools/perf/util/smt.h
> > @@ -7,4 +7,11 @@ struct cpu_topology;
> > /* Returns true if SMT (aka hyperthreading) is enabled. */
> > bool smt_on(const struct cpu_topology *topology);
> >
> > +/*
> > + * Returns true when system wide and all SMT threads for a core are in the
> > + * user_requested_cpus map.
> > + */
> > +bool core_wide(bool system_wide, const char *user_requested_cpu_list,
> > + const struct cpu_topology *topology);
> > +
> > #endif /* __SMT_H */
> > --
> > 2.37.2.672.g94769d06f0-goog
> >

2022-09-09 18:59:48

by Namhyung Kim

Subject: Re: [PATCH v2 4/7] perf topology: Add core_wide

On Tue, Sep 6, 2022 at 11:56 AM Ian Rogers <[email protected]> wrote:
>
> On Thu, Sep 1, 2022 at 11:29 PM Namhyung Kim <[email protected]> wrote:
> >
> > Hi Ian,
> >
> > On Wed, Aug 31, 2022 at 10:50 AM Ian Rogers <[email protected]> wrote:
> > >
> > > It is possible to optimize metrics when all SMT threads (CPUs) on a
> > > core are measuring events in system wide mode. For example, the TMA
> > > metrics define CORE_CLKS for Sandybridge as:
> > >
> > > if SMT is disabled:
> > >   CPU_CLK_UNHALTED.THREAD
> > > if SMT is enabled and recording on all SMT threads:
> > >   CPU_CLK_UNHALTED.THREAD_ANY / 2
> > > if SMT is enabled and not recording on all SMT threads:
> > >   (CPU_CLK_UNHALTED.THREAD / 2) *
> > >   (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK)
> > >
> > > That is, two more events are necessary when not gathering counts on
> > > all SMT threads. To distinguish recording on all SMT threads of a core
> > > from system wide recording (all CPUs), call the new property core
> > > wide. Add a core wide test that determines the property from the user
> > > requested CPUs, the topology and system wide mode. System wide mode is
> > > required as other processes running on an SMT thread will change the
> > > counts.
> > >
> > > Signed-off-by: Ian Rogers <[email protected]>
> > > ---
> > [SNIP]
> > > diff --git a/tools/perf/util/smt.c b/tools/perf/util/smt.c
> > > index ce90c4ee4138..994e9e418227 100644
> > > --- a/tools/perf/util/smt.c
> > > +++ b/tools/perf/util/smt.c
> > > @@ -21,3 +21,17 @@ bool smt_on(const struct cpu_topology *topology)
> > > cached = true;
> > > return cached_result;
> > > }
> > > +
> > > +bool core_wide(bool system_wide, const char *user_requested_cpu_list,
> > > + const struct cpu_topology *topology)
> > > +{
> > > + /* If not everything running on a core is being recorded then we can't use core_wide. */
> > > + if (!system_wide)
> > > + return false;
> >
> > I'm not sure if it's correct. Wouldn't it be OK if it runs on all
> > threads in a core even if system wide is off? I thought that's what
> > the code below checks.
> >
> > In fact I thought of the opposite logic, like:
> >
> > if (system_wide)
> >         return true;
> >
> > But it seems the code allows having a cpu_list in system wide mode.
> > Then it also needs to check that the user requested cpus are NULL.
> > (IMHO system_wide should be cleared when it has a cpu list...)
> >
> > if (system_wide && !user_requested_cpu_list)
> >         return true;
> >
> > And if we have a target pointer, we could add this too.
> >
> > if (!target__has_cpu(target))
> >         return false;
> >
> > Thanks,
> > Namhyung
>
> Thanks Namhyung,
>
> It may or may not be correct without the system wide flag, and so I
> want to be conservative and say we need the flag. Let me provide more
> detail with an example:

Ok.

>
> I'm going to assume we want to measure how many slots there were for
> an SMT thread. In the "Count Domain" of TMA metrics, before Icelake
> slots is a core wide metric, meaning that you are measuring >1
> hyperthread. As we are measuring >1 hyperthread and we don't control
> the scheduling of the other hyperthread, the only valid mode is system
> wide on all CPUs, or all CPUs on a core (aka core wide in this patch
> set). However, the slots count domain information also says "Slots -
> ICL onwards or EBS_Mode before ICL", meaning the count domain is valid
> for a single SMT thread. If I look at SkylakeX, currently we have two
> metrics:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json?h=perf/core#n167
>
> {
>     "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
>     "MetricExpr": "4 * CPU_CLK_UNHALTED.THREAD",
>     "MetricGroup": "TmaL1",
>     "MetricName": "SLOTS"
> },
> {
>     "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
>     "MetricExpr": "4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )",
>     "MetricGroup": "TmaL1_SMT",
>     "MetricName": "SLOTS_SMT"
> },
>
> The SLOTS metric computes a valid value if hyperthreading is
> disabled. In other metrics you will see
> (CPU_CLK_UNHALTED.THREAD_ANY / 2), where THREAD_ANY and THREAD are
> the same event encoding. The reason for the divide by 2 is to adjust
> the count down so that when we add the other hyperthread's count we
> get the core wide value in the aggregation - the perf event on the
> 2nd hyperthread isn't necessary, but that's a separate problem.
> SLOTS_SMT is funny, so let's dig into it:
> - The 4 is the pipeline width.
> - CPU_CLK_UNHALTED.THREAD is defined as "Counts the number of core
>   cycles while the thread is not in a halt state."
> - The divide by 2 is because of the 2 hyperthreads.
> - The 1 is because we assume that half the slots are going to be
>   given to our hyperthread.
> - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK
>   gives a ratio for how many cycles our SMT thread was the only one
>   running, and so we got those slots from the other hyperthread.
>
> With #core_wide we can have a single SLOTS metric (newlines and //
> comments added for clarity):
>
> // If we're #core_wide with SMT, and we know the work on both
> // hyperthreads is being measured for all processes, then use a single
> // counter.
> (4 * CPU_CLK_UNHALTED.THREAD_ANY / 2
>  if #core_wide else
>  // We're not core wide, so we need thread counters to adjust the slots
>  // metric based on the time this SMT thread is active.
>  4 * ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 +
>  CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK )
> ) if #smt_on
> // With no hyperthreading we can just use the core wide counter.
> else 4 * CPU_CLK_UNHALTED.THREAD
>
> As this metric is a foundation for many others, there is a great deal
> of use for it, and the current pre-Icelake computation is generally
> only valid if you happen to have specified a core wide setup on the
> command line - which is error prone.

Thanks for the explanation. I understand why we need core_wide
for accurate metrics. My concern was just to have a clear
relationship between system-wide mode and user requested cpu list.

But I think that should be a separate thread; let me work on the cpu
map cleanups as we discussed earlier in:

https://lore.kernel.org/r/[email protected]/

Thanks,
Namhyung