2020-03-13 23:11:59

by Kim Phillips

[permalink] [raw]
Subject: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

In order to better accommodate the upcoming Family 19h support,
given the 80-char line limit, move the existing code into a new
l3_thread_slice_mask function.

Signed-off-by: Kim Phillips <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
v2: split into two parts, this one being the mechanical
move based on Boris' review comments:

https://lkml.org/lkml/2020/3/12/581

arch/x86/events/amd/uncore.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index a6ea07f2aa84..dcdac001431e 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -180,6 +180,20 @@ static void amd_uncore_del(struct perf_event *event, int flags)
hwc->idx = -1;
}

+/*
+ * Convert logical cpu number to L3 PMC Config ThreadMask format
+ */
+static u64 l3_thread_slice_mask(int cpu)
+{
+ int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
+
+ if (smp_num_siblings > 1)
+ thread += cpu_data(cpu).apicid & 1;
+
+ return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
+ AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
+}
+
static int amd_uncore_event_init(struct perf_event *event)
{
struct amd_uncore *uncore;
@@ -209,15 +223,8 @@ static int amd_uncore_event_init(struct perf_event *event)
* SliceMask and ThreadMask need to be set for certain L3 events in
* Family 17h. For other events, the two fields do not affect the count.
*/
- if (l3_mask && is_llc_event(event)) {
- int thread = 2 * (cpu_data(event->cpu).cpu_core_id % 4);
-
- if (smp_num_siblings > 1)
- thread += cpu_data(event->cpu).apicid & 1;
-
- hwc->config |= (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
- AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
- }
+ if (l3_mask && is_llc_event(event))
+ hwc->config |= l3_thread_slice_mask(event->cpu);

uncore = event_to_amd_uncore(event);
if (!uncore)
--
2.25.1


2020-03-13 23:11:59

by Kim Phillips

[permalink] [raw]
Subject: [PATCH 2/3 v2] perf/amd/uncore: Make L3 thread mask code more readable

Convert the l3_thread_slice_mask function to use the more
readable topology_* helper functions, more intuitive
variable names like shift and thread_mask, and BIT_ULL.

Signed-off-by: Kim Phillips <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
v2: new this series based on splitting into two parts,
this one being the one with the non-move related changes,
based on Boris' review comments:

https://lkml.org/lkml/2020/3/12/581

arch/x86/events/amd/uncore.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index dcdac001431e..b622e59ccdd0 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -185,13 +185,16 @@ static void amd_uncore_del(struct perf_event *event, int flags)
*/
static u64 l3_thread_slice_mask(int cpu)
{
- int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
+ unsigned int shift, thread = 0;
+ u64 thread_mask, core = topology_core_id(cpu);

- if (smp_num_siblings > 1)
- thread += cpu_data(cpu).apicid & 1;
+ if (topology_smt_supported() && !topology_is_primary_thread(cpu))
+ thread = 1;

- return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
- AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
+ shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread;
+ thread_mask = BIT_ULL(shift);
+
+ return AMD64_L3_SLICE_MASK | thread_mask;
}

static int amd_uncore_event_init(struct perf_event *event)
--
2.25.1

2020-03-13 23:12:45

by Kim Phillips

[permalink] [raw]
Subject: [PATCH 3/3 v2] perf/amd/uncore: Add support for Family 19h L3 PMU

Family 19h introduces change in slice, core and thread specification
in its L3 Performance Event Select (ChL3PmcCfg) h/w register. The
change is incompatible with Family 17h's version of the register.

Introduce a new path in l3_thread_slice_mask() do things differently
for Family 19h vs. Family 17h, otherwise the new hardware doesn't
get programmed correctly.

Instead of a linear core--thread bitmask, Family 19h takes
an encoded core number, and a separate thread mask. There are
new bits that are set for all cores and all slices, of which
only the latter is used, since the driver counts events
for all slices on behalf of the specified cpu.

Also update amd_uncore_init() to base its L2/NB vs. L3/Data Fabric
mode decision based on Family 17h or above, not just 17h and 18h:
the Family 19h Data Fabric PMC is compatible with the Family 17h
DF PMC.

Signed-off-by: Kim Phillips <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
v2: rewrote commit text to not use "We" etc.,
based on Boris' comments:

https://lkml.org/lkml/2020/3/12/583

arch/x86/events/amd/uncore.c | 20 ++++++++++++++------
arch/x86/include/asm/perf_event.h | 15 +++++++++++++--
2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index b622e59ccdd0..f3d5e4e2f285 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -191,10 +191,18 @@ static u64 l3_thread_slice_mask(int cpu)
if (topology_smt_supported() && !topology_is_primary_thread(cpu))
thread = 1;

- shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread;
+ if (boot_cpu_data.x86 <= 0x18) {
+ shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread;
+ thread_mask = BIT_ULL(shift);
+
+ return AMD64_L3_SLICE_MASK | thread_mask;
+ }
+
+ core = (core << AMD64_L3_COREID_SHIFT) & AMD64_L3_COREID_MASK;
+ shift = AMD64_L3_THREAD_SHIFT + thread;
thread_mask = BIT_ULL(shift);

- return AMD64_L3_SLICE_MASK | thread_mask;
+ return AMD64_L3_EN_ALL_SLICES | core | thread_mask;
}

static int amd_uncore_event_init(struct perf_event *event)
@@ -223,8 +231,8 @@ static int amd_uncore_event_init(struct perf_event *event)
return -EINVAL;

/*
- * SliceMask and ThreadMask need to be set for certain L3 events in
- * Family 17h. For other events, the two fields do not affect the count.
+ * SliceMask and ThreadMask need to be set for certain L3 events.
+ * For other events, the two fields do not affect the count.
*/
if (l3_mask && is_llc_event(event))
hwc->config |= l3_thread_slice_mask(event->cpu);
@@ -533,9 +541,9 @@ static int __init amd_uncore_init(void)
if (!boot_cpu_has(X86_FEATURE_TOPOEXT))
return -ENODEV;

- if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) {
+ if (boot_cpu_data.x86 >= 0x17) {
/*
- * For F17h or F18h, the Northbridge counters are
+ * For F17h and above, the Northbridge counters are
* repurposed as Data Fabric counters. Also, L3
* counters are supported too. The PMUs are exported
* based on family as either L2 or L3 and NB or DF.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 29964b0e1075..e855e9cf2c37 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -50,11 +50,22 @@

#define AMD64_L3_SLICE_SHIFT 48
#define AMD64_L3_SLICE_MASK \
- ((0xFULL) << AMD64_L3_SLICE_SHIFT)
+ (0xFULL << AMD64_L3_SLICE_SHIFT)
+#define AMD64_L3_SLICEID_MASK \
+ (0x7ULL << AMD64_L3_SLICE_SHIFT)

#define AMD64_L3_THREAD_SHIFT 56
#define AMD64_L3_THREAD_MASK \
- ((0xFFULL) << AMD64_L3_THREAD_SHIFT)
+ (0xFFULL << AMD64_L3_THREAD_SHIFT)
+#define AMD64_L3_F19H_THREAD_MASK \
+ (0x3ULL << AMD64_L3_THREAD_SHIFT)
+
+#define AMD64_L3_EN_ALL_CORES BIT_ULL(47)
+#define AMD64_L3_EN_ALL_SLICES BIT_ULL(46)
+
+#define AMD64_L3_COREID_SHIFT 42
+#define AMD64_L3_COREID_MASK \
+ (0x7ULL << AMD64_L3_COREID_SHIFT)

#define X86_RAW_EVENT_MASK \
(ARCH_PERFMON_EVENTSEL_EVENT | \
--
2.25.1

Subject: [tip: perf/core] perf/amd/uncore: Add support for Family 19h L3 PMU

The following commit has been merged into the perf/core branch of tip:

Commit-ID: e48667b865480d8bf0f1171a8b474ffc785b9ace
Gitweb: https://git.kernel.org/tip/e48667b865480d8bf0f1171a8b474ffc785b9ace
Author: Kim Phillips <[email protected]>
AuthorDate: Fri, 13 Mar 2020 18:10:24 -05:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Tue, 17 Mar 2020 13:01:03 +01:00

perf/amd/uncore: Add support for Family 19h L3 PMU

Family 19h introduces change in slice, core and thread specification in
its L3 Performance Event Select (ChL3PmcCfg) h/w register. The change is
incompatible with Family 17h's version of the register.

Introduce a new path in l3_thread_slice_mask() to do things differently
for Family 19h vs. Family 17h, otherwise the new hardware doesn't get
programmed correctly.

Instead of a linear core--thread bitmask, Family 19h takes an encoded
core number, and a separate thread mask. There are new bits that are set
for all cores and all slices, of which only the latter is used, since
the driver counts events for all slices on behalf of the specified CPU.

Also update amd_uncore_init() to base its L2/NB vs. L3/Data Fabric mode
decision based on Family 17h or above, not just 17h and 18h: the Family
19h Data Fabric PMC is compatible with the Family 17h DF PMC.

[ bp: Touchups. ]

Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/amd/uncore.c | 20 ++++++++++++++------
arch/x86/include/asm/perf_event.h | 15 +++++++++++++--
2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 07af497..46018e5 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -191,10 +191,18 @@ static u64 l3_thread_slice_mask(int cpu)
if (topology_smt_supported() && !topology_is_primary_thread(cpu))
thread = 1;

- shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread;
+ if (boot_cpu_data.x86 <= 0x18) {
+ shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread;
+ thread_mask = BIT_ULL(shift);
+
+ return AMD64_L3_SLICE_MASK | thread_mask;
+ }
+
+ core = (core << AMD64_L3_COREID_SHIFT) & AMD64_L3_COREID_MASK;
+ shift = AMD64_L3_THREAD_SHIFT + thread;
thread_mask = BIT_ULL(shift);

- return AMD64_L3_SLICE_MASK | thread_mask;
+ return AMD64_L3_EN_ALL_SLICES | core | thread_mask;
}

static int amd_uncore_event_init(struct perf_event *event)
@@ -223,8 +231,8 @@ static int amd_uncore_event_init(struct perf_event *event)
return -EINVAL;

/*
- * SliceMask and ThreadMask need to be set for certain L3 events in
- * Family 17h. For other events, the two fields do not affect the count.
+ * SliceMask and ThreadMask need to be set for certain L3 events.
+ * For other events, the two fields do not affect the count.
*/
if (l3_mask && is_llc_event(event))
hwc->config |= l3_thread_slice_mask(event->cpu);
@@ -533,9 +541,9 @@ static int __init amd_uncore_init(void)
if (!boot_cpu_has(X86_FEATURE_TOPOEXT))
return -ENODEV;

- if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) {
+ if (boot_cpu_data.x86 >= 0x17) {
/*
- * For F17h or F18h, the Northbridge counters are
+ * For F17h and above, the Northbridge counters are
* repurposed as Data Fabric counters. Also, L3
* counters are supported too. The PMUs are exported
* based on family as either L2 or L3 and NB or DF.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 29964b0..e855e9c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -50,11 +50,22 @@

#define AMD64_L3_SLICE_SHIFT 48
#define AMD64_L3_SLICE_MASK \
- ((0xFULL) << AMD64_L3_SLICE_SHIFT)
+ (0xFULL << AMD64_L3_SLICE_SHIFT)
+#define AMD64_L3_SLICEID_MASK \
+ (0x7ULL << AMD64_L3_SLICE_SHIFT)

#define AMD64_L3_THREAD_SHIFT 56
#define AMD64_L3_THREAD_MASK \
- ((0xFFULL) << AMD64_L3_THREAD_SHIFT)
+ (0xFFULL << AMD64_L3_THREAD_SHIFT)
+#define AMD64_L3_F19H_THREAD_MASK \
+ (0x3ULL << AMD64_L3_THREAD_SHIFT)
+
+#define AMD64_L3_EN_ALL_CORES BIT_ULL(47)
+#define AMD64_L3_EN_ALL_SLICES BIT_ULL(46)
+
+#define AMD64_L3_COREID_SHIFT 42
+#define AMD64_L3_COREID_MASK \
+ (0x7ULL << AMD64_L3_COREID_SHIFT)

#define X86_RAW_EVENT_MASK \
(ARCH_PERFMON_EVENTSEL_EVENT | \

Subject: [tip: perf/core] perf/amd/uncore: Make L3 thread mask code more readable

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 9689dbbeaea884d19e3085439c6a247ef986b2af
Gitweb: https://git.kernel.org/tip/9689dbbeaea884d19e3085439c6a247ef986b2af
Author: Kim Phillips <[email protected]>
AuthorDate: Fri, 13 Mar 2020 18:10:23 -05:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Tue, 17 Mar 2020 13:00:49 +01:00

perf/amd/uncore: Make L3 thread mask code more readable

Convert the l3_thread_slice_mask() function to use the more readable
topology_* helper functions, more intuitive variable names like shift
and thread_mask, and BIT_ULL().

No functional changes.

Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/amd/uncore.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 2abcb1a..07af497 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -185,13 +185,16 @@ static void amd_uncore_del(struct perf_event *event, int flags)
*/
static u64 l3_thread_slice_mask(int cpu)
{
- int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
+ u64 thread_mask, core = topology_core_id(cpu);
+ unsigned int shift, thread = 0;

- if (smp_num_siblings > 1)
- thread += cpu_data(cpu).apicid & 1;
+ if (topology_smt_supported() && !topology_is_primary_thread(cpu))
+ thread = 1;

- return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
- AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
+ shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread;
+ thread_mask = BIT_ULL(shift);
+
+ return AMD64_L3_SLICE_MASK | thread_mask;
}

static int amd_uncore_event_init(struct perf_event *event)

Subject: [tip: perf/core] perf/amd/uncore: Prepare L3 thread mask code for Family 19h

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 4dcc3df82573a946c620dda5fb00e27c7b080105
Gitweb: https://git.kernel.org/tip/4dcc3df82573a946c620dda5fb00e27c7b080105
Author: Kim Phillips <[email protected]>
AuthorDate: Fri, 13 Mar 2020 18:10:22 -05:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Tue, 17 Mar 2020 13:00:29 +01:00

perf/amd/uncore: Prepare L3 thread mask code for Family 19h

In order to better accommodate the upcoming Family 19h, given
the 80-char line limit, move the existing code into a new
l3_thread_slice_mask() function.

No functional changes.

[ bp: Touchups. ]

Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/amd/uncore.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index a6ea07f..2abcb1a 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -180,6 +180,20 @@ static void amd_uncore_del(struct perf_event *event, int flags)
hwc->idx = -1;
}

+/*
+ * Convert logical CPU number to L3 PMC Config ThreadMask format
+ */
+static u64 l3_thread_slice_mask(int cpu)
+{
+ int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
+
+ if (smp_num_siblings > 1)
+ thread += cpu_data(cpu).apicid & 1;
+
+ return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
+ AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
+}
+
static int amd_uncore_event_init(struct perf_event *event)
{
struct amd_uncore *uncore;
@@ -209,15 +223,8 @@ static int amd_uncore_event_init(struct perf_event *event)
* SliceMask and ThreadMask need to be set for certain L3 events in
* Family 17h. For other events, the two fields do not affect the count.
*/
- if (l3_mask && is_llc_event(event)) {
- int thread = 2 * (cpu_data(event->cpu).cpu_core_id % 4);
-
- if (smp_num_siblings > 1)
- thread += cpu_data(event->cpu).apicid & 1;
-
- hwc->config |= (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
- AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
- }
+ if (l3_mask && is_llc_event(event))
+ hwc->config |= l3_thread_slice_mask(event->cpu);

uncore = event_to_amd_uncore(event);
if (!uncore)

2020-03-18 02:10:27

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

On Fri, Mar 13, 2020 at 4:10 PM Kim Phillips <[email protected]> wrote:
>
> In order to better accommodate the upcoming Family 19h support,
> given the 80-char line limit, move the existing code into a new
> l3_thread_slice_mask function.
>
> Signed-off-by: Kim Phillips <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Michael Petlan <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> v2: split into two parts, this one being the mechanical
> move based on Boris' review comments:
>
> https://lkml.org/lkml/2020/3/12/581
>
> arch/x86/events/amd/uncore.c | 25 ++++++++++++++++---------
> 1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
> index a6ea07f2aa84..dcdac001431e 100644
> --- a/arch/x86/events/amd/uncore.c
> +++ b/arch/x86/events/amd/uncore.c
> @@ -180,6 +180,20 @@ static void amd_uncore_del(struct perf_event *event, int flags)
> hwc->idx = -1;
> }
>
> +/*
> + * Convert logical cpu number to L3 PMC Config ThreadMask format
> + */
> +static u64 l3_thread_slice_mask(int cpu)
> +{
> + int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
> +
> + if (smp_num_siblings > 1)
> + thread += cpu_data(cpu).apicid & 1;
> +
> + return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
> + AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
> +}
> +
> static int amd_uncore_event_init(struct perf_event *event)
> {
> struct amd_uncore *uncore;
> @@ -209,15 +223,8 @@ static int amd_uncore_event_init(struct perf_event *event)
> * SliceMask and ThreadMask need to be set for certain L3 events in
> * Family 17h. For other events, the two fields do not affect the count.
> */
> - if (l3_mask && is_llc_event(event)) {
> - int thread = 2 * (cpu_data(event->cpu).cpu_core_id % 4);
> -
> - if (smp_num_siblings > 1)
> - thread += cpu_data(event->cpu).apicid & 1;
> -
> - hwc->config |= (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
> - AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
> - }
> + if (l3_mask && is_llc_event(event))
> + hwc->config |= l3_thread_slice_mask(event->cpu);
>
By looking at this code, I realized that even on Zen2 this is wrong.
It does not work well.
You are basically saying that the L3 event is tied to the CPU the
event is programmed to.
But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
shows, as it should, one CPU/CCX. So that means that when I do:

$ perf stat -a amd_l3/event=llc_event/

This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
That means that L3 events generated by the other CPUs on the CCX are
not monitored.
I can easily see the problem by pinning a memory bound program to
CPU64, for instance.

I think the thread mask should be exposed to the user. If not
specified, then set the mask to
cover all CPUs of the CCX. That way you can pick and choose what you
want. And with one event/CCX
you can monitor for all CPUs. I can send a patch that does that.

With what you have now, you have to force the list of CPUs with -C to
work around
the cpumask. And forcing the cpumask to 0-255 does not make sense because not
all L3 events necessarily need the L3 mask, so you don't want to program them on
all CPUs especially with 8 cpus/CCX and only 6 counters.


On Fri, Mar 13, 2020 at 4:10 PM Kim Phillips <[email protected]> wrote:
>
> In order to better accommodate the upcoming Family 19h support,
> given the 80-char line limit, move the existing code into a new
> l3_thread_slice_mask function.
>
> Signed-off-by: Kim Phillips <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Michael Petlan <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> v2: split into two parts, this one being the mechanical
> move based on Boris' review comments:
>
> https://lkml.org/lkml/2020/3/12/581
>
> arch/x86/events/amd/uncore.c | 25 ++++++++++++++++---------
> 1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
> index a6ea07f2aa84..dcdac001431e 100644
> --- a/arch/x86/events/amd/uncore.c
> +++ b/arch/x86/events/amd/uncore.c
> @@ -180,6 +180,20 @@ static void amd_uncore_del(struct perf_event *event, int flags)
> hwc->idx = -1;
> }
>
> +/*
> + * Convert logical cpu number to L3 PMC Config ThreadMask format
> + */
> +static u64 l3_thread_slice_mask(int cpu)
> +{
> + int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
> +
> + if (smp_num_siblings > 1)
> + thread += cpu_data(cpu).apicid & 1;
> +
> + return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
> + AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
> +}
> +
> static int amd_uncore_event_init(struct perf_event *event)
> {
> struct amd_uncore *uncore;
> @@ -209,15 +223,8 @@ static int amd_uncore_event_init(struct perf_event *event)
> * SliceMask and ThreadMask need to be set for certain L3 events in
> * Family 17h. For other events, the two fields do not affect the count.
> */
> - if (l3_mask && is_llc_event(event)) {
> - int thread = 2 * (cpu_data(event->cpu).cpu_core_id % 4);
> -
> - if (smp_num_siblings > 1)
> - thread += cpu_data(event->cpu).apicid & 1;
> -
> - hwc->config |= (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
> - AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
> - }
> + if (l3_mask && is_llc_event(event))
> + hwc->config |= l3_thread_slice_mask(event->cpu);
>
> uncore = event_to_amd_uncore(event);
> if (!uncore)
> --
> 2.25.1
>

2020-03-18 14:49:33

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

On 3/17/20 9:09 PM, Stephane Eranian wrote:
> On Fri, Mar 13, 2020 at 4:10 PM Kim Phillips <[email protected]> wrote:
>> +++ b/arch/x86/events/amd/uncore.c
>> @@ -180,6 +180,20 @@ static void amd_uncore_del(struct perf_event *event, int flags)
>> hwc->idx = -1;
>> }
>>
>> +/*
>> + * Convert logical cpu number to L3 PMC Config ThreadMask format
>> + */
>> +static u64 l3_thread_slice_mask(int cpu)
>> +{
>> + int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
>> +
>> + if (smp_num_siblings > 1)
>> + thread += cpu_data(cpu).apicid & 1;
>> +
>> + return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
>> + AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
>> +}
>> +
>> static int amd_uncore_event_init(struct perf_event *event)
>> {
>> struct amd_uncore *uncore;
>> @@ -209,15 +223,8 @@ static int amd_uncore_event_init(struct perf_event *event)
>> * SliceMask and ThreadMask need to be set for certain L3 events in
>> * Family 17h. For other events, the two fields do not affect the count.
>> */
>> - if (l3_mask && is_llc_event(event)) {
>> - int thread = 2 * (cpu_data(event->cpu).cpu_core_id % 4);
>> -
>> - if (smp_num_siblings > 1)
>> - thread += cpu_data(event->cpu).apicid & 1;
>> -
>> - hwc->config |= (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
>> - AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
>> - }
>> + if (l3_mask && is_llc_event(event))
>> + hwc->config |= l3_thread_slice_mask(event->cpu);
>>
> By looking at this code, I realized that even on Zen2 this is wrong.
> It does not work well.
> You are basically saying that the L3 event is tied to the CPU the
> event is programmed to.
> But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
> shows, as it should, one CPU/CCX. So that means that when I do:
>
> $ perf stat -a amd_l3/event=llc_event/
>
> This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
> That means that L3 events generated by the other CPUs on the CCX are
> not monitored.
> I can easily see the problem by pinning a memory bound program to
> CPU64, for instance.

Right, the higher level code calls the driver with a single cpu==0
call if the perf tool is invoked with a simple -a style system-wide.
If the tool is invoked with supplemental switches to -a, like -C 0-255,
and -A, the driver gets called multiple times with all the unique cpu
values. The latter is the expected invocation style when measuring
a benchmark pinned on a subset of cpus, i.e., when evaluating
the driver, and is the more deterministic behaviour for the driver
to have, given it cannot tell the difference otherwise.

> I think the thread mask should be exposed to the user. If not
> specified, then set the mask to
> cover all CPUs of the CCX. That way you can pick and choose what you
> want. And with one event/CCX
> you can monitor for all CPUs. I can send a patch that does that.

Do you mean something that will allow the user to do something
like this?:

perf stat -a amd_l3/event=llc_event,core=X,thread_mask={1,2,3}/

Wouldn't users rather specify cpus using -C etc.?

> With what you have now, you have to force the list of CPUs with -C to
> work around
> the cpumask. And forcing the cpumask to 0-255 does not make sense because not
> all L3 events necessarily need the L3 mask, so you don't want to program them on
> all CPUs especially with 8 cpus/CCX and only 6 counters.

Is it not possible for those to be run in separate invocations
that use the simple system-wide case, e.g., -a?

How would adding core=X,thread_mask={1,2,3} specification
change the -C invocation behaviour?

I thought of having the driver set all CPUs in the threadmask
if invoked with a cpu == 0, but that means one cannot specify
-C 0,4,8, etc.

Thanks,

Kim

2020-03-18 20:43:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

On Wed, Mar 18, 2020 at 09:46:41AM -0500, Kim Phillips wrote:

> > But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
> > shows, as it should, one CPU/CCX. So that means that when I do:
> >
> > $ perf stat -a amd_l3/event=llc_event/
> >
> > This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
> > That means that L3 events generated by the other CPUs on the CCX are
> > not monitored.
> > I can easily see the problem by pinning a memory bound program to
> > CPU64, for instance.
>
> Right, the higher level code calls the driver with a single cpu==0
> call if the perf tool is invoked with a simple -a style system-wide.
> If the tool is invoked with supplemental switches to -a, like -C 0-255,
> and -A, the driver gets called multiple times with all the unique cpu
> values. The latter is the expected invocation style when measuring
> a benchmark pinned on a subset of cpus, i.e., when evaluating
> the driver, and is the more deterministic behaviour for the driver
> to have, given it cannot tell the difference otherwise.

That seems to suggest it is all horribly broken.

2020-03-18 21:42:13

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

On Wed, Mar 18, 2020 at 1:43 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Mar 18, 2020 at 09:46:41AM -0500, Kim Phillips wrote:
>
> > > But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
> > > shows, as it should, one CPU/CCX. So that means that when I do:
> > >
> > > $ perf stat -a amd_l3/event=llc_event/
> > >
> > > This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
> > > That means that L3 events generated by the other CPUs on the CCX are
> > > not monitored.
> > > I can easily see the problem by pinning a memory bound program to
> > > CPU64, for instance.
> >
> > Right, the higher level code calls the driver with a single cpu==0
> > call if the perf tool is invoked with a simple -a style system-wide.

No, it does not.

With -a, when -C is not passed, the perf tool picks up the cpumask for
the PMU from sysfs:
$ cat /proc/sys/devices/amd_l3/cpumask

You can easily verify this by running: strace -etrace=perf_event_open
perf stat -a -e amd_l3/event=0x00/.
This is the default common mode.

The problem is that here to get any meaningful result, you need to force a -C.
The CPU in the cpumask is just the CPU to which to attach the event in
order to access the correct uncore PMU.
Here, you have one CPU per CCX which is expected and perfectly fine.

The thread_mask is a hardware filter on the uncore L3 PMU. If you set
by default the thread_mask to 0xff, then
you obtain a full system view with a simple -a, or per socket with
--per-socket. So we need to find a way to
make this common case work properly first. Expecting the users to know
that for some amd_l3 events you need
to force -C 0-255 is not practical. I also think that forcing the
cpumask to 0-255 is not right solution. This is not how
this is done for any other uncore PMU I know of and some do have the
thread filter, such as the Skylake CHA.



> > If the tool is invoked with supplemental switches to -a, like -C 0-255,
> > and -A, the driver gets called multiple times with all the unique cpu
> > values. The latter is the expected invocation style when measuring
> > a benchmark pinned on a subset of cpus, i.e., when evaluating
> > the driver, and is the more deterministic behaviour for the driver
> > to have, given it cannot tell the difference otherwise.
>
> That seems to suggest it is all horribly broken.

2020-03-23 20:51:49

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support



On 3/18/20 4:26 PM, Stephane Eranian wrote:
> On Wed, Mar 18, 2020 at 1:43 PM Peter Zijlstra <[email protected]> wrote:
>>
>> On Wed, Mar 18, 2020 at 09:46:41AM -0500, Kim Phillips wrote:
>>
>>>> But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
>>>> shows, as it should, one CPU/CCX. So that means that when I do:
>>>>
>>>> $ perf stat -a amd_l3/event=llc_event/
>>>>
>>>> This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
>>>> That means that L3 events generated by the other CPUs on the CCX are
>>>> not monitored.
>>>> I can easily see the problem by pinning a memory bound program to
>>>> CPU64, for instance.
>>>
>>> Right, the higher level code calls the driver with a single cpu==0
>>> call if the perf tool is invoked with a simple -a style system-wide.
>
> No, it does not.
>
> With -a, when -C is not passed, the perf tool picks up the cpumask for
> the PMU from sysfs:
> $ cat /proc/sys/devices/amd_l3/cpumask
>
> You can easily verify this by running: strace -etrace=perf_event_open
> perf stat -a -e amd_l3/event=0x00/.
> This is the default common mode.

What I meant was that with -a, the driver only gets called with the
'base' cpu for each L3 PMU domain, i.e., 0, 4, 8, and so on. With -C, it
gets called with all the CPUs the user specifies: these are different
behaviours, and the driver can't tell the difference between e.g., -a
or -C 0,4,8, etc.

> The problem is that here to get any meaningful result, you need to force a -C.
> The CPU in the cpumask is just the CPU to which to attach the event in
> order to access the correct uncore PMU.
> Here, you have one CPU per CCX which is expected and perfectly fine.
>
> The thread_mask is a hardware filter on the uncore L3 PMU. If you set
> by default the thread_mask to 0xff, then
> you obtain a full system view with a simple -a, or per socket with
> --per-socket. So we need to find a way to
> make this common case work properly first. Expecting the users to know

OK, I'll send a patch to revert the thread filter feature until the above
issue is addressed.

> that for some amd_l3 events you need
> to force -C 0-255 is not practical. I also think that forcing the
> cpumask to 0-255 is not right solution. This is not how
> this is done for any other uncore PMU I know of and some do have the
> thread filter, such as the Skylake CHA.

Odd, the Intel uncore driver's cpumask is 0, so not sure if AMD's
is right to set it any more...

Thanks,

Kim

>>> If the tool is invoked with supplemental switches to -a, like -C 0-255,
>>> and -A, the driver gets called multiple times with all the unique cpu
>>> values. The latter is the expected invocation style when measuring
>>> a benchmark pinned on a subset of cpus, i.e., when evaluating
>>> the driver, and is the more deterministic behaviour for the driver
>>> to have, given it cannot tell the difference otherwise.
>>
>> That seems to suggest it is all horribly broken.