LinuxLists.cc - [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-04 04:53:25

Subject: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

From: "Gautham R. Shenoy" <[email protected]>

On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".

This patch adds support for detecting this, and when present, populate
the populating the cpu_l2_cache_mask of every CPU to the core-siblings
which share L2 with the CPU as specified in the by the
"ibm,thread-groups" property array.

On a platform with the following "ibm,thread-group" configuration
00000001 00000002 00000004 00000000
00000002 00000004 00000006 00000001
00000003 00000005 00000007 00000002
00000002 00000004 00000000 00000002
00000004 00000006 00000001 00000003
00000005 00000007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
00000000 00000002 00000004 00000006
00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/kernel/smp.c | 66 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a242a3..a116d2d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,6 +76,7 @@
struct task_struct *secondary_current;
bool has_big_cores;
bool coregroup_enabled;
+bool thread_group_shares_l2;

DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -99,6 +100,7 @@ enum {

#define MAX_THREAD_LIST_SIZE 8
#define THREAD_GROUP_SHARE_L1 1
+#define THREAD_GROUP_SHARE_L2 2
struct thread_groups {
unsigned int property;
unsigned int nr_groups;
@@ -107,7 +109,7 @@ struct thread_groups {
};

/* Maximum number of properties that groups of threads within a core can share */
-#define MAX_THREAD_GROUP_PROPERTIES 1
+#define MAX_THREAD_GROUP_PROPERTIES 2

struct thread_groups_list {
unsigned int nr_properties;
@@ -121,6 +123,13 @@ struct thread_groups_list {
*/
DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);

+/*
+ * On some big-cores system, thread_group_l2_cache_map for each CPU
+ * corresponds to the set its siblings within the core that share the
+ * L2-cache.
+ */
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
/* SMP operations for this machine */
struct smp_ops_t *smp_ops;

@@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
*
* ibm,thread-groups[i + 0] tells us the property based on which the
* threads are being grouped together. If this value is 1, it implies
- * that the threads in the same group share L1, translation cache.
+ * that the threads in the same group share L1, translation cache. If
+ * the value is 2, it implies that the threads in the same group share
+ * the same L2 cache.
*
* ibm,thread-groups[i+1] tells us how many such thread groups exist for the
* property ibm,thread-groups[i]
@@ -745,10 +756,10 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
* 12}.
*
* b) there are 2 groups of 4 threads each, where each group of
- * threads share some property indicated by the first value 2. The
- * "ibm,ppc-interrupt-server#s" of the first group is {5,7,9,11}
- * and the "ibm,ppc-interrupt-server#s" of the second group is
- * {6,8,10,12} structure
+ * threads share some property indicated by the first value 2 (L2
+ * cache). The "ibm,ppc-interrupt-server#s" of the first group is
+ * {5,7,9,11} and the "ibm,ppc-interrupt-server#s" of the second
+ * group is {6,8,10,12} structure
*
* Returns 0 on success, -EINVAL if the property does not exist,
* -ENODATA if property does not have a value, and -EOVERFLOW if the
@@ -840,7 +851,8 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
if (!dn)
return -ENODATA;

- if (!(cache_property == THREAD_GROUP_SHARE_L1))
+ if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
+ cache_property == THREAD_GROUP_SHARE_L2))
return -EINVAL;

if (!cpu_tgl->nr_properties) {
@@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
goto out;
}

- mask = &per_cpu(cpu_l1_cache_map, cpu);
+ if (cache_property == THREAD_GROUP_SHARE_L1)
+ mask = &per_cpu(cpu_l1_cache_map, cpu);
+ else if (cache_property == THREAD_GROUP_SHARE_L2)
+ mask = &per_cpu(thread_group_l2_cache_map, cpu);

zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));

@@ -973,6 +988,16 @@ static int init_big_cores(void)
}

has_big_cores = true;
+
+ for_each_possible_cpu(cpu) {
+ int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
+
+ if (err)
+ return err;
+ }
+
+ thread_group_shares_l2 = true;
+ pr_info("Thread-groups in a core share L2-cache\n");
return 0;
}

@@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
if (has_big_cores)
submask_fn = cpu_smallcore_mask;

+
+ /*
+ * If the threads in a thread-group share L2 cache, then then
+ * the L2-mask can be obtained from thread_group_l2_cache_map.
+ */
+ if (thread_group_shares_l2) {
+ /* Siblings that share L1 is a subset of siblings that share L2.*/
+ or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
+ if (*mask) {
+ cpumask_andnot(*mask,
+ per_cpu(thread_group_l2_cache_map, cpu),
+ cpu_l2_cache_mask(cpu));
+ } else {
+ mask = &per_cpu(thread_group_l2_cache_map, cpu);
+ }
+
+ for_each_cpu(i, *mask) {
+ if (!cpu_online(i))
+ continue;
+ set_cpus_related(i, cpu, cpu_l2_cache_mask);
+ }
+
+ return true;
+ }
+
l2_cache = cpu_to_l2cache(cpu);
if (!l2_cache || !*mask) {
/* Assume only core siblings share cache with this CPU */
--
1.9.4

2020-12-07 12:44:42

by Srikar Dronamraju

[permalink] [raw]

Subject: Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

* Gautham R. Shenoy <[email protected]> [2020-12-04 10:18:46]:

> From: "Gautham R. Shenoy" <[email protected]>
>
> On POWER systems, groups of threads within a core sharing the L2-cache
> can be indicated by the "ibm,thread-groups" property array with the
> identifier "2".
>
> This patch adds support for detecting this, and when present, populate
> the populating the cpu_l2_cache_mask of every CPU to the core-siblings
> which share L2 with the CPU as specified in the by the
> "ibm,thread-groups" property array.
>
> On a platform with the following "ibm,thread-group" configuration
> 00000001 00000002 00000004 00000000
> 00000002 00000004 00000006 00000001
> 00000003 00000005 00000007 00000002
> 00000002 00000004 00000000 00000002
> 00000004 00000006 00000001 00000003
> 00000005 00000007
>
> Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
> CPU0 attaching sched-domain(s):
> domain-0: span=0,2,4,6 level=SMT
> domain-1: span=0-7 level=CACHE
> domain-2: span=0-15,24-39,48-55 level=MC
> domain-3: span=0-55 level=DIE
>
> CPU1 attaching sched-domain(s):
> domain-0: span=1,3,5,7 level=SMT
> domain-1: span=0-7 level=CACHE
> domain-2: span=0-15,24-39,48-55 level=MC
> domain-3: span=0-55 level=DIE
>
> The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
> sub-array
> [00000002 00000002 00000004
> 00000000 00000002 00000004 00000006
> 00000001 00000003 00000005 00000007]
> indicates that L2 (Property "2") is shared only between the threads of a single
> group. There are "2" groups of threads where each group contains "4"
> threads each. The groups being {0,2,4,6} and {1,3,5,7}.
>
> With this patch, the sched-domain hierarchy for CPUs 0,1 would be
> CPU0 attaching sched-domain(s):
> domain-0: span=0,2,4,6 level=SMT
> domain-1: span=0-15,24-39,48-55 level=MC
> domain-2: span=0-55 level=DIE
>
> CPU1 attaching sched-domain(s):
> domain-0: span=1,3,5,7 level=SMT
> domain-1: span=0-15,24-39,48-55 level=MC
> domain-2: span=0-55 level=DIE
>
> The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
> resp.) gets degenerated into the SMT domain. Furthermore, the
> last-level-cache domain gets correctly set to the SMT sched-domain.
>
> Signed-off-by: Gautham R. Shenoy <[email protected]>
> ---
> arch/powerpc/kernel/smp.c | 66 +++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 6a242a3..a116d2d 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -76,6 +76,7 @@
> struct task_struct *secondary_current;
> bool has_big_cores;
> bool coregroup_enabled;
> +bool thread_group_shares_l2;

Either keep this as static in this patch or add its declaration

>
> DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
> DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> @@ -99,6 +100,7 @@ enum {
>
> #define MAX_THREAD_LIST_SIZE 8
> #define THREAD_GROUP_SHARE_L1 1
> +#define THREAD_GROUP_SHARE_L2 2
> struct thread_groups {
> unsigned int property;
> unsigned int nr_groups;
> @@ -107,7 +109,7 @@ struct thread_groups {
> };
>
> /* Maximum number of properties that groups of threads within a core can share */
> -#define MAX_THREAD_GROUP_PROPERTIES 1
> +#define MAX_THREAD_GROUP_PROPERTIES 2
>
> struct thread_groups_list {
> unsigned int nr_properties;
> @@ -121,6 +123,13 @@ struct thread_groups_list {
> */
> DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
>
> +/*
> + * On some big-cores system, thread_group_l2_cache_map for each CPU
> + * corresponds to the set its siblings within the core that share the
> + * L2-cache.
> + */
> +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
> +

NIT:
We are trying to confuse ourselves with the names.
For L1 we have cpu_l2_cache_map to store the tasks from the thread group.
but cpu_smallcore_map for keeping track of tasks.

For L2 we have thread_group_l2_cache_map to store the tasks from the thread
group. but cpu_l2_cache_map for keeping track of tasks.

I think we should do some renaming to keep the names consistent.
I would say probably say move the current cpu_l2_cache_map to
cpu_llc_cache_map and move the new aka thread_group_l2_cache_map as
cpu_l2_cache_map to be somewhat consistent.

> /* SMP operations for this machine */
> struct smp_ops_t *smp_ops;
>
> @@ -840,7 +851,8 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> if (!dn)
> return -ENODATA;
>
> - if (!(cache_property == THREAD_GROUP_SHARE_L1))
> + if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
> + cache_property == THREAD_GROUP_SHARE_L2))
> return -EINVAL;
>
> if (!cpu_tgl->nr_properties) {
> @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> goto out;
> }
>
> - mask = &per_cpu(cpu_l1_cache_map, cpu);
> + if (cache_property == THREAD_GROUP_SHARE_L1)
> + mask = &per_cpu(cpu_l1_cache_map, cpu);
> + else if (cache_property == THREAD_GROUP_SHARE_L2)
> + mask = &per_cpu(thread_group_l2_cache_map, cpu);
>
> zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
>
> @@ -973,6 +988,16 @@ static int init_big_cores(void)
> }
>
> has_big_cores = true;
> +
> + for_each_possible_cpu(cpu) {
> + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> +
> + if (err)
> + return err;
> + }
> +
> + thread_group_shares_l2 = true;

Why do we need a separate loop. Why cant we merge this in the above loop
itself?

> + pr_info("Thread-groups in a core share L2-cache\n");

Can this be moved to a pr_debug? Does it help any regular user/admins to
know if thread-groups shared l2 cache. Infact it may confuse users on what
thread groups are and which thread groups dont share cache.
I would prefer some other name than thread_group_shares_l2 but dont know any
better alternatives and may be my choices are even worse.

> return 0;
> }
>
> @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
> if (has_big_cores)
> submask_fn = cpu_smallcore_mask;
>
> +

NIT: extra blank line?

> + /*
> + * If the threads in a thread-group share L2 cache, then then
> + * the L2-mask can be obtained from thread_group_l2_cache_map.
> + */
> + if (thread_group_shares_l2) {
> + /* Siblings that share L1 is a subset of siblings that share L2.*/
> + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
> + if (*mask) {
> + cpumask_andnot(*mask,
> + per_cpu(thread_group_l2_cache_map, cpu),
> + cpu_l2_cache_mask(cpu));
> + } else {
> + mask = &per_cpu(thread_group_l2_cache_map, cpu);
> + }
> +
> + for_each_cpu(i, *mask) {
> + if (!cpu_online(i))
> + continue;
> + set_cpus_related(i, cpu, cpu_l2_cache_mask);
> + }
> +
> + return true;
> + }
> +

Ah this can be simplified to:
if (thread_group_shares_l2) {
cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));

for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
if (cpu_online(i))
set_cpus_related(i, cpu, cpu_l2_cache_mask);
}
}

No?

> l2_cache = cpu_to_l2cache(cpu);
> if (!l2_cache || !*mask) {
> /* Assume only core siblings share cache with this CPU */

--
Thanks and Regards
Srikar Dronamraju

2020-12-08 17:47:16

by Gautham R Shenoy

[permalink] [raw]

Subject: Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

Hello Srikar,

On Mon, Dec 07, 2020 at 06:10:39PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy <[email protected]> [2020-12-04 10:18:46]:
>
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > On POWER systems, groups of threads within a core sharing the L2-cache
> > can be indicated by the "ibm,thread-groups" property array with the
> > identifier "2".
> >
> > This patch adds support for detecting this, and when present, populate
> > the populating the cpu_l2_cache_mask of every CPU to the core-siblings
> > which share L2 with the CPU as specified in the by the
> > "ibm,thread-groups" property array.
> >
> > On a platform with the following "ibm,thread-group" configuration
> > 00000001 00000002 00000004 00000000
> > 00000002 00000004 00000006 00000001
> > 00000003 00000005 00000007 00000002
> > 00000002 00000004 00000000 00000002
> > 00000004 00000006 00000001 00000003
> > 00000005 00000007
> >
> > Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
> > CPU0 attaching sched-domain(s):
> > domain-0: span=0,2,4,6 level=SMT
> > domain-1: span=0-7 level=CACHE
> > domain-2: span=0-15,24-39,48-55 level=MC
> > domain-3: span=0-55 level=DIE
> >
> > CPU1 attaching sched-domain(s):
> > domain-0: span=1,3,5,7 level=SMT
> > domain-1: span=0-7 level=CACHE
> > domain-2: span=0-15,24-39,48-55 level=MC
> > domain-3: span=0-55 level=DIE
> >
> > The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
> > sub-array
> > [00000002 00000002 00000004
> > 00000000 00000002 00000004 00000006
> > 00000001 00000003 00000005 00000007]
> > indicates that L2 (Property "2") is shared only between the threads of a single
> > group. There are "2" groups of threads where each group contains "4"
> > threads each. The groups being {0,2,4,6} and {1,3,5,7}.
> >
> > With this patch, the sched-domain hierarchy for CPUs 0,1 would be
> > CPU0 attaching sched-domain(s):
> > domain-0: span=0,2,4,6 level=SMT
> > domain-1: span=0-15,24-39,48-55 level=MC
> > domain-2: span=0-55 level=DIE
> >
> > CPU1 attaching sched-domain(s):
> > domain-0: span=1,3,5,7 level=SMT
> > domain-1: span=0-15,24-39,48-55 level=MC
> > domain-2: span=0-55 level=DIE
> >
> > The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
> > resp.) gets degenerated into the SMT domain. Furthermore, the
> > last-level-cache domain gets correctly set to the SMT sched-domain.
> >
> > Signed-off-by: Gautham R. Shenoy <[email protected]>
> > ---
> > arch/powerpc/kernel/smp.c | 66 +++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 58 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 6a242a3..a116d2d 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -76,6 +76,7 @@
> > struct task_struct *secondary_current;
> > bool has_big_cores;
> > bool coregroup_enabled;
> > +bool thread_group_shares_l2;
>
> Either keep this as static in this patch or add its declaration
>

This will be used in Patch 3 in kernel/cacheinfo.c, but not any other
place. Hence I am not making it static here.

> >
> > DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
> > DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> > @@ -99,6 +100,7 @@ enum {
> >
> > #define MAX_THREAD_LIST_SIZE 8
> > #define THREAD_GROUP_SHARE_L1 1
> > +#define THREAD_GROUP_SHARE_L2 2
> > struct thread_groups {
> > unsigned int property;
> > unsigned int nr_groups;
> > @@ -107,7 +109,7 @@ struct thread_groups {
> > };
> >
> > /* Maximum number of properties that groups of threads within a core can share */
> > -#define MAX_THREAD_GROUP_PROPERTIES 1
> > +#define MAX_THREAD_GROUP_PROPERTIES 2
> >
> > struct thread_groups_list {
> > unsigned int nr_properties;
> > @@ -121,6 +123,13 @@ struct thread_groups_list {
> > */
> > DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
> >
> > +/*
> > + * On some big-cores system, thread_group_l2_cache_map for each CPU
> > + * corresponds to the set its siblings within the core that share the
> > + * L2-cache.
> > + */
> > +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
> > +
>
> NIT:
> We are trying to confuse ourselves with the names.
> For L1 we have cpu_l2_cache_map to store the tasks from the thread group.
> but cpu_smallcore_map for keeping track of tasks.
>

I suppose you mean cpu_l1_cache_map here. We are using
cpu_smallcore_map, because when the ibm,thread-groups-property=1, it
shares both L1 and the instruction data flow (SMT). We already have a
cpu_smt_map, hence, this was named cpu_smallcore_map a couple of years
ago when I wrote that patch.

> For L2 we have thread_group_l2_cache_map to store the tasks from the thread
> group. but cpu_l2_cache_map for keeping track of tasks.

>
> I think we should do some renaming to keep the names consistent.
> I would say probably say move the current cpu_l2_cache_map to
> cpu_llc_cache_map and move the new aka thread_group_l2_cache_map as
> cpu_l2_cache_map to be somewhat consistent.

Hmm.. cpu_llc_cache_map is still very generic. We want to have
something that defines l2 map.

I agree that we need to keep it consistent. How about renaming
cpu_l1_cache_map to thread_groups_l1_cache_map ?

That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
refer to the corresponding L1 and L2 siblings as discovered from
ibm,thread-groups property.

We have the cpu_smallcore_mask and the cpu_l2_cache_map unchanged as
it was before.

>
> > /* SMP operations for this machine */
> > struct smp_ops_t *smp_ops;
> >
> > @@ -840,7 +851,8 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> > if (!dn)
> > return -ENODATA;
> >
> > - if (!(cache_property == THREAD_GROUP_SHARE_L1))
> > + if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
> > + cache_property == THREAD_GROUP_SHARE_L2))
> > return -EINVAL;
> >
> > if (!cpu_tgl->nr_properties) {
> > @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> > goto out;
> > }
> >
> > - mask = &per_cpu(cpu_l1_cache_map, cpu);
> > + if (cache_property == THREAD_GROUP_SHARE_L1)
> > + mask = &per_cpu(cpu_l1_cache_map, cpu);
> > + else if (cache_property == THREAD_GROUP_SHARE_L2)
> > + mask = &per_cpu(thread_group_l2_cache_map, cpu);
> >
> > zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> >
> > @@ -973,6 +988,16 @@ static int init_big_cores(void)
> > }
> >
> > has_big_cores = true;
> > +
> > + for_each_possible_cpu(cpu) {
> > + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > +
> > + if (err)
> > + return err;
> > + }
> > +
> > + thread_group_shares_l2 = true;
>
> Why do we need a separate loop. Why cant we merge this in the above loop
> itself?

No, there are platforms where one THREAD_GROUP_SHARE_L1 exists while
THREAD_GROUP_SHARE_L2 doesn't exist. It becomes easier if these are
separately tracked. Also, what do we gain if we put this in the same
loop? It will be (nr_possible_cpus * 2 * invocations of
init_cpu_cache_map()) as opposed to 2 * (nr_possible_cpus *
invocations of init_cpu_cache_map()). Isn't it ?

>
> > + pr_info("Thread-groups in a core share L2-cache\n");
>
> Can this be moved to a pr_debug? Does it help any regular user/admins to
> know if thread-groups shared l2 cache. Infact it may confuse users on what
> thread groups are and which thread groups dont share cache.
> I would prefer some other name than thread_group_shares_l2 but dont know any
> better alternatives and may be my choices are even worse.

Would you be ok with "L2 cache shared by threads of the small core" ?

>
> > return 0;
> > }
> >
> > @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
> > if (has_big_cores)
> > submask_fn = cpu_smallcore_mask;
> >
> > +
>
> NIT: extra blank line?

Will remove this.
>
> > + /*
> > + * If the threads in a thread-group share L2 cache, then then
> > + * the L2-mask can be obtained from thread_group_l2_cache_map.
> > + */
> > + if (thread_group_shares_l2) {
> > + /* Siblings that share L1 is a subset of siblings that share L2.*/
> > + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
> > + if (*mask) {
> > + cpumask_andnot(*mask,
> > + per_cpu(thread_group_l2_cache_map, cpu),
> > + cpu_l2_cache_mask(cpu));
> > + } else {
> > + mask = &per_cpu(thread_group_l2_cache_map, cpu);
> > + }
> > +
> > + for_each_cpu(i, *mask) {
> > + if (!cpu_online(i))
> > + continue;
> > + set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > + }
> > +
> > + return true;
> > + }
> > +
>
> Ah this can be simplified to:
> if (thread_group_shares_l2) {
> cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
>
> for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
> if (cpu_online(i))
> set_cpus_related(i, cpu, cpu_l2_cache_mask);
> }

Don't we want to enforce that the siblings sharing L1 be a subset of
the siblings sharing L2 ? Or do you recommend putting in a check for
that somewhere ?

> }
>
> No?
>
> > l2_cache = cpu_to_l2cache(cpu);
> > if (!l2_cache || !*mask) {
> > /* Assume only core siblings share cache with this CPU */
>
> --
> Thanks and Regards
> Srikar Dronamraju

2020-12-09 23:04:09

by Srikar Dronamraju

[permalink] [raw]

Subject: Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

* Gautham R Shenoy <[email protected]> [2020-12-08 23:12:37]:

>
> > For L2 we have thread_group_l2_cache_map to store the tasks from the thread
> > group. but cpu_l2_cache_map for keeping track of tasks.
>
> >
> > I think we should do some renaming to keep the names consistent.
> > I would say probably say move the current cpu_l2_cache_map to
> > cpu_llc_cache_map and move the new aka thread_group_l2_cache_map as
> > cpu_l2_cache_map to be somewhat consistent.
>
> Hmm.. cpu_llc_cache_map is still very generic. We want to have
> something that defines l2 map.
>
> I agree that we need to keep it consistent. How about renaming
> cpu_l1_cache_map to thread_groups_l1_cache_map ?
>
> That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
> refer to the corresponding L1 and L2 siblings as discovered from
> ibm,thread-groups property.

I am fine with this.

> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > > +
> > > + if (err)
> > > + return err;
> > > + }
> > > +
> > > + thread_group_shares_l2 = true;
> >
> > Why do we need a separate loop. Why cant we merge this in the above loop
> > itself?
>
> No, there are platforms where one THREAD_GROUP_SHARE_L1 exists while
> THREAD_GROUP_SHARE_L2 doesn't exist. It becomes easier if these are
> separately tracked. Also, what do we gain if we put this in the same
> loop? It will be (nr_possible_cpus * 2 * invocations of
> init_cpu_cache_map()) as opposed to 2 * (nr_possible_cpus *
> invocations of init_cpu_cache_map()). Isn't it ?
>
Its not about the number of invocations but per-cpu thread group list
that would need not be loaded again. Currently they would probably be in the
cache-line, but get dropped to be loaded again in the next loop.
And we still can support platforms with only THREAD_GROUP_SHARE_L1 since
parse_thread_groups would have given us how many levels of thread groups are
supported on a platform.

> >
> > > + pr_info("Thread-groups in a core share L2-cache\n");
> >
> > Can this be moved to a pr_debug? Does it help any regular user/admins to
> > know if thread-groups shared l2 cache. Infact it may confuse users on what
> > thread groups are and which thread groups dont share cache.
> > I would prefer some other name than thread_group_shares_l2 but dont know any
> > better alternatives and may be my choices are even worse.
>
> Would you be ok with "L2 cache shared by threads of the small core" ?

Sounds better to me. I would still think pr_debug is better since regular
Admins/users may not make too much information from this.
>
> >
> > Ah this can be simplified to:
> > if (thread_group_shares_l2) {
> > cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> >
> > for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
> > if (cpu_online(i))
> > set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > }
>
> Don't we want to enforce that the siblings sharing L1 be a subset of
> the siblings sharing L2 ? Or do you recommend putting in a check for
> that somewhere ?
>
I didnt think about the case where the device-tree could show L2 to be a
subset of L1.

How about initializing thread_group_l2_cache_map itself with
cpu_l1_cache_map. It would be a simple one time operation and reduce the
overhead here every CPU online.
And it would help in your subsequent patch too. We dont want the cacheinfo
for L1 showing CPUs not present in L2.

--
Thanks and Regards
Srikar Dronamraju