2020-07-21 11:40:34

by Srikar Dronamraju

Subject: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

Currently "CACHE" domain happens to be the 2nd sched domain as per
powerpc_topology. This domain will collapse if cpumask of l2-cache is
same as SMT domain. However we could generalize this domain such that it
could mean either be a "CACHE" domain or a "BIGCORE" domain.

While setting up the "CACHE" domain, check if shared_cache is already
set.

Cc: linuxppc-dev <[email protected]>
Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Oliver OHalloran <[email protected]>
Cc: Nathan Lynch <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
Cc: Jordan Niethe <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v1 -> v2:
powerpc/smp: Generalize 2nd sched domain
Moved shared_cache topology fixup to fixup_topology (Gautham)

arch/powerpc/kernel/smp.c | 49 ++++++++++++++++++++++++++++-----------
1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 57468877499a..933ebdf97432 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
EXPORT_PER_CPU_SYMBOL(cpu_core_map);
EXPORT_SYMBOL_GPL(has_big_cores);

+enum {
+#ifdef CONFIG_SCHED_SMT
+ smt_idx,
+#endif
+ bigcore_idx,
+ die_idx,
+};
+
#define MAX_THREAD_LIST_SIZE 8
#define THREAD_GROUP_SHARE_L1 1
struct thread_groups {
@@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
*/
static const struct cpumask *shared_cache_mask(int cpu)
{
- if (shared_caches)
- return cpu_l2_cache_mask(cpu);
-
- if (has_big_cores)
- return cpu_smallcore_mask(cpu);
-
- return per_cpu(cpu_sibling_map, cpu);
+ return per_cpu(cpu_l2_cache_map, cpu);
}

#ifdef CONFIG_SCHED_SMT
@@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
}
#endif

+static const struct cpumask *cpu_bigcore_mask(int cpu)
+{
+ return per_cpu(cpu_sibling_map, cpu);
+}
+
static struct sched_domain_topology_level powerpc_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
#endif
- { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+ { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
{ NULL, },
};
@@ -1313,7 +1320,6 @@ static void add_cpu_to_masks(int cpu)
void start_secondary(void *unused)
{
unsigned int cpu = smp_processor_id();
- struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;

mmgrab(&init_mm);
current->active_mm = &init_mm;
@@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
/* Update topology CPU masks */
add_cpu_to_masks(cpu);

- if (has_big_cores)
- sibling_mask = cpu_smallcore_mask;
/*
* Check for any shared caches. Note that this must be done on a
* per-core basis because one core in the pair might be disabled.
*/
- if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
- shared_caches = true;
+ if (!shared_caches) {
+ struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+ struct cpumask *mask = cpu_l2_cache_mask(cpu);
+
+ if (has_big_cores)
+ sibling_mask = cpu_smallcore_mask;
+
+ if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
+ shared_caches = true;
+ }

set_numa_node(numa_cpu_lookup_table[cpu]);
set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
@@ -1374,10 +1386,19 @@ int setup_profiling_timer(unsigned int multiplier)

static void fixup_topology(void)
{
+ if (shared_caches) {
+ pr_info("Using shared cache scheduler topology\n");
+ powerpc_topology[bigcore_idx].mask = shared_cache_mask;
+#ifdef CONFIG_SCHED_DEBUG
+ powerpc_topology[bigcore_idx].name = "CACHE";
+#endif
+ powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
+ }
+
#ifdef CONFIG_SCHED_SMT
if (has_big_cores) {
pr_info("Big cores detected but using small core scheduling\n");
- powerpc_topology[0].mask = smallcore_smt_mask;
+ powerpc_topology[smt_idx].mask = smallcore_smt_mask;
}
#endif
}
--
2.17.1


2020-07-22 06:58:01

by Gautham R Shenoy

Subject: Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

Hello Srikar,

On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> Currently the "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if the cpumask of the
> l2-cache is the same as that of the SMT domain. However, we could
> generalize this domain such that it could either be a "CACHE" domain
> or a "BIGCORE" domain.
>
> While setting up the "CACHE" domain, check if shared_caches is already
> set.
>
> Cc: linuxppc-dev <[email protected]>
> Cc: LKML <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Oliver OHalloran <[email protected]>
> Cc: Nathan Lynch <[email protected]>
> Cc: Michael Neuling <[email protected]>
> Cc: Anton Blanchard <[email protected]>
> Cc: Gautham R Shenoy <[email protected]>
> Cc: Vaidyanathan Srinivasan <[email protected]>
> Cc: Jordan Niethe <[email protected]>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> Changelog v1 -> v2:
> powerpc/smp: Generalize 2nd sched domain
> Moved shared_cache topology fixup to fixup_topology (Gautham)
>

Just one comment below.

> arch/powerpc/kernel/smp.c | 49 ++++++++++++++++++++++++++++-----------
> 1 file changed, 35 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 57468877499a..933ebdf97432 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
> EXPORT_PER_CPU_SYMBOL(cpu_core_map);
> EXPORT_SYMBOL_GPL(has_big_cores);
>
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> + smt_idx,
> +#endif
> + bigcore_idx,
> + die_idx,
> +};
> +


[..snip..]

> @@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
> /* Update topology CPU masks */
> add_cpu_to_masks(cpu);
>
> - if (has_big_cores)
> - sibling_mask = cpu_smallcore_mask;
> /*
> * Check for any shared caches. Note that this must be done on a
> * per-core basis because one core in the pair might be disabled.
> */
> - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> - shared_caches = true;
> + if (!shared_caches) {
> + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + sibling_mask = cpu_smallcore_mask;
> +
> + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> + shared_caches = true;

At the risk of repeating my comment on the v1 version of the patch, we
have shared caches only if l2_cache_mask(cpu) is a strict superset of
sibling_mask(cpu).

"cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))" does not
capture this.

Could we please use

	if (!cpumask_equal(sibling_mask(cpu), mask) &&
	    cpumask_subset(sibling_mask(cpu), mask)) {
	}

?


> + }
>
> set_numa_node(numa_cpu_lookup_table[cpu]);
> set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1374,10 +1386,19 @@ int setup_profiling_timer(unsigned int multiplier)
>
> static void fixup_topology(void)
> {
> + if (shared_caches) {
> + pr_info("Using shared cache scheduler topology\n");
> + powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> +#ifdef CONFIG_SCHED_DEBUG
> + powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> + powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> + }
> +
> #ifdef CONFIG_SCHED_SMT
> if (has_big_cores) {
> pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> + powerpc_topology[smt_idx].mask = smallcore_smt_mask;
> }
> #endif


Otherwise the patch looks good to me.

--
Thanks and Regards
gautham.

2020-07-22 07:41:28

by Srikar Dronamraju

Subject: Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

* Gautham R Shenoy <[email protected]> [2020-07-22 12:26:40]:

> Hello Srikar,
>
> On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > Currently the "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if the cpumask of the
> > l2-cache is the same as that of the SMT domain. However, we could
> > generalize this domain such that it could either be a "CACHE" domain
> > or a "BIGCORE" domain.
> >
> > While setting up the "CACHE" domain, check if shared_caches is already
> > set.
> > @@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
> > /* Update topology CPU masks */
> > add_cpu_to_masks(cpu);
> >
> > - if (has_big_cores)
> > - sibling_mask = cpu_smallcore_mask;
> > /*
> > * Check for any shared caches. Note that this must be done on a
> > * per-core basis because one core in the pair might be disabled.
> > */
> > - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> > - shared_caches = true;
> > + if (!shared_caches) {
> > + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> > + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> > +
> > + if (has_big_cores)
> > + sibling_mask = cpu_smallcore_mask;
> > +
> > + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> > + shared_caches = true;
>
> At the risk of repeating my comment on the v1 version of the patch, we
> have shared caches only if l2_cache_mask(cpu) is a strict superset of
> sibling_mask(cpu).
>
> "cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))" does not
> capture this.

Why would it not? We are setting shared_caches if and only if l2_cache_mask
is a strict superset of sibling/smallcore mask.

> Could we please use
>
> 	if (!cpumask_equal(sibling_mask(cpu), mask) &&
> 	    cpumask_subset(sibling_mask(cpu), mask)) {
> 	}
>

The scheduler mandates that two cpumasks for the same CPU either be
equal or that one of them be a strict superset of the other; if not,
the scheduler would mark our domains as broken. That being the case,
cpumask_weight() correctly captures what we are looking for. That said,
your condition is also correct, but it is slightly heavier and doesn't
give us any more information or correctness.
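
To illustrate the argument, here is a minimal standalone sketch (not
kernel code; plain unsigned long bitmasks stand in for struct cpumask):
when the smaller mask is guaranteed to be either equal to or a strict
subset of the larger one, the weight comparison and the explicit
"not equal and subset" test give the same answer.

#include <stdio.h>

/* like cpumask_weight() */
static int weight(unsigned long m)
{
	return __builtin_popcountl(m);
}

/* nonzero if every bit of a is also set in b, i.e. a is a subset of b */
static int is_subset(unsigned long a, unsigned long b)
{
	return (a & ~b) == 0;
}

static void check(unsigned long sibling, unsigned long l2)
{
	int by_weight = weight(l2) > weight(sibling);
	int by_subset = (l2 != sibling) && is_subset(sibling, l2);

	printf("by_weight=%d by_subset=%d\n", by_weight, by_subset);
}

int main(void)
{
	/* small core = threads 0,2,4,6; L2 spans all 8 threads */
	check(0x55UL, 0xffUL);	/* by_weight=1 by_subset=1 */

	/* L2 mask equal to the sibling mask: no shared cache either way */
	check(0x55UL, 0x55UL);	/* by_weight=0 by_subset=0 */

	return 0;
}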

>
> Otherwise the patch looks good to me.
>
> --
> Thanks and Regards
> gautham.

--
Thanks and Regards
Srikar Dronamraju

2020-07-22 07:49:31

by Peter Zijlstra

Subject: Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> Currently the "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if the cpumask of the
> l2-cache is the same as that of the SMT domain. However, we could
> generalize this domain such that it could either be a "CACHE" domain
> or a "BIGCORE" domain.

What's the whole smallcore vs bigcore thing?

Would it make sense to have an actual overview of the various topology
layers somewhere? Because I'm getting lost and can't really make sense
of the code due to that.

2020-07-22 08:19:25

by Srikar Dronamraju

Subject: Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

* [email protected] <[email protected]> [2020-07-22 09:46:24]:

> On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > Currently the "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if the cpumask of the
> > l2-cache is the same as that of the SMT domain. However, we could
> > generalize this domain such that it could either be a "CACHE" domain
> > or a "BIGCORE" domain.
>
> What's the whole smallcore vs bigcore thing?
>
> Would it make sense to have an actual overview of the various topology
> layers somewhere? Because I'm getting lost and can't really make sense
> of the code due to that.

To explain with an example (Power9):

Power9 is an SMT8 core by design. However, this 8-thread core can run
as 2 independent cores, with threads 0,2,4,6 acting as one core and
threads 1,3,5,7 acting as another core.

The firmware can decide to expose them either as 2 independent small
cores or as one big core. However, the LLC (i.e. the last level of
cache) is shared between all the threads of the SMT8 core. On future
Power chips the LLC might change: it may be expanded to be shared with
another SMT8 core, or it could be made specific to an SMT4 core.

The SMT8 core is also known as a fused core or big core.
The smaller SMT4 core is known as a small core.

So on Power9 bare metal, the firmware exposes SMT4 cores, and we have a
CACHE level at SMT8. The CACHE level would be very similar to the MC
domain on x86.
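
To make the interleaving concrete, here is a small standalone sketch
(an illustration only, not kernel code) that prints the thread IDs of
the two small cores inside one fused SMT8 big core:

#include <stdio.h>

#define THREADS_PER_BIG_CORE	8

int main(void)
{
	int big_core = 0;	/* first big core */
	int base = big_core * THREADS_PER_BIG_CORE;

	/* the two SMT4 small cores take the even and odd threads */
	for (int small = 0; small < 2; small++) {
		printf("small core %d:", small);
		for (int t = small; t < THREADS_PER_BIG_CORE; t += 2)
			printf(" %d", base + t);
		printf("\n");
	}

	/* all eight threads still share the big core's LLC */
	return 0;
}

This prints "small core 0: 0 2 4 6" and "small core 1: 1 3 5 7".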

Hope this is clear.

--
Thanks and Regards
Srikar Dronamraju

2020-07-22 08:52:03

by Peter Zijlstra

Subject: Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

On Wed, Jul 22, 2020 at 01:48:22PM +0530, Srikar Dronamraju wrote:
> * [email protected] <[email protected]> [2020-07-22 09:46:24]:
>
> > On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > > Currently the "CACHE" domain happens to be the 2nd sched domain as per
> > > powerpc_topology. This domain will collapse if the cpumask of the
> > > l2-cache is the same as that of the SMT domain. However, we could
> > > generalize this domain such that it could either be a "CACHE" domain
> > > or a "BIGCORE" domain.
> >
> > What's the whole smallcore vs bigcore thing?
> >
> > Would it make sense to have an actual overview of the various topology
> > layers somewhere? Because I'm getting lost and can't really make sense
> > of the code due to that.
>
> To explain with an example (Power9):
>
> Power9 is an SMT8 core by design. However, this 8-thread core can run
> as 2 independent cores, with threads 0,2,4,6 acting as one core and
> threads 1,3,5,7 acting as another core.
>
> The firmware can decide to expose them either as 2 independent small
> cores or as one big core. However, the LLC (i.e. the last level of
> cache) is shared between all the threads of the SMT8 core. On future
> Power chips the LLC might change: it may be expanded to be shared with
> another SMT8 core, or it could be made specific to an SMT4 core.
>
> The SMT8 core is also known as a fused core or big core.
> The smaller SMT4 core is known as a small core.
>
> So on Power9 bare metal, the firmware exposes SMT4 cores, and we have a
> CACHE level at SMT8. The CACHE level would be very similar to the MC
> domain on x86.
>
> Hope this is clear.

Ooh, that thing. I thought P9 was in actual fact an SMT4 hardware part,
but to be compatible with P8 it has this fused option where two cores
act like a single SMT8 part.

The interleaved enumeration is done because of the P7 asymmetric cores,
resulting in the scheduler preferring to use the lower-numbered threads.

Combined, that results in a P9-fused configuration using two
independent full cores when there are only 2 runnable threads.

Which is a subtly different story from yours.

But reading your explanation, it looks like the Linux topology setup
could actually unravel the fused-faux-SMT8 into two independent SMT4
parts, negating that firmware option.

Anyway, a few comments just around there might be helpful.


Thanks!

2020-07-22 08:54:53

by Peter Zijlstra

Subject: Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

On Wed, Jul 22, 2020 at 10:48:54AM +0200, Peter Zijlstra wrote:
> But reading your explanation, it looks like the Linux topology setup
> could actually unravel the fused-faux-SMT8 into two independent SMT4
> parts, negating that firmware option.

Ah, it looks like that's exactly what you're doing. Nice!