2022-05-25 10:48:04

by Sudeep Holla

Subject: [PATCH v3 14/16] arch_topology: Drop unnecessary check for uninitialised package_id

With the support for parsing socket nodes from the device tree and
assigning 0 as the package_id in the absence of socket nodes, there is no
need to check for an invalid package_id. It is always initialised either
to 0 or to the value derived from the device tree socket nodes.

Just drop the now redundant check for an uninitialised package_id.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7876a7a82ec..b8f0d72908c8 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -606,7 +606,6 @@ static int __init parse_dt_topology(void)
{
struct device_node *cn, *map;
int ret = 0;
- int cpu;

cn = of_find_node_by_path("/cpus");
if (!cn) {
@@ -628,16 +627,6 @@ static int __init parse_dt_topology(void)

topology_normalize_cpu_scale();

- /*
- * Check that all cores are in the topology; the SMP code will
- * only mark cores described in the DT as possible.
- */
- for_each_possible_cpu(cpu)
- if (cpu_topology[cpu].package_id < 0) {
- ret = -EINVAL;
- break;
- }
-
out_map:
of_node_put(map);
out:
--
2.36.1



2022-05-25 12:18:39

by Sudeep Holla

Subject: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

Let us set the cluster identifier as parsed from the device tree
cluster nodes within /cpu-map.

We don't support nesting of clusters yet as there is no real hardware
to support clusters of clusters.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index b8f0d72908c8..5f4f148a7769 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node)
}

static int __init parse_core(struct device_node *core, int package_id,
- int core_id)
+ int cluster_id, int core_id)
{
char name[20];
bool leaf = true;
@@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id,
cpu = get_cpu_for_node(t);
if (cpu >= 0) {
cpu_topology[cpu].package_id = package_id;
+ cpu_topology[cpu].cluster_id = cluster_id;
cpu_topology[cpu].core_id = core_id;
cpu_topology[cpu].thread_id = i;
} else if (cpu != -ENODEV) {
@@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id,
}

cpu_topology[cpu].package_id = package_id;
+ cpu_topology[cpu].cluster_id = cluster_id;
cpu_topology[cpu].core_id = core_id;
} else if (leaf && cpu != -ENODEV) {
pr_err("%pOF: Can't get CPU for leaf core\n", core);
@@ -538,7 +540,8 @@ static int __init parse_core(struct device_node *core, int package_id,
return 0;
}

-static int __init parse_cluster(struct device_node *cluster, int depth)
+static int __init
+parse_cluster(struct device_node *cluster, int cluster_id, int depth)
{
char name[20];
bool leaf = true;
@@ -558,7 +561,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
c = of_get_child_by_name(cluster, name);
if (c) {
leaf = false;
- ret = parse_cluster(c, depth + 1);
+ ret = parse_cluster(c, i, depth + 1);
of_node_put(c);
if (ret != 0)
return ret;
@@ -582,7 +585,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
}

if (leaf) {
- ret = parse_core(c, 0, core_id++);
+ ret = parse_core(c, 0, cluster_id, core_id++);
} else {
pr_err("%pOF: Non-leaf cluster with core %s\n",
cluster, name);
@@ -621,7 +624,7 @@ static int __init parse_dt_topology(void)
if (!map)
goto out;

- ret = parse_cluster(map, 0);
+ ret = parse_cluster(map, -1, 0);
if (ret != 0)
goto out_map;

--
2.36.1


2022-05-25 21:08:27

by Sudeep Holla

Subject: [PATCH v3 16/16] arch_topology: Add support for parsing sockets in /cpu-map

Finally, let us add support for socket nodes in /cpu-map in the device
tree. Since these may not be present on all the old platforms, and even on
most of the existing platforms, we need to assume that the absence of a
socket node indicates a single socket system and handle it appropriately.

Also, it is likely that most single socket systems skip the socket node
since it is optional.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 37 +++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 5f4f148a7769..676e0ed333b1 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -540,8 +540,8 @@ static int __init parse_core(struct device_node *core, int package_id,
return 0;
}

-static int __init
-parse_cluster(struct device_node *cluster, int cluster_id, int depth)
+static int __init parse_cluster(struct device_node *cluster, int package_id,
+ int cluster_id, int depth)
{
char name[20];
bool leaf = true;
@@ -561,7 +561,7 @@ parse_cluster(struct device_node *cluster, int cluster_id, int depth)
c = of_get_child_by_name(cluster, name);
if (c) {
leaf = false;
- ret = parse_cluster(c, i, depth + 1);
+ ret = parse_cluster(c, package_id, i, depth + 1);
of_node_put(c);
if (ret != 0)
return ret;
@@ -585,7 +585,8 @@ parse_cluster(struct device_node *cluster, int cluster_id, int depth)
}

if (leaf) {
- ret = parse_core(c, 0, cluster_id, core_id++);
+ ret = parse_core(c, package_id, cluster_id,
+ core_id++);
} else {
pr_err("%pOF: Non-leaf cluster with core %s\n",
cluster, name);
@@ -605,6 +606,32 @@ parse_cluster(struct device_node *cluster, int cluster_id, int depth)
return 0;
}

+static int __init parse_socket(struct device_node *socket)
+{
+ char name[20];
+ struct device_node *c;
+ bool has_socket = false;
+ int package_id = 0, ret;
+
+ do {
+ snprintf(name, sizeof(name), "socket%d", package_id);
+ c = of_get_child_by_name(socket, name);
+ if (c) {
+ has_socket = true;
+ ret = parse_cluster(c, package_id, -1, 0);
+ of_node_put(c);
+ if (ret != 0)
+ return ret;
+ }
+ package_id++;
+ } while (c);
+
+ if (!has_socket)
+ ret = parse_cluster(socket, 0, -1, 0);
+
+ return ret;
+}
+
static int __init parse_dt_topology(void)
{
struct device_node *cn, *map;
@@ -624,7 +651,7 @@ static int __init parse_dt_topology(void)
if (!map)
goto out;

- ret = parse_cluster(map, -1, 0);
+ ret = parse_socket(map);
if (ret != 0)
goto out_map;

--
2.36.1


2022-06-03 18:01:25

by Dietmar Eggemann

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On 25/05/2022 10:14, Sudeep Holla wrote:
> Let us set the cluster identifier as parsed from the device tree
> cluster nodes within /cpu-map.
>
> We don't support nesting of clusters yet as there are no real hardware
> to support clusters of clusters.
>
> Signed-off-by: Sudeep Holla <[email protected]>
> ---
> drivers/base/arch_topology.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index b8f0d72908c8..5f4f148a7769 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node)
> }
>
> static int __init parse_core(struct device_node *core, int package_id,
> - int core_id)
> + int cluster_id, int core_id)
> {
> char name[20];
> bool leaf = true;
> @@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> cpu = get_cpu_for_node(t);
> if (cpu >= 0) {
> cpu_topology[cpu].package_id = package_id;
> + cpu_topology[cpu].cluster_id = cluster_id;
> cpu_topology[cpu].core_id = core_id;
> cpu_topology[cpu].thread_id = i;
> } else if (cpu != -ENODEV) {
> @@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> }
>
> cpu_topology[cpu].package_id = package_id;
> + cpu_topology[cpu].cluster_id = cluster_id;

I'm still not convinced that this is the right thing to do. Let's take
the Juno board as an example here. And I guess our assumption should be
that we want to make CONFIG_SCHED_CLUSTER a default option, like
CONFIG_SCHED_MC is, simply to avoid an unmanageable zoo of config-option
combinations.

(1) Scheduler Domains (SDs) w/o CONFIG_SCHED_CLUSTER:

MC <-- !!!
DIE

(2) SDs w/ CONFIG_SCHED_CLUSTER:

CLS <-- !!!
DIE

In (2) MC gets degenerated in sd_parent_degenerate() since the CLS and MC
cpumasks are equal and MC does not have any additional flags compared to
CLS.
I'm not convinced that we can change the degeneration rules so that MC
stays and CLS gets removed instead here, without destroying other
scheduler scenarios.
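
To illustrate the degeneration condition above, here is a minimal sketch
(a hypothetical helper, not the kernel's sd_parent_degenerate(), assuming
only the generic cpumask API): a parent sched domain is redundant when it
spans exactly the same CPUs as its child and carries no flags beyond the
child's, which is the MC-vs-CLS situation described here.

#include <linux/cpumask.h>

/*
 * Hypothetical, simplified sketch of the degeneration condition: the
 * parent domain adds nothing if its span equals the child's span and
 * it sets no extra behavioural flags.
 */
static bool parent_sd_is_redundant(const struct cpumask *child_span,
				   unsigned long child_flags,
				   const struct cpumask *parent_span,
				   unsigned long parent_flags)
{
	return cpumask_equal(child_span, parent_span) &&
	       !(parent_flags & ~child_flags);
}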

Even though MC and CLS are doing the same thing right now from the
perspective of the scheduler, we should also see MC and not CLS under (2).
CLS only makes sense longer term if the scheduler also makes use of it
(next to MC), in the wakeup path for instance. Especially when this
happens, a platform should always construct the same scheduler domain
hierarchy, no matter which CONFIG_SCHED_XXX options are enabled.


You can see this in update_siblings_masks()

if (last_level_cache_is_shared)
set llc_sibling

if (cpuid_topo->package_id != cpu_topo->package_id)
continue

set core_sibling

If llc cache and socket boundaries are congruent, llc_sibling and
core_sibling are the same.

if (cpuid_topo->cluster_id != cpu_topo->cluster_id)
continue

set cluster_sibling

Now we potentially set clusters. Since socket=0 is the default and we
use the existing juno.dts, the cluster nodes end up being congruent to
the llc cache cpumasks as well.

The problem is that we code `llc cache` and `DT cluster nodes` as the
same thing in juno.dts. `Cluster0/1` are congruent with the llc
information, although they should actually be `socket0/1` right now. But
we can't set up a cpu-map with a `socketX` containing `coreY` directly.
And then we use llc_sibling and cluster_sibling in two different SD
cpumask functions (cpu_coregroup_mask() and cpu_clustergroup_mask()).

Remember, CONFIG_SCHED_CLUSTER was introduced in ACPI/PPTT as a cpumask
which is a subset of the cpumasks of CONFIG_SCHED_MC.

---

IMHO we probably could just introduce your changes w/o setting `cpu-map
cluster nodes` in DT for now. We would just have to make sure that for
all `*.dts` affected, the `llc cache` info can take over the old role of
the `cluster nodes`. In this case e.g. Juno ends up with MC, DIE no
matter if CONFIG_SCHED_CLUSTER is set or not.

[...]

2022-06-06 10:45:22

by Sudeep Holla

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Fri, Jun 03, 2022 at 02:30:04PM +0200, Dietmar Eggemann wrote:
> On 25/05/2022 10:14, Sudeep Holla wrote:
> > Let us set the cluster identifier as parsed from the device tree
> > cluster nodes within /cpu-map.
> >
> > We don't support nesting of clusters yet as there are no real hardware
> > to support clusters of clusters.
> >
> > Signed-off-by: Sudeep Holla <[email protected]>
> > ---
> > drivers/base/arch_topology.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index b8f0d72908c8..5f4f148a7769 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node)
> > }
> >
> > static int __init parse_core(struct device_node *core, int package_id,
> > - int core_id)
> > + int cluster_id, int core_id)
> > {
> > char name[20];
> > bool leaf = true;
> > @@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> > cpu = get_cpu_for_node(t);
> > if (cpu >= 0) {
> > cpu_topology[cpu].package_id = package_id;
> > + cpu_topology[cpu].cluster_id = cluster_id;
> > cpu_topology[cpu].core_id = core_id;
> > cpu_topology[cpu].thread_id = i;
> > } else if (cpu != -ENODEV) {
> > @@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> > }
> >
> > cpu_topology[cpu].package_id = package_id;
> > + cpu_topology[cpu].cluster_id = cluster_id;
>
> I'm still not convinced that this is the right thing to do. Let's take
> the juno board as an example here. And I guess our assumption should be
> that we want to make CONFIG_SCHED_CLUSTER a default option, like
> CONFIG_SCHED_MC is. Simply to avoid a unmanageable zoo of config-option
> combinations.
>

Agreed on the config part.

> (1) Scheduler Domains (SDs) w/o CONFIG_SCHED_CLUSTER:
>
> MC <-- !!!
> DIE
>
> (2) SDs w/ CONFIG_SCHED_CLUSTER:
>
> CLS <-- !!!
> DIE
>

Yes I have seen this.

> In (2) MC gets degenerated in sd_parent_degenerate() since CLS and MC
> cpumasks are equal and MC does not have any additional flags compared to
> CLS.
> I'm not convinced that we can change the degeneration rules without
> destroying other scenarios of the scheduler so that here MC stays and
> CLS gets removed instead.
>

Why? Are you suggesting that we shouldn't present the hardware cluster
to the topology because of the above reason? If so, sorry, that is not a
valid reason. We could add logic to return NULL or whatever appropriate
value is needed in cpu_clustergroup_mask() if it matches the MC level mask,
in case we can't deal with that in the generic scheduler code (see the
sketch below). But the topology code can't be compromised for that reason
as it is user visible.
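
For illustration only, a sketch of the kind of fallback alluded to above
(not part of this series; it assumes the existing cpu_topology[],
cpu_coregroup_mask() and topology_sibling_cpumask() helpers):

#include <linux/arch_topology.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

const struct cpumask *cpu_clustergroup_mask(int cpu)
{
	/*
	 * If the cluster siblings span as much as (or more than) the
	 * coregroup (MC level) mask, returning them adds no information;
	 * fall back to the SMT siblings so CLS stays strictly below MC.
	 */
	if (cpumask_subset(cpu_coregroup_mask(cpu),
			   &cpu_topology[cpu].cluster_sibling))
		return topology_sibling_cpumask(cpu);

	return &cpu_topology[cpu].cluster_sibling;
}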

> Even though MC and CLS are doing the same right now from the perspective
> of the scheduler, we should also see MC and not CLS under (2). CLS only
> makes sense longer term if the scheduler also makes use of it (next to
> MC) in the wakeup-path for instance. Especially when this happens, a
> platform should always construct the same scheduler domain hierarchy, no
> matter which CONFIG_SCHED_XXX options are enabled.
>
>
> You can see this in update_siblings_masks()
>
> if (last_level_cache_is_shared)
> set llc_sibling
>
> if (cpuid_topo->package_id != cpu_topo->package_id)
> continue
>
> set core_sibling
>
> If llc cache and socket boundaries are congruent, llc_sibling and
> core_sibling are the same.
>
> if (cpuid_topo->cluster_id != cpu_topo->cluster_id)
> continue
>
> set cluster_sibling
>
> Now we potentially set clusters. Since socket=0 is by default and we
> use the existing juno.dts, the cluster nodes end up being congruent to
> the llc cache cpumasks as well.
>

Correct, and I see no problem, as it matches what the hardware is. So I am
not expecting any change in any of the cpumasks there, as they are all
aligned with the hardware.

> The problem is that we code `llc cache` and `DT cluster nodes` as the
> same thing in juno.dts.

Why is that a problem? If so, blame the hardware and deal with it as we
have to; as usual we get all sorts of topologies.

> `Cluster0/1` are congruent with the llc information, although they should
> be actually `socket0/1` right now.

That was complete nonsense and wrong. Boot and check in ACPI mode.

> But we can't set-up a cpu-map with a `socketX` containing `coreY` directly.
> And then we use llc_sibling and cluster_sibling in two different SD
> cpumask functions (cpu_coregroup_mask() and cpu_clustergroup_mask()).
>

We just need to deal with that. How is that dealt with today with ACPI? My
changes are making these aligned with ACPI. If something is broken as
per your understanding with ACPI, then that needs fixing. The topology
presented and parsed by ACPI is correct and we are aligning DT with that.
There is no question about that.

> Remember, CONFIG_SCHED_CLUSTER was introduced in ACPI/PPTT as a cpumask
> which is a subset of the cpumasks of CONFIG_SCHED_MC.
>

But that change also introduced cluster masks into the topology which again
aligns with my changes.

> IMHO we probably could just introduce your changes w/o setting `cpu-map
> cluster nodes` in DT for now. We would just have to make sure that for
> all `*.dts` affected, the `llc cache` info can take over the old role of
> the `cluster nodes`. In this case e.g. Juno ends up with MC, DIE no
> matter if CONFIG_SCHED_CLUSTER is set or not.

Sure, I can agree with that if Juno ACPI is not broken. But I am sure it is
broken based on your argument above. If it is, that needs fixing; this
series just gets the topology parsing in both ACPI and DT aligned, nothing
more, nothing less. In the process it may be introducing clusters, but
if that is not dealt with correctly in ACPI, then it won't be in DT either
and needs fixing anyway.

--
Regards,
Sudeep

2022-06-10 10:32:36

by Vincent Guittot

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
>
> On Fri, Jun 03, 2022 at 02:30:04PM +0200, Dietmar Eggemann wrote:
> > On 25/05/2022 10:14, Sudeep Holla wrote:
> > > Let us set the cluster identifier as parsed from the device tree
> > > cluster nodes within /cpu-map.
> > >
> > > We don't support nesting of clusters yet as there are no real hardware
> > > to support clusters of clusters.
> > >
> > > Signed-off-by: Sudeep Holla <[email protected]>
> > > ---
> > > drivers/base/arch_topology.c | 13 ++++++++-----
> > > 1 file changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > > index b8f0d72908c8..5f4f148a7769 100644
> > > --- a/drivers/base/arch_topology.c
> > > +++ b/drivers/base/arch_topology.c
> > > @@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node)
> > > }
> > >
> > > static int __init parse_core(struct device_node *core, int package_id,
> > > - int core_id)
> > > + int cluster_id, int core_id)
> > > {
> > > char name[20];
> > > bool leaf = true;
> > > @@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> > > cpu = get_cpu_for_node(t);
> > > if (cpu >= 0) {
> > > cpu_topology[cpu].package_id = package_id;
> > > + cpu_topology[cpu].cluster_id = cluster_id;
> > > cpu_topology[cpu].core_id = core_id;
> > > cpu_topology[cpu].thread_id = i;
> > > } else if (cpu != -ENODEV) {
> > > @@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> > > }
> > >
> > > cpu_topology[cpu].package_id = package_id;
> > > + cpu_topology[cpu].cluster_id = cluster_id;
> >
> > I'm still not convinced that this is the right thing to do. Let's take
> > the juno board as an example here. And I guess our assumption should be
> > that we want to make CONFIG_SCHED_CLUSTER a default option, like
> > CONFIG_SCHED_MC is. Simply to avoid a unmanageable zoo of config-option
> > combinations.
> >
>
> Agreed on the config part.
>
> > (1) Scheduler Domains (SDs) w/o CONFIG_SCHED_CLUSTER:
> >
> > MC <-- !!!
> > DIE
> >
> > (2) SDs w/ CONFIG_SCHED_CLUSTER:
> >
> > CLS <-- !!!
> > DIE
> >
>
> Yes I have seen this.
>
> > In (2) MC gets degenerated in sd_parent_degenerate() since CLS and MC
> > cpumasks are equal and MC does not have any additional flags compared to
> > CLS.
> > I'm not convinced that we can change the degeneration rules without
> > destroying other scenarios of the scheduler so that here MC stays and
> > CLS gets removed instead.
> >
>
> Why ? Are you suggesting that we shouldn't present the hardware cluster
> to the topology because of the above reason ? If so, sorry that is not a
> valid reason. We could add login to return NULL or appropriate value
> needed in cpu_clustergroup_mask id it matches MC level mask if we can't
> deal that in generic scheduler code. But the topology code can't be
> compromised for that reason as it is user visible.

I tend to agree with Dietmar. The legacy use of the cluster node in DT
refers to the dynamiQ or legacy b.L cluster, which is also aligned with
the LLC and the MC scheduling level. The new cluster level that has
been introduced recently does not target this level but some
intermediate level, either inside it, like for the Kunpeng 920 or the v9
complex, or outside it, like for the Ampere Altra. So I would say that
there is one cluster node level in DT that refers to the same MC/LLC
level, and only an additional child/parent cluster node should be used
to fill the clustergroup_mask.

IIUC, we don't describe the dynamiQ level in ACPI, which uses the cache
topology instead to define cpu_coregroup_mask, whereas DT describes the
dynamiQ instead of using the cache topology. If you use the cache topology
now, then you should skip the dynamiQ.

Finally, even if CLS and MC have the same scheduling behavior for now,
they might end up with different scheduling properties, which would
mean that replacing the MC level with the CLS one for current SoCs would
become wrong.

>
> > Even though MC and CLS are doing the same right now from the perspective
> > of the scheduler, we should also see MC and not CLS under (2). CLS only
> > makes sense longer term if the scheduler also makes use of it (next to
> > MC) in the wakeup-path for instance. Especially when this happens, a
> > platform should always construct the same scheduler domain hierarchy, no
> > matter which CONFIG_SCHED_XXX options are enabled.
> >
> >
> > You can see this in update_siblings_masks()
> >
> > if (last_level_cache_is_shared)
> > set llc_sibling
> >
> > if (cpuid_topo->package_id != cpu_topo->package_id)
> > continue
> >
> > set core_sibling
> >
> > If llc cache and socket boundaries are congruent, llc_sibling and
> > core_sibling are the same.
> >
> > if (cpuid_topo->cluster_id != cpu_topo->cluster_id)
> > continue
> >
> > set cluster_sibling
> >
> > Now we potentially set clusters. Since socket=0 is by default and we
> > use the existing juno.dts, the cluster nodes end up being congruent to
> > the llc cache cpumasks as well.
> >
>
> Correct and I see no problems as it matches what the hardware is. So I am
> not expecting any change in any cpumasks there as they all are aligned with
> the hardware.
>
> > The problem is that we code `llc cache` and `DT cluster nodes` as the
> > same thing in juno.dts.
>
> Why is that a problem ? If so, blame hardware and deal with it as we have to
> ???? as usual we get all sorts of topology.
>
> > `Cluster0/1` are congruent with the llc information, although they should
> > be actually `socket0/1` right now.
>
> That was complete non-sense and wrong. Boot and check in ACPI mode.
>
> > But we can't set-up a cpu-map with a `socketX` containing `coreY` directly.
> > And then we use llc_sibling and cluster_sibling in two different SD
> > cpumask functions (cpu_coregroup_mask() and cpu_clustergroup_mask()).
> >
>
> We just need to deal with that. How is that dealt today with ACPI. My
> changes are making these aligned with ACPI. If something is broken as
> per you understanding with ACPI, then that needs fixing. The topology
> presented and parsed by ACPI is correct and we are aligning DT with that.
> There is no question on that.
>
> > Remember, CONFIG_SCHED_CLUSTER was introduced in ACPI/PPTT as a cpumask
> > which is a subset of the cpumasks of CONFIG_SCHED_MC.
> >
>
> But that change also introduced cluster masks into the topology which again
> aligns with my changes.
>
> > IMHO we probably could just introduce your changes w/o setting `cpu-map
> > cluster nodes` in DT for now. We would just have to make sure that for
> > all `*.dts` affected, the `llc cache` info can take over the old role of
> > the `cluster nodes`. In this case e.g. Juno ends up with MC, DIE no
> > matter if CONFIG_SCHED_CLUSTER is set or not.
>
> Sure I can agree with that if Juno ACPI is not broken. But I am sure it is
> broken based on your argument above. If it is, that needs fixing and this
> series just gets topology parsing in both ACPI and DT aligned, nothing
> more or nothing less. In the process it may be introducing clusters, but
> if it is not dealt correctly in ACPI, then it won't be in DT too and needs
> fixing anyways.
>
> --
> Regards,
> Sudeep

2022-06-10 10:37:30

by Sudeep Holla

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
> >

[...]

> > Why ? Are you suggesting that we shouldn't present the hardware cluster
> > to the topology because of the above reason ? If so, sorry that is not a
> > valid reason. We could add login to return NULL or appropriate value
> > needed in cpu_clustergroup_mask id it matches MC level mask if we can't
> > deal that in generic scheduler code. But the topology code can't be
> > compromised for that reason as it is user visible.
>
> I tend to agree with Dietmar. The legacy use of cluster node in DT
> refers to the dynamiQ or legacy b.L cluster which is also aligned to
> the LLC and the MC scheduling level. The new cluster level that has
> been introduced recently does not target this level but some
> intermediate levels either inside like for the kupeng920 or the v9
> complex or outside like for the ampere altra. So I would say that
> there is one cluster node level in DT that refers to the same MC/LLC
> level and only an additional child/parent cluster node should be used
> to fill the clustergroup_mask.
>

Again I completely disagree. Let us look at the problems separately.
The hardware topology that some of the tools like lscpu and lstopo expect
is what the hardware looks like and not the scheduler's view of the
hardware. So the topology masks that get exposed to user-space need fixing
even today. I have reports from various tooling people about the same.
E.g. Juno getting exposed as a dual-socket system is utter nonsense.

Yes, the scheduler uses most of the topology masks as is, but that is not
a must. There are the *group_mask functions that can implement what the
scheduler needs to be fed.

I am not sure why the 2 issues are getting mixed up, and that is the main
reason why I jumped into this: to make sure the topology masks are not
tampered with based on the way they need to be used for the scheduler.

Both ACPI and DT on a platform must present the exact same hardware
topology to user-space; there is no room for argument there.

> IIUC, we don't describe the dynamiQ level in ACPI which uses cache
> topology instead to define cpu_coregroup_mask whereas DT described the
> dynamiQ instead of using cache topology. If you use cache topology
> now, then you should skip the dynamiQ
>

Yes, unless someone can work out a binding to represent that and convince
DT maintainers ;).

> Finally, even if CLS and MC have the same scheduling behavior for now,
> they might ends up with different scheduling properties which would
> mean that replacing MC level by CLS one for current SoC would become
> wrong
>

Again, as I mentioned to Dietmar, that is something we can and must deal
with in those *group_mask functions, and not expect the topology masks to
be altered to meet CLS/MC or whatever the sched domains need. Sorry, that
is my strong opinion, as the topology is already user-space visible and
(tooling) people are complaining that DT systems are broken and don't
match ACPI systems.

So unless someone gives me non-scheduler and topology-specific reasons
to change that, sorry, but my opinion on this matter is not going to change ;).

You will get this view of the topology; find a way to manage with all those
*group_mask functions. By the way, it is already handled for ACPI systems,
so if you are not happy with that, then that needs fixing, as this change
set just aligns the behaviour with similar ACPI systems. So the Juno example
is incorrect, for the reason that the behaviour of the scheduler there is
different between DT and ACPI.

--
Regards,
Sudeep

2022-06-13 09:29:58

by Dietmar Eggemann

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On 10/06/2022 12:27, Sudeep Holla wrote:
> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
>>>
>
> [...]
>
>>> Why ? Are you suggesting that we shouldn't present the hardware cluster
>>> to the topology because of the above reason ? If so, sorry that is not a
>>> valid reason. We could add login to return NULL or appropriate value
>>> needed in cpu_clustergroup_mask id it matches MC level mask if we can't
>>> deal that in generic scheduler code. But the topology code can't be
>>> compromised for that reason as it is user visible.
>>
>> I tend to agree with Dietmar. The legacy use of cluster node in DT
>> refers to the dynamiQ or legacy b.L cluster which is also aligned to
>> the LLC and the MC scheduling level. The new cluster level that has
>> been introduced recently does not target this level but some
>> intermediate levels either inside like for the kupeng920 or the v9
>> complex or outside like for the ampere altra. So I would say that
>> there is one cluster node level in DT that refers to the same MC/LLC
>> level and only an additional child/parent cluster node should be used
>> to fill the clustergroup_mask.
>>
>
> Again I completely disagree. Let us look at the problems separately.
> The hardware topology that some of the tools like lscpu and lstopo expects
> what the hardware looks like and not the scheduler's view of the hardware.
> So the topology masks that gets exposed to the user-space needs fixing
> even today. I have reports from various tooling people about the same.
> E.g. Juno getting exposed as dual socket system is utter non-sense.
>
> Yes scheduler uses most of the topology masks as is but that is not a must.
> There are these *group_mask functions that can implement what scheduler
> needs to be fed.
>
> I am not sure why the 2 issues are getting mixed up and that is the main
> reason why I jumped into this to make sure the topology masks are
> not tampered based on the way it needs to be used for scheduler.

I'm all in favor of not mixing up those 2 issues. But I don't understand
why you have to glue them together.

(1) DT systems broken in userspace (lstopo shows Juno with 2 Packages)

(2) Introduce CONFIG_SCHED_CLUSTER for DT systems


(1) This can be solved with your patch-set w/o setting `(1st-level)
cpu-map cluster nodes`. The `socket nodes` taking over the
functionality of the `cluster nodes` sorts out the `Juno is seen as
having 2 packages` issue.
This will make core_sibling not suitable for cpu_coregroup_mask()
anymore. But this is OK since the llc info from cacheinfo (i.e.
llc_sibling) takes over here (see the sketch below).
There is no need to involve `cluster nodes` anymore.

(2) This will only make sense for Armv9 L2 complexes if we connect
`2nd-level cpu-map cluster nodes` with cluster_id and cluster_sibling.
And only then would clusters mean the same thing in ACPI and DT.
I guess this was mentioned already a couple of times.
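
A minimal sketch of what "llc takes over" could look like in practice
(hypothetical helper name, assuming the cpu_topology[] cpumask fields
that arch_topology.c already maintains; not the exact kernel
cpu_coregroup_mask()):

#include <linux/arch_topology.h>
#include <linux/cpumask.h>

/* Hypothetical, simplified selection of the coregroup (MC) span. */
static const struct cpumask *coregroup_mask_sketch(int cpu)
{
	const struct cpumask *mask = &cpu_topology[cpu].core_sibling;

	/*
	 * If the LLC siblings (from cacheinfo) form a non-empty subset of
	 * the package/core siblings, they give the tighter MC span (e.g.
	 * the old "cluster" span on Juno), so prefer them.
	 */
	if (!cpumask_empty(&cpu_topology[cpu].llc_sibling) &&
	    cpumask_subset(&cpu_topology[cpu].llc_sibling, mask))
		mask = &cpu_topology[cpu].llc_sibling;

	return mask;
}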

> Both ACPI and DT on a platform must present exact same hardware topology
> to the user-space, there is no space for argument there.
>
>> IIUC, we don't describe the dynamiQ level in ACPI which uses cache
>> topology instead to define cpu_coregroup_mask whereas DT described the
>> dynamiQ instead of using cache topology. If you use cache topology
>> now, then you should skip the dynamiQ
>>
>
> Yes, unless someone can work out a binding to represent that and convince
> DT maintainers ;).
>
>> Finally, even if CLS and MC have the same scheduling behavior for now,
>> they might ends up with different scheduling properties which would
>> mean that replacing MC level by CLS one for current SoC would become
>> wrong
>>
>
> Again as I mentioned to Dietmar, that is something we can and must deal with
> in those *group_mask and not expect topology mask to be altered to meet
> CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion
> as the topology is already user-space visible and (tooling) people are
> complaining that DT systems are broken and doesn't match ACPI systems.
>
> So unless someone gives me non-scheduler and topology specific reasons
> to change that, sorry but my opinion on this matter is not going to change ;).

`lstopo` is fine with a now-correct /sys/.../topology/package_cpus (or
core_siblings, the old filename). It's not reading
/sys/.../topology/cluster_cpus (yet), so why set it (wrongly) to 0x39 for
CPU0 on Juno when it can stay 0x01?

> You will get this view of topology, find a way to manage with all those
> *group_mask functions. By the way it is already handled for ACPI systems,
> so if you are not happy with that, then that needs fixing as this change
> set just aligns the behaviour on similar ACPI system. So the Juno example
> is incorrect for the reason that the behaviour of scheduler there is different
> with DT and ACPI.

[...]

2022-06-13 16:25:35

by Sudeep Holla

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
> On 10/06/2022 12:27, Sudeep Holla wrote:
> > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> >> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
> >>>
> >
> > [...]
> >
> >>> Why ? Are you suggesting that we shouldn't present the hardware cluster
> >>> to the topology because of the above reason ? If so, sorry that is not a
> >>> valid reason. We could add login to return NULL or appropriate value
> >>> needed in cpu_clustergroup_mask id it matches MC level mask if we can't
> >>> deal that in generic scheduler code. But the topology code can't be
> >>> compromised for that reason as it is user visible.
> >>
> >> I tend to agree with Dietmar. The legacy use of cluster node in DT
> >> refers to the dynamiQ or legacy b.L cluster which is also aligned to
> >> the LLC and the MC scheduling level. The new cluster level that has
> >> been introduced recently does not target this level but some
> >> intermediate levels either inside like for the kupeng920 or the v9
> >> complex or outside like for the ampere altra. So I would say that
> >> there is one cluster node level in DT that refers to the same MC/LLC
> >> level and only an additional child/parent cluster node should be used
> >> to fill the clustergroup_mask.
> >>
> >
> > Again I completely disagree. Let us look at the problems separately.
> > The hardware topology that some of the tools like lscpu and lstopo expects
> > what the hardware looks like and not the scheduler's view of the hardware.
> > So the topology masks that gets exposed to the user-space needs fixing
> > even today. I have reports from various tooling people about the same.
> > E.g. Juno getting exposed as dual socket system is utter non-sense.
> >
> > Yes scheduler uses most of the topology masks as is but that is not a must.
> > There are these *group_mask functions that can implement what scheduler
> > needs to be fed.
> >
> > I am not sure why the 2 issues are getting mixed up and that is the main
> > reason why I jumped into this to make sure the topology masks are
> > not tampered based on the way it needs to be used for scheduler.
>
> I'm all in favor of not mixing up those 2 issues. But I don't understand
> why you have to glue them together.
>

What are you referring to as 'glue them together'? As I said, this series
just addresses the hardware topology, and if there is any impact on sched
domains then it is to do with aligning the ACPI and DT platform behaviour.
I am not adding anything more to glue the topology and the info needed for
sched domains together.

> (1) DT systems broken in userspace (lstopo shows Juno with 2 Packages)
>

Correct.

> (2) Introduce CONFIG_SCHED_CLUSTER for DT systems
>

If that is a problem, you need to disable it for DT platforms. Not
supporting the proper hardware topology is not the way to work around the
issues with enabling CONFIG_SCHED_CLUSTER for DT systems IMO.

>
> (1) This can be solved with your patch-set w/o setting `(1. level)
> cpu-map cluster nodes`. The `socket nodes` taking over the
> functionality of the `cluster nodes` sorts out the `Juno is seen as
> having 2 packages`.
> This will make core_sibling not suitable for cpu_coregroup_mask()
> anymore. But this is OK since llc from cacheinfo (i.e. llc_sibling)
> takes over here.
> There is no need to involve `cluster nodes` anymore.
>

Again, you are just deferring introducing CONFIG_SCHED_CLUSTER on DT,
which is fine, but I don't agree with your approach.

> (2) This will only make sense for Armv9 L2 complexes if we connect `2.
> level cpu-map cluster nodes` with cluster_id and cluster_sibling.
> And only then clusters would mean the same thing in ACPI and DT.
> I guess this was mentioned already a couple of times.
>

Indeed. But I don't get what you mean by '2nd level' here. ACPI puts the
1st-level CPU nodes in the cluster mask. So we are just aligning with it on
DT platforms here. If you are saying that is an issue, please fix that for
ACPI too.

> > Both ACPI and DT on a platform must present exact same hardware topology
> > to the user-space, there is no space for argument there.
> >
> >> IIUC, we don't describe the dynamiQ level in ACPI which uses cache
> >> topology instead to define cpu_coregroup_mask whereas DT described the
> >> dynamiQ instead of using cache topology. If you use cache topology
> >> now, then you should skip the dynamiQ
> >>
> >
> > Yes, unless someone can work out a binding to represent that and convince
> > DT maintainers ;).
> >
> >> Finally, even if CLS and MC have the same scheduling behavior for now,
> >> they might ends up with different scheduling properties which would
> >> mean that replacing MC level by CLS one for current SoC would become
> >> wrong
> >>
> >
> > Again as I mentioned to Dietmar, that is something we can and must deal with
> > in those *group_mask and not expect topology mask to be altered to meet
> > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion
> > as the topology is already user-space visible and (tooling) people are
> > complaining that DT systems are broken and doesn't match ACPI systems.
> >
> > So unless someone gives me non-scheduler and topology specific reasons
> > to change that, sorry but my opinion on this matter is not going to change ;).
>
> `lstopo` is fine with a now correct /sys/.../topology/package_cpus (or
> core_siblings (old filename). It's not reading
> /sys/.../topology/cluster_cpus (yet) so why set it (wrongly) to 0x39 for
> CPU0 on Juno when it can stay 0x01?
>

On ACPI? If so, it could be the package ID in the ACPI table, which can be
invalid, in which case the kernel uses the table offset as the ID. It is
not ideal but doesn't affect the masks. The current value of 0 or 1 on Juno
is the cluster ID, and it contributes to wrong package cpumasks.


And yes, lstopo doesn't read cluster IDs. But we expose them on ACPI
systems and not on DT, which was my main point.

As pointed out earlier, have you checked ACPI on Juno, and with
CONFIG_SCHED_CLUSTER? If the behaviour with my series differs between DT
and ACPI, then it is an issue. But AFAIU it doesn't, and that is my main
argument. You are just assuming that what we have on Juno with DT is
correct, which may be so w.r.t. the scheduler but definitely not with
respect to the hardware topology exposed to the users. So my aim is to get
that fixed. If you are not happy with that, then how can you be happy with
the current behaviour on ACPI with and without CONFIG_SCHED_CLUSTER? I
haven't got your opinion on that matter yet.

--
Regards,
Sudeep

2022-06-14 18:11:32

by Vincent Guittot

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <[email protected]> wrote:
>
> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
> > >
>
> [...]
>
> > > Why ? Are you suggesting that we shouldn't present the hardware cluster
> > > to the topology because of the above reason ? If so, sorry that is not a
> > > valid reason. We could add login to return NULL or appropriate value
> > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't
> > > deal that in generic scheduler code. But the topology code can't be
> > > compromised for that reason as it is user visible.
> >
> > I tend to agree with Dietmar. The legacy use of cluster node in DT
> > refers to the dynamiQ or legacy b.L cluster which is also aligned to
> > the LLC and the MC scheduling level. The new cluster level that has
> > been introduced recently does not target this level but some
> > intermediate levels either inside like for the kupeng920 or the v9
> > complex or outside like for the ampere altra. So I would say that
> > there is one cluster node level in DT that refers to the same MC/LLC
> > level and only an additional child/parent cluster node should be used
> > to fill the clustergroup_mask.
> >
>
> Again I completely disagree. Let us look at the problems separately.
> The hardware topology that some of the tools like lscpu and lstopo expects
> what the hardware looks like and not the scheduler's view of the hardware.
> So the topology masks that gets exposed to the user-space needs fixing
> even today. I have reports from various tooling people about the same.
> E.g. Juno getting exposed as dual socket system is utter non-sense.
>
> Yes scheduler uses most of the topology masks as is but that is not a must.
> There are these *group_mask functions that can implement what scheduler
> needs to be fed.
>
> I am not sure why the 2 issues are getting mixed up and that is the main
> reason why I jumped into this to make sure the topology masks are
> not tampered based on the way it needs to be used for scheduler.
>
> Both ACPI and DT on a platform must present exact same hardware topology
> to the user-space, there is no space for argument there.

But that's exactly my point there:
ACPI doesn't show the dynamiQ level anywhere but only the llc, which
happens to be the same, and your patch makes the dynamiQ level visible for
DT in addition to the llc.

>
> > IIUC, we don't describe the dynamiQ level in ACPI which uses cache
> > topology instead to define cpu_coregroup_mask whereas DT described the
> > dynamiQ instead of using cache topology. If you use cache topology
> > now, then you should skip the dynamiQ
> >
>
> Yes, unless someone can work out a binding to represent that and convince
> DT maintainers ;).
>
> > Finally, even if CLS and MC have the same scheduling behavior for now,
> > they might ends up with different scheduling properties which would
> > mean that replacing MC level by CLS one for current SoC would become
> > wrong
> >
>
> Again as I mentioned to Dietmar, that is something we can and must deal with
> in those *group_mask and not expect topology mask to be altered to meet
> CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion
> as the topology is already user-space visible and (tooling) people are
> complaining that DT systems are broken and doesn't match ACPI systems.

Again, your proposal doesn't help here because the DT will show a
level that doesn't appear in ACPI.

>
> So unless someone gives me non-scheduler and topology specific reasons
> to change that, sorry but my opinion on this matter is not going to change ;).
>
> You will get this view of topology, find a way to manage with all those
> *group_mask functions. By the way it is already handled for ACPI systems,

AFAICT, no it's not: the cluster described in ACPI is not the dynamiQ
level that you now make visible on DT.

> so if you are not happy with that, then that needs fixing as this change
> set just aligns the behaviour on similar ACPI system. So the Juno example
> is incorrect for the reason that the behaviour of scheduler there is different
> with DT and ACPI.
>
> --
> Regards,
> Sudeep

2022-06-15 17:30:42

by Sudeep Holla

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

Please note that until we agree on a unified view of the hardware topology,
I will temporarily ignore any scheduler-domain-related issues/concerns, as
this thread/discussion is mixing up too much IMO. I am not ignoring
sched_domain concerns, but deferring them until we agree on the
user-visible hardware topology view; how that impacts the sched domain
topology can be considered soon after that.

On Tue, Jun 14, 2022 at 07:59:23PM +0200, Vincent Guittot wrote:
> On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <[email protected]> wrote:
> >
> > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> > > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
> > > >
> >
> > [...]
> >
> > > > Why ? Are you suggesting that we shouldn't present the hardware cluster
> > > > to the topology because of the above reason ? If so, sorry that is not a
> > > > valid reason. We could add login to return NULL or appropriate value
> > > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't
> > > > deal that in generic scheduler code. But the topology code can't be
> > > > compromised for that reason as it is user visible.
> > >
> > > I tend to agree with Dietmar. The legacy use of cluster node in DT
> > > refers to the dynamiQ or legacy b.L cluster which is also aligned to
> > > the LLC and the MC scheduling level. The new cluster level that has
> > > been introduced recently does not target this level but some
> > > intermediate levels either inside like for the kupeng920 or the v9
> > > complex or outside like for the ampere altra. So I would say that
> > > there is one cluster node level in DT that refers to the same MC/LLC
> > > level and only an additional child/parent cluster node should be used
> > > to fill the clustergroup_mask.
> > >
> >
> > Again I completely disagree. Let us look at the problems separately.
> > The hardware topology that some of the tools like lscpu and lstopo expects
> > what the hardware looks like and not the scheduler's view of the hardware.
> > So the topology masks that gets exposed to the user-space needs fixing
> > even today. I have reports from various tooling people about the same.
> > E.g. Juno getting exposed as dual socket system is utter non-sense.
> >
> > Yes scheduler uses most of the topology masks as is but that is not a must.
> > There are these *group_mask functions that can implement what scheduler
> > needs to be fed.
> >
> > I am not sure why the 2 issues are getting mixed up and that is the main
> > reason why I jumped into this to make sure the topology masks are
> > not tampered based on the way it needs to be used for scheduler.
> >
> > Both ACPI and DT on a platform must present exact same hardware topology
> > to the user-space, there is no space for argument there.
>
> But that's exactly my point there:
> ACPI doesn't show the dynamiQ level anywhere but only the llc which
> are the same and your patch makes the dynamiQ level visible for DT in
> addition to llc
>

Sorry if I am missing something obvious here, but neither ACPI nor DT has a
special representation for dynamiQ clusters, and hence it is impossible to
deduce that from either DT or ACPI. Can you provide some details or an
example of what you are referring to as dynamiQ? Also, what do you mean by
dynamiQ not being shown on ACPI while being shown on DT systems? If there
are any discrepancies, we need to fix them.

Now, what I refer to as a discrepancy, for example on Juno, is shown below
(values read from a subset of the per-CPU sysfs files):

cpu                     0      1    2    3      4      5
cluster_id             -1     -1   -1   -1     -1     -1
physical_package_id     1      0    0    1      1      1
cluster_cpus_list       0      1    2    3      4      5
package_cpus_list     0,3-5  1-2  1-2  0,3-5  0,3-5  0,3-5

The above one is for DT, which is wrong in all 4 of the entries above.
The one below is on ACPI, and after applying my series, on Juno.

cpu                     0      1    2    3      4      5
cluster_id              1      0    0    1      1      1
physical_package_id     0      0    0    0      0      0
cluster_cpus_list     0,3-5  1-2  1-2  0,3-5  0,3-5  0,3-5
package_cpus_list      0-5   0-5  0-5   0-5    0-5    0-5

This matches the expectation of the various userspace tools like lscpu,
lstopo, etc.

> >
> > > IIUC, we don't describe the dynamiQ level in ACPI which uses cache
> > > topology instead to define cpu_coregroup_mask whereas DT described the
> > > dynamiQ instead of using cache topology. If you use cache topology
> > > now, then you should skip the dynamiQ
> > >
> >
> > Yes, unless someone can work out a binding to represent that and convince
> > DT maintainers ;).
> >
> > > Finally, even if CLS and MC have the same scheduling behavior for now,
> > > they might ends up with different scheduling properties which would
> > > mean that replacing MC level by CLS one for current SoC would become
> > > wrong
> > >
> >
> > Again as I mentioned to Dietmar, that is something we can and must deal with
> > in those *group_mask and not expect topology mask to be altered to meet
> > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion
> > as the topology is already user-space visible and (tooling) people are
> > complaining that DT systems are broken and doesn't match ACPI systems.
>
> again, your proposal doesn't help here because the DT will show a
> level that doesn't appears in ACPI
>

Which level exactly? It matches exactly for Juno; the sysfs files are an
exact match after my changes. Again, don't bring the scheduler domains into
the argument here.

> >
> > So unless someone gives me non-scheduler and topology specific reasons
> > to change that, sorry but my opinion on this matter is not going to change ;).
> >
> > You will get this view of topology, find a way to manage with all those
> > *group_mask functions. By the way it is already handled for ACPI systems,
>
> AFAICT, no it's not, the cluster described in ACPI is not the dynamiQ
> level that you make now visible to DT

Again, no. There is no binding for a dynamiQ level in either DT or ACPI,
and hence there is no way it can become visible on DT. So I have no idea
why there is a thought process or assumption about the existence of a
dynamiQ level in the DT. It doesn't exist. If that is wrong, can you point
me to the bindings as well as an existing device tree? If you are referring
to the phantom domains Dietmar mentioned in earlier threads, then they
don't exist. They are made up, and one needs to get the bindings pushed
before we can address such a system.

--
Regards,
Sudeep

2022-06-15 23:17:09

by Vincent Guittot

Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Wed, 15 Jun 2022 at 19:01, Sudeep Holla <[email protected]> wrote:
>
> Please note until we agree on unified view for hardware topology, I will
> temporarily ignore any scheduler domain related issues/concerns as this
> thread/discussion is mixing up too much IMO. I am not ignoring sched_domain
> concerns, but deferring it until we agree on the hardware topology view
> which is user visible and how that impacts sched domain topology can be
> considered soon following that.

On my side, what I'm really interested in is the hardware topology
reported to the scheduler.

>
> On Tue, Jun 14, 2022 at 07:59:23PM +0200, Vincent Guittot wrote:
> > On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <[email protected]> wrote:
> > >
> > > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> > > > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
> > > > >
> > >
> > > [...]
> > >
> > > > > Why ? Are you suggesting that we shouldn't present the hardware cluster
> > > > > to the topology because of the above reason ? If so, sorry that is not a
> > > > > valid reason. We could add login to return NULL or appropriate value
> > > > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't
> > > > > deal that in generic scheduler code. But the topology code can't be
> > > > > compromised for that reason as it is user visible.
> > > >
> > > > I tend to agree with Dietmar. The legacy use of cluster node in DT
> > > > refers to the dynamiQ or legacy b.L cluster which is also aligned to
> > > > the LLC and the MC scheduling level. The new cluster level that has
> > > > been introduced recently does not target this level but some
> > > > intermediate levels either inside like for the kupeng920 or the v9
> > > > complex or outside like for the ampere altra. So I would say that
> > > > there is one cluster node level in DT that refers to the same MC/LLC
> > > > level and only an additional child/parent cluster node should be used
> > > > to fill the clustergroup_mask.
> > > >
> > >
> > > Again I completely disagree. Let us look at the problems separately.
> > > The hardware topology that some of the tools like lscpu and lstopo expects
> > > what the hardware looks like and not the scheduler's view of the hardware.
> > > So the topology masks that gets exposed to the user-space needs fixing
> > > even today. I have reports from various tooling people about the same.
> > > E.g. Juno getting exposed as dual socket system is utter non-sense.
> > >
> > > Yes scheduler uses most of the topology masks as is but that is not a must.
> > > There are these *group_mask functions that can implement what scheduler
> > > needs to be fed.
> > >
> > > I am not sure why the 2 issues are getting mixed up and that is the main
> > > reason why I jumped into this to make sure the topology masks are
> > > not tampered based on the way it needs to be used for scheduler.
> > >
> > > Both ACPI and DT on a platform must present exact same hardware topology
> > > to the user-space, there is no space for argument there.
> >
> > But that's exactly my point there:
> > ACPI doesn't show the dynamiQ level anywhere but only the llc which
> > are the same and your patch makes the dynamiQ level visible for DT in
> > addition to llc
> >
>
> Sorry if I am missing something obvious here, but both ACPI and DT has no
> special representation for dynamiQ clusters and hence it is impossible to
> deduce the same from either DT or ACPI. Can you provide some details
> or example as what you are referring as dynamiQ. Also what you mean by
> dynamiQ not shown on ACPI while shown with DT systems. If there is any
> discrepancies, we need to fix.
>
> Now, what I refer as discrepancy for example on Juno is below:
> (value read from a subset of per cpu sysfs files)
> cpu 0 1 2 3 4 5
> cluster_id -1 -1 -1 -1 -1 -1
> physical_package_id 1 0 0 1 1 1
> cluster_cpus_list 0 1 2 3 4 5
> package_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5
>
> The above one is for DT which is wrong in all the 4 entries above.
> The below one is on ACPI and after applying my series on Juno.
>
> cpu 0 1 2 3 4 5
> cluster_id 1 0 0 1 1 1
> physical_package_id 0 0 0 0 0 0
> cluster_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5
> package_cpus_list 0-5 0-5 0-5 0-5 0-5 0-5
>
> This matches the expectation from the various userspace tools like lscpu,
> lstopo,..etc.
>
> > >
> > > > IIUC, we don't describe the dynamiQ level in ACPI which uses cache
> > > > topology instead to define cpu_coregroup_mask whereas DT described the
> > > > dynamiQ instead of using cache topology. If you use cache topology
> > > > now, then you should skip the dynamiQ
> > > >
> > >
> > > Yes, unless someone can work out a binding to represent that and convince
> > > DT maintainers ;).
> > >
> > > > Finally, even if CLS and MC have the same scheduling behavior for now,
> > > > they might ends up with different scheduling properties which would
> > > > mean that replacing MC level by CLS one for current SoC would become
> > > > wrong
> > > >
> > >
> > > Again as I mentioned to Dietmar, that is something we can and must deal with
> > > in those *group_mask and not expect topology mask to be altered to meet
> > > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion
> > > as the topology is already user-space visible and (tooling) people are
> > > complaining that DT systems are broken and doesn't match ACPI systems.
> >
> > again, your proposal doesn't help here because the DT will show a
> > level that doesn't appears in ACPI
> >
>
> Which level exactly ? It matches exactly for Juno, the sysfs files are
> exact match after my changes. Again don't mix the scheduler domains for
> arguments here.
>
> > >
> > > So unless someone gives me non-scheduler and topology specific reasons
> > > to change that, sorry but my opinion on this matter is not going to change ;).
> > >
> > > You will get this view of topology, find a way to manage with all those
> > > *group_mask functions. By the way it is already handled for ACPI systems,
> >
> > AFAICT, no it's not, the cluster described in ACPI is not the dynamiQ
> > level that you make now visible to DT
>
> Again, no. There is no binding for dynamiQ level either in DT or ACPI and
> hence there is no way it can become visible on DT. So I have no idea why
> there is a thought process or assumption about existence of dynamiQ level
> in the DT. It doesn't exist. If that is wrong, can you point me to the
> bindings as well as existing device tree ? If you are referring to the
> phantom domains Dietmar mentioned in earlier threads, then they don't exist.
> It is made up and one need to get the bindings pushed before we can address
> such a system.
>
> --
> Regards,
> Sudeep

2022-06-15 23:32:31

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Wed, 15 Jun 2022 at 19:01, Sudeep Holla <[email protected]> wrote:
>
> Please note until we agree on unified view for hardware topology, I will
> temporarily ignore any scheduler domain related issues/concerns as this
> thread/discussion is mixing up too much IMO. I am not ignoring sched_domain
> concerns, but deferring it until we agree on the hardware topology view
> which is user visible and how that impacts sched domain topology can be
> considered soon following that.
>
> On Tue, Jun 14, 2022 at 07:59:23PM +0200, Vincent Guittot wrote:
> > On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <[email protected]> wrote:
> > >
> > > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> > > > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
> > > > >
> > >
> > > [...]
> > >
> > > > > Why ? Are you suggesting that we shouldn't present the hardware cluster
> > > > > to the topology because of the above reason ? If so, sorry that is not a
> > > > > valid reason. We could add logic to return NULL or an appropriate value
> > > > > in cpu_clustergroup_mask if it matches the MC level mask, if we can't
> > > > > deal with that in generic scheduler code. But the topology code can't be
> > > > > compromised for that reason as it is user visible.
> > > >
> > > > I tend to agree with Dietmar. The legacy use of cluster node in DT
> > > > refers to the dynamiQ or legacy b.L cluster which is also aligned to
> > > > the LLC and the MC scheduling level. The new cluster level that has
> > > > been introduced recently does not target this level but some
> > > > intermediate levels either inside like for the kunpeng920 or the v9
> > > > complex or outside like for the ampere altra. So I would say that
> > > > there is one cluster node level in DT that refers to the same MC/LLC
> > > > level and only an additional child/parent cluster node should be used
> > > > to fill the clustergroup_mask.
> > > >
> > >
> > > Again I completely disagree. Let us look at the problems separately.
> > > The hardware topology that some of the tools like lscpu and lstopo expects
> > > what the hardware looks like and not the scheduler's view of the hardware.
> > > So the topology masks that gets exposed to the user-space needs fixing
> > > even today. I have reports from various tooling people about the same.
> > > E.g. Juno getting exposed as dual socket system is utter non-sense.
> > >
> > > Yes scheduler uses most of the topology masks as is but that is not a must.
> > > There are these *group_mask functions that can implement what scheduler
> > > needs to be fed.
> > >
> > > I am not sure why the 2 issues are getting mixed up and that is the main
> > > reason why I jumped into this to make sure the topology masks are
> > > not tampered based on the way it needs to be used for scheduler.
> > >
> > > Both ACPI and DT on a platform must present exact same hardware topology
> > > to the user-space, there is no space for argument there.
> >
> > But that's exactly my point there:
> > ACPI doesn't show the dynamiQ level anywhere but only the llc which
> > are the same and your patch makes the dynamiQ level visible for DT in
> > addition to llc
> >
>
> Sorry if I am missing something obvious here, but neither ACPI nor DT has a
> special representation for dynamiQ clusters and hence it is impossible to
> deduce that from either DT or ACPI. Can you provide some details or an
> example of what you are referring to as dynamiQ? Also, what do you mean by
> dynamiQ not being shown on ACPI while being shown on DT systems? If there
> are any discrepancies, we need to fix them.

The cpu-map node in DT follows the dynamiQ or the legacy big.LITTLE
topology. As an example, the hikey6220 has 2 clusters, the hikey960 has
2 clusters that reflect big.LITTLE, and the sdm845 has one cluster that
reflects the dynamiQ cluster.

Now my mistake was to assume that core_sibling in arch_topology was
used to reflect the cores sharing a cache, but after looking more
deeply into the code it seems to be a lucky coincidence.
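
To make the distinction concrete: once cacheinfo (llc_sibling) is used,
the coregroup mask no longer has to rely on core_sibling at all. A
simplified sketch (not the exact upstream code, and the helper name is
made up):

#include <linux/arch_topology.h>
#include <linux/cpumask.h>

/*
 * Prefer the LLC siblings from cacheinfo when they are contained in the
 * package siblings parsed from DT/ACPI, otherwise fall back to
 * core_sibling.
 */
static const struct cpumask *sketch_coregroup_mask(int cpu)
{
	const cpumask_t *mask = &cpu_topology[cpu].core_sibling;

	if (cpu_topology[cpu].llc_id != -1 &&
	    cpumask_subset(&cpu_topology[cpu].llc_sibling, mask))
		mask = &cpu_topology[cpu].llc_sibling;

	return mask;
}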

>
> Now, what I refer as discrepancy for example on Juno is below:
> (value read from a subset of per cpu sysfs files)
> cpu                    0      1      2      3      4      5
> cluster_id            -1     -1     -1     -1     -1     -1
> physical_package_id    1      0      0      1      1      1
> cluster_cpus_list      0      1      2      3      4      5
> package_cpus_list    0,3-5   1-2    1-2   0,3-5  0,3-5  0,3-5
>
> The above one is for DT which is wrong in all the 4 entries above.
> The below one is on ACPI and after applying my series on Juno.
>
> cpu                    0      1      2      3      4      5
> cluster_id             1      0      0      1      1      1
> physical_package_id    0      0      0      0      0      0
> cluster_cpus_list    0,3-5   1-2    1-2   0,3-5  0,3-5  0,3-5
> package_cpus_list     0-5    0-5    0-5    0-5    0-5    0-5
>
> This matches the expectation from the various userspace tools like lscpu,
> lstopo,..etc.
>
> > >
> > > > IIUC, we don't describe the dynamiQ level in ACPI which uses cache
> > > > topology instead to define cpu_coregroup_mask whereas DT described the
> > > > dynamiQ instead of using cache topology. If you use cache topology
> > > > now, then you should skip the dynamiQ
> > > >
> > >
> > > Yes, unless someone can work out a binding to represent that and convince
> > > DT maintainers ;).
> > >
> > > > Finally, even if CLS and MC have the same scheduling behavior for now,
> > > > they might ends up with different scheduling properties which would
> > > > mean that replacing MC level by CLS one for current SoC would become
> > > > wrong
> > > >
> > >
> > > Again as I mentioned to Dietmar, that is something we can and must deal with
> > > in those *group_mask and not expect topology mask to be altered to meet
> > > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion
> > > as the topology is already user-space visible and (tooling) people are
> > > complaining that DT systems are broken and doesn't match ACPI systems.
> >
> > again, your proposal doesn't help here because the DT will show a
> > level that doesn't appear in ACPI
> >
>
> Which level exactly ? It matches exactly for Juno, the sysfs files are
> exact match after my changes. Again don't mix the scheduler domains for
> arguments here.
>
> > >
> > > So unless someone gives me non-scheduler and topology specific reasons
> > > to change that, sorry but my opinion on this matter is not going to change ;).
> > >
> > > You will get this view of topology, find a way to manage with all those
> > > *group_mask functions. By the way it is already handled for ACPI systems,
> >
> > AFAICT, no it's not, the cluster described in ACPI is not the dynamiQ
> > level that you make now visible to DT
>
> Again, no. There is no binding for dynamiQ level either in DT or ACPI and
> hence there is no way it can become visible on DT. So I have no idea why
> there is a thought process or assumption about existence of dynamiQ level
> in the DT. It doesn't exist. If that is wrong, can you point me to the
> bindings as well as existing device tree ? If you are referring to the
> phantom domains Dietmar mentioned in earlier threads, then they don't exist.
> It is made up and one need to get the bindings pushed before we can address
> such a system.
>
> --
> Regards,
> Sudeep

2022-06-16 16:20:04

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On 13/06/2022 12:17, Sudeep Holla wrote:
> On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
>> On 10/06/2022 12:27, Sudeep Holla wrote:
>>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
>>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:

[...]

>>> Again I completely disagree. Let us look at the problems separately.
>>> The hardware topology that some of the tools like lscpu and lstopo expects
>>> what the hardware looks like and not the scheduler's view of the hardware.
>>> So the topology masks that gets exposed to the user-space needs fixing
>>> even today. I have reports from various tooling people about the same.
>>> E.g. Juno getting exposed as dual socket system is utter non-sense.
>>>
>>> Yes scheduler uses most of the topology masks as is but that is not a must.
>>> There are these *group_mask functions that can implement what scheduler
>>> needs to be fed.
>>>
>>> I am not sure why the 2 issues are getting mixed up and that is the main
>>> reason why I jumped into this to make sure the topology masks are
>>> not tampered based on the way it needs to be used for scheduler.
>>
>> I'm all in favor of not mixing up those 2 issues. But I don't understand
>> why you have to glue them together.
>>
>
> What are you referring as 'glue them together'. As I said this series just
> address the hardware topology and if there is any impact on sched domains
> then it is do with alignment with ACPI and DT platform behaviour. I am not
> adding anything more to glue topology and info needed for sched domains.

You can fix (1) without (2) parsing 1. level cluster nodes as
cluster_siblings.

>> (1) DT systems broken in userspace (lstopo shows Juno with 2 Packages)
>>
>
> Correct.
>
>> (2) Introduce CONFIG_SCHED_CLUSTER for DT systems
>>
>
> If that is a problem, you need to disable it for DT platforms. Not
> supporting proper hardware topology is not the way to workaround the
> issues enabling CONFIG_SCHED_CLUSTER for DT systems IMO.
>
>>
>> (1) This can be solved with your patch-set w/o setting `(1. level)
>> cpu-map cluster nodes`. The `socket nodes` taking over the
>> functionality of the `cluster nodes` sorts out the `Juno is seen as
>> having 2 packages`.
>> This will make core_sibling not suitable for cpu_coregroup_mask()
>> anymore. But this is OK since llc from cacheinfo (i.e. llc_sibling)
>> takes over here.
>> There is no need to involve `cluster nodes` anymore.
>>
>
> Again you are just deferring introducing CONFIG_SCHED_CLUSTER on DT
> which is fine but I don't agree with your approach.
>
>> (2) This will only make sense for Armv9 L2 complexes if we connect `2.
>> level cpu-map cluster nodes` with cluster_id and cluster_sibling.
>> And only then clusters would mean the same thing in ACPI and DT.
>> I guess this was mentioned already a couple of times.
>>
>
> Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level

cpu_map {
    socket0 {
        cluster0 {             <-- 1. level cluster
            cluster0 {         <-- 2. level cluster (3 -->)
                core0 {

                };
                core1 {

                };
            cluster1 {
                ...

Armv9 L2 complexes: e.g. QC SM8450:

      .---------------.
CPU   |0 1 2 3 4 5 6 7|
      +---------------+
uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
      +---------------+
L2    |   |   | | | | | <-- 2. level cluster, Armv9 L2 complexes (<-- 3)
      +---------------+
L3    |<--         -->|
      +---------------+
      |<-- cluster -->|
      +---------------+
      |<--   DSU   -->|
      '---------------'

Only if we map (i) into cluster_sibling, we get the same hardware
representation (for the task scheduler) for ACPI (4) and DT (5) systems.

(4) examples:

Kunpeng920 - 24 CPUs sharing LLC (cpu_coregroup_mask()), 4 CPUs sharing
L3-tag (cpu_clustergroup_mask()).

X86 Jacobsville - 24 CPUs sharing LLC (L3), 4 CPUs sharing L2

Armv9 L2 complexes: e.g. QC SM8450 - 8 CPUs sharing LLC (L3), (for A510
(little CPUs)) 2 CPUs sharing L2

> cpu nodes in cluster mask. So we are just aligning to it on DT platforms
> here. So if you are saying that is an issue, please fix that for ACPI too.
>

[...]

>>> So unless someone gives me non-scheduler and topology specific reasons
>>> to change that, sorry but my opinion on this matter is not going to change ;).
>>
>> `lstopo` is fine with a now correct /sys/.../topology/package_cpus (or
>> core_siblings, the old filename). It's not reading
>> /sys/.../topology/cluster_cpus (yet), so why set it (wrongly) to 0x39 for
>> CPU0 on Juno when it can stay 0x01?
>>
>
> On ACPI ? If so, it could be the package ID in the ACPI table which can be
> invalid and kernel use the table offset as ID. It is not ideal but doesn't
> affect the masks. The current value 0 or 1 on Juno is cluster ID and they
> contribute to wrong package cpu masks.
>
>
> And yes lstopo doesn't read cluster IDs. But we expose them in ACPI system
> and not on DT which was my main point.

Understood. But a Kunpeng920 `cluster_cpus_list` file would contain
logically different information than a Juno `cluster_cpus_list` file.

Kunpeng920 `cluster_cpus_list` contains 4 CPUs sharing the L3-tag (less
than the LLC) whereas Juno `cluster_cpus_list` contains 2 or 4 CPUs
(which is the LLC).

I think the key is to agree on what a CLUSTER actually represents,
especially in the case when `the level of topology above CPUs` is
congruent with the LLC boundaries. Because my feeling is that people
didn't pay attention to this detail when they introduced
CONFIG_SCHED_CLUSTER. A related issue is the Ampere Altra hack in
cpu_coregroup_mask().

> As pointed out earlier, have you checked ACPI on Juno and with
> CONFIG_SCHED_CLUSTER ? If the behaviour with my series on DT and ACPI
> differs, then it is an issue. But AFAIU, it doesn't and that is my main
> argument. You are just assuming what we have on Juno with DT is correct
> which may be w.r.t to scheduler but definitely not with respect to the
> hardware topology exposed to the users. So my aim is to get that fixed.

I have never run Juno w/ ACPI. Are you saying that
find_acpi_cpu_topology_cluster() returns cpu_topology[cpu].cluster_id's
which match the `1. level cluster nodes`?

The function header of find_acpi_cpu_topology_cluster() says that `...
the cluster, if present is the level of topology above CPUs. ...`.

From this perspective I can see your point. But this is then still
clearly poorly designed. How would we ever support CONFIG_SCHED_CLUSTER
in DT when it really (potentially) would bring a benefit (i.e. in the
Armv9 L2-complex case) and not only create trouble for the task
scheduler to set up its domains correctly? Also, in case we stick to
setting CONFIG_SCHED_CLUSTER=1 by default, CLS should be the default LLC
sched domain and MC the exceptional one, just to respect the way the
task scheduler removes non-useful domains today.

> > If you are not happy with that, then how can we be happy with what is the
> current behaviour on ACPI + and - CONFIG_SCHED_CLUSTER. I haven't got
> your opinion yet on that matter.

I guess it's clear now that ACPI + CONFIG_SCHED_CLUSTER, with `the level
of topology above CPUs` congruent with the LLC, creates trouble for the
scheduler. So I don't see why we should replicate this for DT. Let's
discuss further tomorrow in person.

[...]




2022-06-17 11:58:57

by Sudeep Holla

[permalink] [raw]
Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Thu, Jun 16, 2022 at 05:02:28PM +0100, Dietmar Eggemann wrote:
> On 13/06/2022 12:17, Sudeep Holla wrote:
> > On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
> >> On 10/06/2022 12:27, Sudeep Holla wrote:
> >>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> >>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
>
> [...]
>
> >>> Again I completely disagree. Let us look at the problems separately.
> >>> The hardware topology that some of the tools like lscpu and lstopo expects
> >>> what the hardware looks like and not the scheduler's view of the hardware.
> >>> So the topology masks that gets exposed to the user-space needs fixing
> >>> even today. I have reports from various tooling people about the same.
> >>> E.g. Juno getting exposed as dual socket system is utter non-sense.
> >>>
> >>> Yes scheduler uses most of the topology masks as is but that is not a must.
> >>> There are these *group_mask functions that can implement what scheduler
> >>> needs to be fed.
> >>>
> >>> I am not sure why the 2 issues are getting mixed up and that is the main
> >>> reason why I jumped into this to make sure the topology masks are
> >>> not tampered based on the way it needs to be used for scheduler.
> >>
> >> I'm all in favor of not mixing up those 2 issues. But I don't understand
> >> why you have to glue them together.
> >>
> >
> > What are you referring as 'glue them together'. As I said this series just
> > address the hardware topology and if there is any impact on sched domains
> > then it is do with alignment with ACPI and DT platform behaviour. I am not
> > adding anything more to glue topology and info needed for sched domains.
>
> You can fix (1) without (2) parsing 1. level cluster nodes as
> cluster_siblings.
>

Technically yes, but I see no point in delaying it: it has been broken since
the moment ACPI started exposing the correct value while at the same time
DT ended up exposing an incorrect one. I am referring to the same change
that introduced SCHED_CLUSTER. The damage is done and it needs repairing
ASAP.

> > Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level
>
> cpu_map {
> socket0 {
> cluster0 { <-- 1. level cluster
> cluster0 { <-- 2. level cluster (3 -->)

Oh, I had misunderstood this level nomenclature. I refer to it as the leaf
cluster node, which is the 2. level cluster in this DT snippet.

> core0 {
>
> };
> core1 {
>
> };
> cluster1 {
> ...
>
> Armv9 L2 complexes: e.g. QC SM8450:
>
> .---------------.
> CPU |0 1 2 3 4 5 6 7|
> +---------------+
> uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
> +---------------+
> L2 | | | | | | | <-- 2. level cluster, Armv9 L2 complexes (<-- 3)

Again, before I assume anything, what exactly do `<-- 3` here and in the
above snippet mean?

> +---------------+
> L3 |<-- -->|
> +---------------+
> |<-- cluster -->|

I think this is the 2. level cluster, or rather the only cluster in this
system w.r.t. the hardware. So let's not confuse things with multi-level
if it is not necessary.

> +---------------+
> |<-- DSU -->|
> '---------------'
>
> Only if we map (i) into cluster_sibling, we get the same hardware
> representation (for the task scheduler) for ACPI (4) and DT (5) systems.
>

What is (i) above ?

> (4) examples:
>
> Kunpeng920 - 24 CPUs sharing LLC (cpu_coregroup_mask()), 4 CPUs sharing
> L3-tag (cpu_clustergroup_mask()).
>

Again, decouple the cache info and the cluster info from the h/w; you have
all the info. You can couple them together if that helps when you feed the
sched_domains.

> X86 Jacobsville - 24 CPUs sharing LLC (L3), 4 CPUs sharing L2
>
> Armv9 L2 complexes: e.g. QC SM8450 - 8 CPUs sharing LLC (L3), (for A510
> (little CPUs)) 2 CPUs sharing L2

[...]

> > And yes lstopo doesn't read cluster IDs. But we expose them in ACPI system
> > and not on DT which was my main point.
>
> Understood. But a Kunpeng920 `cluster_cpus_list` file would contain
> logically different information than a Juno `cluster_cpus_list` file.
>

And why is that ?

> Kunpeng920 `cluster_cpus_list` contain 4 CPUs sharing L3-tag (less than
> LLC) whereas Juno cluster_cpus_list contain 2 or 4 CPUs (which is LLC).
>

Correct because that is how the hardware clusters are designed on those SoC.
Cache topology is different.

> I think key is to agree what a CLUSTER actually represent and especially
> in case when `the level of topology above CPUs` is congruent with LLC
> boundaries. Because my feeling is that people didn't pay attention to
> this detail when they introduced SCHED_CONFIG_CLUSTER. A related issue
> is the Ampere Altra hack in cpu_coregroup_mask().
>

That is defined by the hardware, and DT/ACPI have bindings for it. We can't
redefine CLUSTER in a way that breaks those definitions. Again, the cluster
is part of the CPU topology, and the cache topology can be different; it
need not be congruent with the CPU topology in some ways. Of course they
are congruent in some other ways, but there will be no single line of
alignment across SoCs, which is quite evident from the examples you have
listed. Just add the Ampere Altra to make it more fun.
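
To be concrete about that split, this is roughly how it already looks in
the kernel's data structures: CPU topology ids/siblings on one side,
cacheinfo on the other (abridged excerpts from memory, so treat field
names and ordering as approximate):

/* include/linux/arch_topology.h (abridged): the CPU topology view */
struct cpu_topology {
	int thread_id;
	int core_id;
	int cluster_id;
	int package_id;
	int llc_id;
	cpumask_t thread_sibling;
	cpumask_t core_sibling;
	cpumask_t cluster_sibling;
	cpumask_t llc_sibling;
};

/* include/linux/cacheinfo.h (abridged): the cache topology view */
struct cacheinfo {
	unsigned int id;
	enum cache_type type;
	unsigned int level;
	cpumask_t shared_cpu_map;
	/* ... */
};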

> > As pointed out earlier, have you checked ACPI on Juno and with
> > CONFIG_SCHED_CLUSTER ? If the behaviour with my series on DT and ACPI
> > differs, then it is an issue. But AFAIU, it doesn't and that is my main
> > argument. You are just assuming what we have on Juno with DT is correct
> > which may be w.r.t to scheduler but definitely not with respect to the
> > hardware topology exposed to the users. So my aim is to get that fixed.
>
> I never run Juno w/ ACPI. Are you saying that
> find_acpi_cpu_topology_cluster() returns cpu_topology[cpu].cluster_id's
> which match the `1. level cluster nodes`?
>

Again, I am totally confused as to why this is now the 1. level cluster
whereas above the SDM was the 2. level cluster. Both SoCs have only 1 level
of cluster. While the SDM has 1 DSU cluster, Juno has 2 clusters.

> The function header of find_acpi_cpu_topology_cluster() says that `...
> the cluster, if present is the level of topology above CPUs. ...`.
>

Exactly and that's how sysfs is also defined and we can't go back and change
that now. I don't see any issue TBH.

> From this perspective I can see your point. But this is then still
> clearly poorly designed.

Not really as per the definition.

> How would we ever support CONFIG_SCHED_CLUSTER
> in DT when it really (potentially) would bring a benefit (i.e. in the
> Armv9 L2-complex case) and not only create trouble for the task
> scheduler to setup its domains correctly?

Indeed, that is the next problem once we get all these aligned across
DT and ACPI. They have diverged, and I prefer not to allow that anymore
by adding more divergence, e.g. to support the Armv9 L2-complex case.
Please consider that on top of these; I am not addressing that at the
moment. In fact I am not addressing any sched_domain topics or issues. I
have made that clear :).

> Also in case we stick to
> setting CONFIG_SCHED_CLUSTER=1 by default, CLS should be the default LLC
> sched domain and MC the exceptional one. Just to respect the way how the
> task scheduler removes not useful domains today.
>

Fix the cpu_clustergroup_mask or any other cpu_*group_mask as per your
taste. The topology masks are just inputs to these and will not be changed
or diverged for these reasons. Sorry if that is not helpful, but that is the
reality with sysfs exposed to the user-space.

> > If you are not happy with that, then how can we be happy with what is the
> > current behaviour on ACPI + and - CONFIG_SCHED_CLUSTER. I haven't got
> > your opinion yet on that matter.
>
> I guess it's clear now that ACPI + CONFIG_SCHED_CLUSTER with ``the level
> of topology above CPUs` is congruent with LLC` creates trouble to the
> scheduler. So I don't see why we should replicate this for DT. Let's
> discuss further tomorrow in person.

I see it differently. If that creates trouble, fix it and you will not
have any issues with DT either.

--
Regards,
Sudeep

2022-06-20 14:43:08

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On 17/06/2022 13:16, Sudeep Holla wrote:
> On Thu, Jun 16, 2022 at 05:02:28PM +0100, Dietmar Eggemann wrote:
>> On 13/06/2022 12:17, Sudeep Holla wrote:
>>> On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
>>>> On 10/06/2022 12:27, Sudeep Holla wrote:
>>>>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
>>>>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:

[...]

>>> What are you referring as 'glue them together'. As I said this series just
>>> address the hardware topology and if there is any impact on sched domains
>>> then it is do with alignment with ACPI and DT platform behaviour. I am not
>>> adding anything more to glue topology and info needed for sched domains.
>>
>> You can fix (1) without (2) parsing 1. level cluster nodes as
>> cluster_siblings.
>>
>
> Technically yes, but I see no point in delaying it as it is considered as
> broken with respect to the moment ACPI exposed the correct value and at the
> same time resulted in exposing incorrect value in case of DT. I am referring
> to the same change that introduced SCHED_CLUSTER. The damage is done and it
> needs repairing ASAP.

OK, then let's have `/sys/.../topology/cluster_cpus` refer to pure
topology-based cluster information. This can be DT cluster-node
information or ACPI L3-tag information, e.g. for KP920.

>>> Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level
>>
>> cpu_map {
>> socket0 {
>> cluster0 { <-- 1. level cluster
>> cluster0 { <-- 2. level cluster (3 -->)
>
> Oh I had misunderstood this level nomenclature, I refer it as leaf cluster
> node which is 2. level cluster in this DT snippet.
>
>> core0 {
>>
>> };
>> core1 {
>>
>> };
>> cluster1 {
>> ...
>>
>> Armv9 L2 complexes: e.g. QC SM8450:
>>
>> .---------------.
>> CPU |0 1 2 3 4 5 6 7|
>> +---------------+
>> uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
>> +---------------+
>> L2 | | | | | | | <-- 2. level cluster, Armv9 L2 complexes (<-- 3)
>
> Again before I assume, what exactly <--3 here and in above snippet mean ?

I wanted to show that we could encode `2. level cluster` as `Armv9 L2
complexes`. But since we agreed (after the email was sent) not to
support `nested cluster-nodes`, this idea does not hold anymore.

>> +---------------+
>> L3 |<-- -->|
>> +---------------+
>> |<-- cluster -->|
>
> I think this is 2 level cluster or only cluster in this system w.r.t hardware.
> So lets not confuse with multi-level if not necessary.

No need, we said no `nested cluster-node` support in DT.

>> +---------------+
>> |<-- DSU -->|
>> '---------------'
>>
>> Only if we map (i) into cluster_sibling, we get the same hardware
>> representation (for the task scheduler) for ACPI (4) and DT (5) systems.
>>
>
> What is (i) above ?

Sorry, (i) was meant to be `3 -->`.

>> (4) examples:
>>
>> Kunpeng920 - 24 CPUs sharing LLC (cpu_coregroup_mask()), 4 CPUs sharing
>> L3-tag (cpu_clustergroup_mask()).
>>
>
> Again decouple cache info and cluster info from h/w, you have all the info.
> You can couple them together if that helps when you feed sched_domains.

OK, this is what we finally agreed.

>> X86 Jacobsville - 24 CPUs sharing LLC (L3), 4 CPUs sharing L2
>>
>> Armv9 L2 complexes: e.g. QC SM8450 - 8 CPUs sharing LLC (L3), (for A510
>> (little CPUs)) 2 CPUs sharing L2
>
> [...]
>
>>> And yes lstopo doesn't read cluster IDs. But we expose them in ACPI system
>>> and not on DT which was my main point.

OK, no further discussion needed here. `/sys/.../topology/cluster_cpus`
shows L2 (the LLC) on Juno, the L3-tag on KP920, or L2 on X86 Jacobsville.
The cpu_xxx_mask()s of (e.g.) default_topology[] have to make sure that
the scheduler sees the correct information (the hierarchy of cpumasks).
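
For reference, that default table looks roughly like this (abridged from
kernel/sched/topology.c, so treat it as a sketch rather than an exact
quote). Whatever these per-CPU mask callbacks return is what the
scheduler builds its domains from, independent of what /sys exposes:

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};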

>> Understood. But a Kunpeng920 `cluster_cpus_list` file would contain
>> logically different information than a Juno `cluster_cpus_list` file.
>>
> And why is that ?

If we see it from the angle of the definition of CONFIG_SCHED_CLUSTER
(`... the level of topology above CPUs ...`) then I can see that both
definitions are the same. (CPUs should rather be `cores` here, I guess?)

[...]

>>> As pointed out earlier, have you checked ACPI on Juno and with
>>> CONFIG_SCHED_CLUSTER ? If the behaviour with my series on DT and ACPI
>>> differs, then it is an issue. But AFAIU, it doesn't and that is my main
>>> argument. You are just assuming what we have on Juno with DT is correct
>>> which may be w.r.t to scheduler but definitely not with respect to the
>>> hardware topology exposed to the users. So my aim is to get that fixed.
>>
>> I never run Juno w/ ACPI. Are you saying that
>> find_acpi_cpu_topology_cluster() returns cpu_topology[cpu].cluster_id's
>> which match the `1. level cluster nodes`?
>>
>
> Again I am totally confused as why this is now 1.level cluster where as above
> SDM was 2.level cluster. Both SoCs have only 1 level of cluster. While SDM
> has 1 DSU cluster, Juno has 2 clusters.

No need to agree on what `level` (could) stand(s) for here. We said `no
nested cluster-node`.

>> The function header of find_acpi_cpu_topology_cluster() says that `...
>> the cluster, if present is the level of topology above CPUs. ...`.
>>
>
> Exactly and that's how sysfs is also defined and we can't go back and change
> that now. I don't see any issue TBH.

OK.

>> From this perspective I can see your point. But this is then still
>> clearly poorly designed.
>
> Not really as per the definition.

Not from the viewpoint of topology and cacheinfo, but from the scheduler's
viewpoint it is, and the whole thing gets activated by a scheduler CONFIG
option.

>> How would we ever support CONFIG_SCHED_CLUSTER
>> in DT when it really (potentially) would bring a benefit (i.e. in the
>> Armv9 L2-complex case) and not only create trouble for the task
>> scheduler to setup its domains correctly?
>
> Indeed, that is the next problem once we get all these aligned across
> DT and ACPI. They have diverged and I prefer not to allow that anymore
> by adding more divergence e.g. to support Armv9 L2-complex case. Please
> consider that on top of these, I am not addressing that at the moment.
> In fact I am not addressing any sched_domain topics or issues. I have made
> that clear :).

If I recall the content of our discussion correctly, `Armv9 L2
complexes` support could come from L2 (probably `LLC - 1` ?) detection
from cacheinfo. But this is not part of this patch-set.

>> Also in case we stick to
>> setting CONFIG_SCHED_CLUSTER=1 by default, CLS should be the default LLC
>> sched domain and MC the exceptional one. Just to respect the way how the
>> task scheduler removes not useful domains today.
>
> Fix the cpu_clustergroup_mask or any other cpu_*group_mask as per your
> taste. The topology masks are just inputs to these and will not be changed
> or diverged for these reasons. Sorry if that is not helpful, but that is the
> reality with sysfs exposed to the user-space.

Agreed. We don't have to rely on the task scheduler to build the right
sched domain hierarchy out of the cpu_xxx_mask() functions as they are.
We can tweak cpu_xxx_mask() to get this done.
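
Something along these lines is what I have in mind (purely a sketch;
the helper name and the exact condition are placeholders, not a final
implementation):

#include <linux/arch_topology.h>
#include <linux/cpumask.h>

/*
 * Only feed a CLS span to the scheduler when it is smaller than the
 * coregroup (LLC) span; otherwise collapse it to the single CPU so the
 * degenerate level gets removed, while /sys keeps showing the raw
 * hardware view.
 */
static const struct cpumask *sketch_clustergroup_mask(int cpu)
{
	if (cpumask_subset(cpu_coregroup_mask(cpu),
			   &cpu_topology[cpu].cluster_sibling))
		return get_cpu_mask(cpu);

	return &cpu_topology[cpu].cluster_sibling;
}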

> >>> If you are not happy with that, then how can we be happy with what is the
>>> current behaviour on ACPI + and - CONFIG_SCHED_CLUSTER. I haven't got
>>> your opinion yet on that matter.
>>
>> I guess it's clear now that ACPI + CONFIG_SCHED_CLUSTER with ``the level
>> of topology above CPUs` is congruent with LLC` creates trouble to the
>> scheduler. So I don't see why we should replicate this for DT. Let's
>> discuss further tomorrow in person.
>
> I see it differently. If that creates a trouble, fix that and you will not
> have any issues with DT too.

OK, I think we (arm64) found a way to support a default
CONFIG_SCHED_CLUSTER and fix the `Juno lstopo` issue, with a possible
way to include `Armv9 L2 complexes` support via cacheinfo later.

2022-06-21 16:04:59

by Sudeep Holla

[permalink] [raw]
Subject: Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Mon, Jun 20, 2022 at 03:27:33PM +0200, Dietmar Eggemann wrote:
> On 17/06/2022 13:16, Sudeep Holla wrote:
> > On Thu, Jun 16, 2022 at 05:02:28PM +0100, Dietmar Eggemann wrote:
> >> On 13/06/2022 12:17, Sudeep Holla wrote:
> >>> On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
> >>>> On 10/06/2022 12:27, Sudeep Holla wrote:
> >>>>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> >>>>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <[email protected]> wrote:
>
> [...]
>
> >>> What are you referring as 'glue them together'. As I said this series just
> >>> address the hardware topology and if there is any impact on sched domains
> >>> then it is do with alignment with ACPI and DT platform behaviour. I am not
> >>> adding anything more to glue topology and info needed for sched domains.
> >>
> >> You can fix (1) without (2) parsing 1. level cluster nodes as
> >> cluster_siblings.
> >>
> >
> > Technically yes, but I see no point in delaying it as it is considered as
> > broken with respect to the moment ACPI exposed the correct value and at the
> > same time resulted in exposing incorrect value in case of DT. I am referring
> > to the same change that introduced SCHED_CLUSTER. The damage is done and it
> > needs repairing ASAP.
>
> OK, then lets `/sys/.../topology/cluster_cpus` refer to pure
> topology-based cluster information. This can be DT cluster-node
> information or ACPI L3-tag information e.g. for KP920.
>

Agreed. All the information under /sys/.../topology/ is the hardware view
and not the scheduler's view of the hardware for the purpose of building
sched domains.
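
For completeness, this is all I mean by the hardware view: a tiny
user-space sketch (purely illustrative, not part of the series) that
dumps exactly the per-CPU files quoted earlier in this thread:

#include <stdio.h>

static void dump(int cpu, const char *file)
{
	char path[128], buf[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, file);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("cpu%d %-20s %s", cpu, file, buf);
	fclose(f);
}

int main(void)
{
	static const char * const files[] = {
		"cluster_id", "physical_package_id",
		"cluster_cpus_list", "package_cpus_list",
	};

	for (int cpu = 0; cpu < 6; cpu++)	/* e.g. the 6 CPUs on Juno */
		for (unsigned int i = 0; i < 4; i++)
			dump(cpu, files[i]);

	return 0;
}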

> >>> Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level
> >>
> >> cpu_map {
> >> socket0 {
> >> cluster0 { <-- 1. level cluster
> >> cluster0 { <-- 2. level cluster (3 -->)
> >
> > Oh I had misunderstood this level nomenclature, I refer it as leaf cluster
> > node which is 2. level cluster in this DT snippet.
> >
> >> core0 {
> >>
> >> };
> >> core1 {
> >>
> >> };
> >> cluster1 {
> >> ...
> >>
> >> Armv9 L2 complexes: e.g. QC SM8450:
> >>
> >> .---------------.
> >> CPU |0 1 2 3 4 5 6 7|
> >> +---------------+
> >> uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
> >> +---------------+
> >> L2 | | | | | | | <-- 2. level cluster, Armv9 L2 complexes (<-- 3)
> >
> > Again before I assume, what exactly <--3 here and in above snippet mean ?
>
> I wanted to show that we could encode `2. level cluster` as `Armv9 L2
> complexes`. But since we agreed (after the email was sent) not to
> support `nested cluster-nodes`, this idea does not hold anymore.
>

Yes, I plan to throw a warning if we encounter nested clusters; it will be
part of the next version (https://git.kernel.org/sudeep.holla/c/94fae12d64bb).
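
Roughly along these lines: a simplified sketch only (the helper name is
made up and the actual change is in the link above), which just looks
for a nested "cluster0" child while walking the cpu-map:

#include <linux/of.h>
#include <linux/printk.h>

static bool __init dt_cluster_has_nested_cluster(struct device_node *cluster)
{
	/* cpu-map cluster nodes are numbered from 0, so cluster0 is enough */
	struct device_node *c = of_get_child_by_name(cluster, "cluster0");

	if (!c)
		return false;

	of_node_put(c);
	pr_warn("%pOF: nested clusters are not supported\n", cluster);
	return true;
}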

> >> +---------------+
> >> L3 |<-- -->|
> >> +---------------+
> >> |<-- cluster -->|
> >
> > I think this is 2 level cluster or only cluster in this system w.r.t hardware.
> > So lets not confuse with multi-level if not necessary.
>
> No need, we said no `nested cluster-node` support in DT.
>

:)

> >> +---------------+
> >> |<-- DSU -->|
> >> '---------------'
> >>
> >> Only if we map (i) into cluster_sibling, we get the same hardware
> >> representation (for the task scheduler) for ACPI (4) and DT (5) systems.
> >>
> >
> > What is (i) above ?
>
> Sorry, (i) was meant to be `3 -->`.
>

Ah ok

> >> (4) examples:
> >>
> >> Kunpeng920 - 24 CPUs sharing LLC (cpu_coregroup_mask()), 4 CPUs sharing
> >> L3-tag (cpu_clustergroup_mask()).
> >>
> >
> > Again decouple cache info and cluster info from h/w, you have all the info.
> > You can couple them together if that helps when you feed sched_domains.
>
> OK, this is what we finally agreed.
>

:)

> >> X86 Jacobsville - 24 CPUs sharing LLC (L3), 4 CPUs sharing L2
> >>
> >> Armv9 L2 complexes: e.g. QC SM8450 - 8 CPUs sharing LLC (L3), (for A510
> >> (little CPUs)) 2 CPUs sharing L2
> >
> > [...]
> >
> >>> And yes lstopo doesn't read cluster IDs. But we expose them in ACPI system
> >>> and not on DT which was my main point.
>
> OK, no further discussion needed here. `/sys/.../topology/cluster_cpus`
> shows L2 (LLC) on Juno, L3-tag an KP920 or L2 on X86 Jacobsville. The
> cpu_xxx_mask()s of (e.g.) default_topology[] have to make sure that the
> scheduler sees the correct information (the hierarchy of cpumasks).
>

Correct

> >> Understood. But a Kunpeng920 `cluster_cpus_list` file would contain
> >> logically different information than a Juno `cluster_cpus_list` file.
> >>
> > And why is that ?
>
> If we see it from the angle of the definition of SCHED_CONFIG_CLUSTER
> (`... the level of topology above CPUs ...` then I can see that both
> definitions are the same. (CPUS should be rather `cores` here, I guess?).
>
> [...]
>

Yes, I have used cores and CPUs interchangeably at several places so far;
I will try to be more specific in the future.

> >>> As pointed out earlier, have you checked ACPI on Juno and with
> >>> CONFIG_SCHED_CLUSTER ? If the behaviour with my series on DT and ACPI
> >>> differs, then it is an issue. But AFAIU, it doesn't and that is my main
> >>> argument. You are just assuming what we have on Juno with DT is correct
> >>> which may be w.r.t to scheduler but definitely not with respect to the
> >>> hardware topology exposed to the users. So my aim is to get that fixed.
> >>
> >> I never run Juno w/ ACPI. Are you saying that
> >> find_acpi_cpu_topology_cluster() returns cpu_topology[cpu].cluster_id's
> >> which match the `1. level cluster nodes`?
> >>
> >
> > Again I am totally confused as why this is now 1.level cluster where as above
> > SDM was 2.level cluster. Both SoCs have only 1 level of cluster. While SDM
> > has 1 DSU cluster, Juno has 2 clusters.
>
> No need in agreeing what level (could) stand(s) here for. We said `no
> nested cluster-node`.
>

:)

> >> The function header of find_acpi_cpu_topology_cluster() says that `...
> >> the cluster, if present is the level of topology above CPUs. ...`.
> >>
> >
> > Exactly and that's how sysfs is also defined and we can't go back and change
> > that now. I don't see any issue TBH.
>
> OK.
>
> >> From this perspective I can see your point. But this is then still
> >> clearly poorly designed.
> >
> > Not really as per the definition.
>
> Not from the viewpoint of topology and cacheinfo. But from the scheduler
> and the whole thing gets activated by a scheduler CONFIG option.
>

Agreed and I too think it must be enabled by default.

> >> How would we ever support CONFIG_SCHED_CLUSTER
> >> in DT when it really (potentially) would bring a benefit (i.e. in the
> >> Armv9 L2-complex case) and not only create trouble for the task
> >> scheduler to setup its domains correctly?
> >
> > Indeed, that is the next problem once we get all these aligned across
> > DT and ACPI. They have diverged and I prefer not to allow that anymore
> > by adding more divergence e.g. to support Armv9 L2-complex case. Please
> > consider that on top of these, I am not addressing that at the moment.
> > In fact I am not addressing any sched_domain topics or issues. I have made
> > that clear :).
>
> If I recall the content of our discussion correctly, `Armv9 L2
> complexes` support could come from L2 (probably `LLC - 1` ?) detection
> from cacheinfo. But this is not part of this patch-set.
>

Correct and thanks for making that clear here. I have intentionally not
mentioned it so far as I am not addressing anything specific to such a
topology.

> >> Also in case we stick to
> >> setting CONFIG_SCHED_CLUSTER=1 by default, CLS should be the default LLC
> >> sched domain and MC the exceptional one. Just to respect the way how the
> >> task scheduler removes not useful domains today.
> >
> > Fix the cpu_clustergroup_mask or any other cpu_*group_mask as per your
> > taste. The topology masks are just inputs to these and will not be changed
> > or diverged for these reasons. Sorry if that is not helpful, but that is the
> > reality with sysfs exposed to the user-space.
>
> Agreed. We don't have to rely on the task scheduler to build the right
> sched domain hierarchy out of every cpu_xxx_mask() functions. We can
> tweak cpu_xxx_mask() to get this done.
>

Correct. I am not saying it is as simple as that, but the idea is to keep
the hardware topology separate from the info fed to the scheduler, i.e. the
info as viewed from the sched_domains perspective. The former changes with
the hardware while the latter changes with the scheduler's view. Based on
the logic it uses to optimise, the scheduler can try different views of the
same hardware, but that's not where we want to be. We want to provide a
generic view based on all the h/w CPU and cache topology information. It
may not always be optimal, but we can try our best.

> > >>> If you are not happy with that, then how can we be happy with what is the
> >>> current behaviour on ACPI + and - CONFIG_SCHED_CLUSTER. I haven't got
> >>> your opinion yet on that matter.
> >>
> >> I guess it's clear now that ACPI + CONFIG_SCHED_CLUSTER with ``the level
> >> of topology above CPUs` is congruent with LLC` creates trouble to the
> >> scheduler. So I don't see why we should replicate this for DT. Let's
> >> discuss further tomorrow in person.
> >
> > I see it differently. If that creates a trouble, fix that and you will not
> > have any issues with DT too.
>
> OK, I think we (arm64) found a way to support a default
> CONFIG_SCHED_CLUSTER and fixing the `Juno lstopo` issue with a possible
> way to include `Armv9 L2 complexes` support via cacheinfo later.

Thanks for all the discussion. I hope we won't have to revisit this topic
in such depth until another topology that breaks all our current assumptions
arrives in the wild, which may not be too long from now.

I remember talking with Morten a couple of years back and concluding that
the LLC is sufficient for arm64, but now that can be questionable with the
Armv9 L2 complexes.

--
Regards,
Sudeep