2022-05-18 09:45:07

by Sudeep Holla

Subject: [PATCH v2 0/8] arch_topology: Updates to add socket support and fix cluster ids

Hi All,

This series intends to fix some discrepancies we have in the CPU topology
parsing from the device tree /cpu-map node. This also diverges from the
behaviour on an ACPI enabled platform. The expectation is that both DT
and ACPI enabled systems must present a consistent view of the CPU topology.

Currently we assign the generated cluster count as the physical package
identifier for each CPU, which is wrong. The device tree bindings for CPU
topology support sockets to infer the socket or physical package identifier
for a given CPU. Also, we don't check whether all the cores/threads belong
to the same cluster before updating their sibling masks, which is fine only
because we don't set the cluster id yet.

These changes also assign the cluster identifier as parsed from the device
tree cluster nodes within /cpu-map, without support for nesting of the
clusters. Finally, the series also adds support for socket nodes in
/cpu-map. With this, the parsing of the exact same information from ACPI
PPTT and the /cpu-map DT node aligns well.

The only exception is that the last level cache id information can be
inferred directly from the same ACPI PPTT, while on DT we need to parse
the CPU cache nodes for it.
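
For reference, the shape of a /cpu-map node with an explicit socket level,
as parsed by this series, looks roughly as below. This is a minimal
illustrative snippet; the node names and phandles are made up, see the DT
CPU topology bindings for the full format:

	cpus {
		cpu-map {
			socket0 {
				cluster0 {
					core0 {
						cpu = <&cpu0>;
					};
					core1 {
						cpu = <&cpu1>;
					};
				};
				cluster1 {
					core0 {
						cpu = <&cpu2>;
					};
				};
			};
		};
	};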

P.S: I have not cc-ed Greg and Rafael so that all the users of arch_topology
can agree on the changes first before we include them.

v1[1]->v2:
- Updated the ID validity check to include all non-negative values
- Added support to get the device node for the CPU's last level cache
- Added support to build llc_sibling on DT platforms

[1] https://lore.kernel.org/lkml/[email protected]

Sudeep Holla (8):
arch_topology: Don't set cluster identifier as physical package identifier
arch_topology: Set thread sibling cpumask only within the cluster
arch_topology: Set cluster identifier in each core/thread from /cpu-map
arch_topology: Add support for parsing sockets in /cpu-map
arch_topology: Check for non-negative value rather than -1 for IDs validity
arch_topology: Avoid parsing through all the CPUs once an outlier CPU is found
of: base: add support to get the device node for the CPU's last level cache
arch_topology: Add support to build llc_sibling on DT platforms

drivers/base/arch_topology.c | 75 +++++++++++++++++++++++++++--------
drivers/of/base.c | 33 +++++++++++----
include/linux/arch_topology.h | 1 +
include/linux/of.h | 1 +
4 files changed, 85 insertions(+), 25 deletions(-)

--
2.36.1



2022-05-18 09:45:17

by Sudeep Holla

Subject: [PATCH v2 2/8] arch_topology: Set thread sibling cpumask only within the cluster

Currently the cluster identifier is not set on the DT based platforms.
The reset or default value is -1 for all the CPUs. Once we assign the
cluster identifier values correctly, that may result in getting the
thread siblings wrong, as the core identifiers can be the same for 2
different CPUs belonging to 2 different clusters.

So, in order to get the thread sibling cpumasks correct, we need to
update them only if the cores they belong to are in the same cluster
within the socket. Let us skip updating the thread sibling cpumasks if
the cluster identifier doesn't match.

This change won't affect anything even if the cluster identifiers are not
set currently, but it will avoid any breakage once we start setting them
correctly.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 44f733b365cc..7f5aa655c1f4 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -697,15 +697,17 @@ void update_siblings_masks(unsigned int cpuid)
if (cpuid_topo->package_id != cpu_topo->package_id)
continue;

- if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
- cpuid_topo->cluster_id != -1) {
+ cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
+ cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
+
+ if (cpuid_topo->cluster_id != cpu_topo->cluster_id)
+ continue;
+
+ if (cpuid_topo->cluster_id != -1) {
cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
}

- cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
- cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
-
if (cpuid_topo->core_id != cpu_topo->core_id)
continue;

--
2.36.1


2022-05-18 09:45:20

by Sudeep Holla

Subject: [PATCH v2 1/8] arch_topology: Don't set cluster identifier as physical package identifier

Currently, as we parse the CPU topology from the /cpu-map node in the
device tree, we assign the generated cluster count as the physical package
identifier for each CPU, which is wrong.

The device tree bindings for CPU topology support sockets to infer the
socket or physical package identifier for a given CPU. Since this is
fairly new and not supported on most of the old and existing systems, we
can assume all such systems have a single socket/physical package.

Fix the physical package identifier to 0 by removing the assignment of
the cluster identifier to it.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index f73b836047cf..44f733b365cc 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -543,7 +543,6 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
bool leaf = true;
bool has_cores = false;
struct device_node *c;
- static int package_id __initdata;
int core_id = 0;
int i, ret;

@@ -582,7 +581,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
}

if (leaf) {
- ret = parse_core(c, package_id, core_id++);
+ ret = parse_core(c, 0, core_id++);
} else {
pr_err("%pOF: Non-leaf cluster with core %s\n",
cluster, name);
@@ -599,9 +598,6 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
if (leaf && !has_cores)
pr_warn("%pOF: empty cluster\n", cluster);

- if (leaf)
- package_id++;
-
return 0;
}

--
2.36.1


2022-05-18 09:45:22

by Sudeep Holla

Subject: [PATCH v2 3/8] arch_topology: Set cluster identifier in each core/thread from /cpu-map

Let us set the cluster identifier as parsed from the device tree
cluster nodes within /cpu-map.

We don't support nesting of clusters yet as there is no real hardware
supporting clusters of clusters.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 7f5aa655c1f4..bdb6f2a17df0 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -491,7 +491,7 @@ static int __init get_cpu_for_node(struct device_node *node)
}

static int __init parse_core(struct device_node *core, int package_id,
- int core_id)
+ int cluster_id, int core_id)
{
char name[20];
bool leaf = true;
@@ -507,6 +507,7 @@ static int __init parse_core(struct device_node *core, int package_id,
cpu = get_cpu_for_node(t);
if (cpu >= 0) {
cpu_topology[cpu].package_id = package_id;
+ cpu_topology[cpu].cluster_id = cluster_id;
cpu_topology[cpu].core_id = core_id;
cpu_topology[cpu].thread_id = i;
} else if (cpu != -ENODEV) {
@@ -528,6 +529,7 @@ static int __init parse_core(struct device_node *core, int package_id,
}

cpu_topology[cpu].package_id = package_id;
+ cpu_topology[cpu].cluster_id = cluster_id;
cpu_topology[cpu].core_id = core_id;
} else if (leaf && cpu != -ENODEV) {
pr_err("%pOF: Can't get CPU for leaf core\n", core);
@@ -537,7 +539,8 @@ static int __init parse_core(struct device_node *core, int package_id,
return 0;
}

-static int __init parse_cluster(struct device_node *cluster, int depth)
+static int __init
+parse_cluster(struct device_node *cluster, int cluster_id, int depth)
{
char name[20];
bool leaf = true;
@@ -557,7 +560,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
c = of_get_child_by_name(cluster, name);
if (c) {
leaf = false;
- ret = parse_cluster(c, depth + 1);
+ ret = parse_cluster(c, i, depth + 1);
of_node_put(c);
if (ret != 0)
return ret;
@@ -581,7 +584,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
}

if (leaf) {
- ret = parse_core(c, 0, core_id++);
+ ret = parse_core(c, 0, cluster_id, core_id++);
} else {
pr_err("%pOF: Non-leaf cluster with core %s\n",
cluster, name);
@@ -621,7 +624,7 @@ static int __init parse_dt_topology(void)
if (!map)
goto out;

- ret = parse_cluster(map, 0);
+ ret = parse_cluster(map, -1, 0);
if (ret != 0)
goto out_map;

--
2.36.1


2022-05-18 09:45:28

by Sudeep Holla

Subject: [PATCH v2 4/8] arch_topology: Add support for parsing sockets in /cpu-map

Finally, let us add support for socket nodes in /cpu-map in the device
tree. Since these may not be present on all the old platforms, and even on
most of the existing platforms, we need to assume that the absence of the
socket node indicates a single socket system and handle it appropriately.

Also, it is likely that most single socket systems skip adding the socket
node since it is optional.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 37 +++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index bdb6f2a17df0..77aab5fea46b 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -539,8 +539,8 @@ static int __init parse_core(struct device_node *core, int package_id,
return 0;
}

-static int __init
-parse_cluster(struct device_node *cluster, int cluster_id, int depth)
+static int __init parse_cluster(struct device_node *cluster, int package_id,
+ int cluster_id, int depth)
{
char name[20];
bool leaf = true;
@@ -560,7 +560,7 @@ parse_cluster(struct device_node *cluster, int cluster_id, int depth)
c = of_get_child_by_name(cluster, name);
if (c) {
leaf = false;
- ret = parse_cluster(c, i, depth + 1);
+ ret = parse_cluster(c, package_id, i, depth + 1);
of_node_put(c);
if (ret != 0)
return ret;
@@ -584,7 +584,8 @@ parse_cluster(struct device_node *cluster, int cluster_id, int depth)
}

if (leaf) {
- ret = parse_core(c, 0, cluster_id, core_id++);
+ ret = parse_core(c, package_id, cluster_id,
+ core_id++);
} else {
pr_err("%pOF: Non-leaf cluster with core %s\n",
cluster, name);
@@ -604,6 +605,32 @@ parse_cluster(struct device_node *cluster, int cluster_id, int depth)
return 0;
}

+static int __init parse_socket(struct device_node *socket)
+{
+ char name[20];
+ struct device_node *c;
+ bool has_socket = false;
+ int package_id = 0, ret;
+
+ do {
+ snprintf(name, sizeof(name), "socket%d", package_id);
+ c = of_get_child_by_name(socket, name);
+ if (c) {
+ has_socket = true;
+ ret = parse_cluster(c, package_id, -1, 0);
+ of_node_put(c);
+ if (ret != 0)
+ return ret;
+ }
+ package_id++;
+ } while (c);
+
+ if (!has_socket)
+ ret = parse_cluster(socket, 0, -1, 0);
+
+ return ret;
+}
+
static int __init parse_dt_topology(void)
{
struct device_node *cn, *map;
@@ -624,7 +651,7 @@ static int __init parse_dt_topology(void)
if (!map)
goto out;

- ret = parse_cluster(map, -1, 0);
+ ret = parse_socket(map);
if (ret != 0)
goto out_map;

--
2.36.1


2022-05-18 09:45:30

by Sudeep Holla

Subject: [PATCH v2 6/8] arch_topology: Avoid parsing through all the CPUs once an outlier CPU is found

There is no point in looping through all the CPUs' physical package
identifiers to check whether they are valid once a CPU which is outside
the topology (i.e. an outlier CPU) is found.

Let us just break out of the loop early in such a case.

Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index be99a78f5b31..6d3346efe74b 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -662,8 +662,10 @@ static int __init parse_dt_topology(void)
* only mark cores described in the DT as possible.
*/
for_each_possible_cpu(cpu)
- if (cpu_topology[cpu].package_id < 0)
+ if (cpu_topology[cpu].package_id < 0) {
ret = -EINVAL;
+ break;
+ }

out_map:
of_node_put(map);
--
2.36.1


2022-05-18 09:45:32

by Sudeep Holla

Subject: [PATCH v2 5/8] arch_topology: Check for non-negative value rather than -1 for IDs validity

Instead of just comparing the cpu topology IDs with -1 to check their
validity, improve that by checking for a valid non-negative value.

Suggested-by: Andy Shevchenko <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 77aab5fea46b..be99a78f5b31 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -662,7 +662,7 @@ static int __init parse_dt_topology(void)
* only mark cores described in the DT as possible.
*/
for_each_possible_cpu(cpu)
- if (cpu_topology[cpu].package_id == -1)
+ if (cpu_topology[cpu].package_id < 0)
ret = -EINVAL;

out_map:
@@ -688,7 +688,7 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
/* not numa in package, lets use the package siblings */
core_mask = &cpu_topology[cpu].core_sibling;
}
- if (cpu_topology[cpu].llc_id != -1) {
+ if (cpu_topology[cpu].llc_id >= 0) {
if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
core_mask = &cpu_topology[cpu].llc_sibling;
}
@@ -719,7 +719,8 @@ void update_siblings_masks(unsigned int cpuid)
for_each_online_cpu(cpu) {
cpu_topo = &cpu_topology[cpu];

- if (cpu_topo->llc_id != -1 && cpuid_topo->llc_id == cpu_topo->llc_id) {
+ if (cpu_topo->llc_id >= 0 &&
+ cpuid_topo->llc_id == cpu_topo->llc_id) {
cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
}
@@ -733,7 +734,7 @@ void update_siblings_masks(unsigned int cpuid)
if (cpuid_topo->cluster_id != cpu_topo->cluster_id)
continue;

- if (cpuid_topo->cluster_id != -1) {
+ if (cpuid_topo->cluster_id >= 0) {
cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
}
--
2.36.1


2022-05-18 09:45:32

by Sudeep Holla

Subject: [PATCH v2 7/8] of: base: add support to get the device node for the CPU's last level cache

It is useful to have a helper function just to get the pointer to the
device node of the last level cache for a given logical cpu. It can then
be used as a unique firmware identifier for the last level cache.

This is useful for obtaining the cpumask/cpumap of all the CPUs sharing
the last level cache, using the device node pointer as the unique
identifier for the last level cache.
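
As a rough usage sketch (illustrative only; it assumes the returned node
carries a reference that the caller must drop), two CPUs share their last
level cache exactly when the helper returns the same device node for both:

	struct device_node *llc0 = of_find_last_level_cache_node(0);
	struct device_node *llc1 = of_find_last_level_cache_node(1);

	/* same firmware node => CPU0 and CPU1 share their last level cache */
	if (llc0 && llc0 == llc1)
		pr_info("CPU0 and CPU1 share their LLC\n");

	of_node_put(llc0);
	of_node_put(llc1);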

Cc: Rob Herring <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/of/base.c | 33 +++++++++++++++++++++++++--------
include/linux/of.h | 1 +
2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index d4f98c8469ed..0b6a8c3f9a85 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -2072,17 +2072,17 @@ struct device_node *of_find_next_cache_node(const struct device_node *np)
}

/**
- * of_find_last_cache_level - Find the level at which the last cache is
- * present for the given logical cpu
+ * of_find_last_level_cache_node - Find the device node at which the last
+ * cache is present for the given logical cpu
*
- * @cpu: cpu number(logical index) for which the last cache level is needed
+ * @cpu: cpu number(logical index) for which the last cache level's device
+ * node is needed
*
- * Return: The the level at which the last cache is present. It is exactly
- * same as the total number of cache levels for the given logical cpu.
+ * Return: The device node corresponding to the last cache for the given
+ * logical cpu
*/
-int of_find_last_cache_level(unsigned int cpu)
+struct device_node *of_find_last_level_cache_node(unsigned int cpu)
{
- u32 cache_level = 0;
struct device_node *prev = NULL, *np = of_cpu_device_node_get(cpu);

while (np) {
@@ -2091,7 +2091,24 @@ int of_find_last_cache_level(unsigned int cpu)
np = of_find_next_cache_node(np);
}

- of_property_read_u32(prev, "cache-level", &cache_level);
+ return prev;
+}
+
+/**
+ * of_find_last_cache_level - Find the level at which the last cache is
+ * present for the given logical cpu
+ *
+ * @cpu: cpu number(logical index) for which the last cache level is needed
+ *
+ * Return: The level at which the last cache is present. It is exactly
+ * same as the total number of cache levels for the given logical cpu.
+ */
+int of_find_last_cache_level(unsigned int cpu)
+{
+ u32 cache_level = 0;
+ struct device_node *np = of_find_last_level_cache_node(cpu);
+
+ of_property_read_u32(np, "cache-level", &cache_level);

return cache_level;
}
diff --git a/include/linux/of.h b/include/linux/of.h
index 04971e85fbc9..ca0384cf08a3 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -297,6 +297,7 @@ extern struct device_node *of_get_child_by_name(const struct device_node *node,

/* cache lookup */
extern struct device_node *of_find_next_cache_node(const struct device_node *);
+extern struct device_node *of_find_last_level_cache_node(unsigned int cpu);
extern int of_find_last_cache_level(unsigned int cpu);
extern struct device_node *of_find_node_with_property(
struct device_node *from, const char *prop_name);
--
2.36.1


2022-05-18 09:47:39

by Sudeep Holla

Subject: [PATCH v2 8/8] arch_topology: Add support to build llc_sibling on DT platforms

ACPI PPTT provides cache identifiers and especially the last level cache
identifier is used in obtaining last level cache siblings amongst CPUs.

While we have the cpu map representing all the CPUs sharing last level
cache in the cacheinfo driver, it is populated quite late in the boot
while the information is needed to build scheduler domains quite early.

On DT platforms we can use the pointer to the last level cache as the
firmware identifier for the last level cache and build the cpumap sharing
the last level cache based on the same.

Cc: Rob Herring <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
---
drivers/base/arch_topology.c | 18 ++++++++++++++----
include/linux/arch_topology.h | 1 +
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 6d3346efe74b..bc57f0f1862e 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -661,11 +661,14 @@ static int __init parse_dt_topology(void)
* Check that all cores are in the topology; the SMP code will
* only mark cores described in the DT as possible.
*/
- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
if (cpu_topology[cpu].package_id < 0) {
ret = -EINVAL;
break;
}
+ cpu_topology[cpu].llc_fw_node =
+ of_find_last_level_cache_node(cpu);
+ }

out_map:
of_node_put(map);
@@ -681,6 +684,12 @@ static int __init parse_dt_topology(void)
struct cpu_topology cpu_topology[NR_CPUS];
EXPORT_SYMBOL_GPL(cpu_topology);

+#define IS_VALID_LLC_ID(x) \
+ ((x)->llc_id >= 0 || (x)->llc_fw_node)
+#define IS_MATCH_LLC_ID(x, y) \
+ (((x)->llc_id >= 0 && (x)->llc_id == (y)->llc_id) || \
+ ((x)->llc_fw_node && (x)->llc_fw_node == (y)->llc_fw_node))
+
const struct cpumask *cpu_coregroup_mask(int cpu)
{
const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
@@ -690,7 +699,8 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
/* not numa in package, lets use the package siblings */
core_mask = &cpu_topology[cpu].core_sibling;
}
- if (cpu_topology[cpu].llc_id >= 0) {
+
+ if (IS_VALID_LLC_ID(&cpu_topology[cpu])) {
if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
core_mask = &cpu_topology[cpu].llc_sibling;
}
@@ -721,8 +731,7 @@ void update_siblings_masks(unsigned int cpuid)
for_each_online_cpu(cpu) {
cpu_topo = &cpu_topology[cpu];

- if (cpu_topo->llc_id >= 0 &&
- cpuid_topo->llc_id == cpu_topo->llc_id) {
+ if (IS_MATCH_LLC_ID(cpu_topo, cpuid_topo)) {
cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
}
@@ -777,6 +786,7 @@ void __init reset_cpu_topology(void)
cpu_topo->cluster_id = -1;
cpu_topo->package_id = -1;
cpu_topo->llc_id = -1;
+ cpu_topo->llc_fw_node = NULL;

clear_cpu_topology(cpu);
}
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 58cbe18d825c..d8a36b0e27c9 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -69,6 +69,7 @@ struct cpu_topology {
int cluster_id;
int package_id;
int llc_id;
+ void *llc_fw_node;
cpumask_t thread_sibling;
cpumask_t core_sibling;
cpumask_t cluster_sibling;
--
2.36.1


2022-05-20 08:16:05

by Ionela Voinescu

Subject: Re: [PATCH v2 3/8] arch_topology: Set cluster identifier in each core/thread from /cpu-map

Hi,

As said before, this creates trouble for CONFIG_SCHED_CLUSTER=y.
The output below is obtained from Juno.

When cluster_id is populated, a new CLS level is created by the scheduler
topology code. In this case the clusters in DT determine that the cluster
siblings and llc siblings are the same so the MC scheduler domain will
be removed and, for Juno, only CLS and DIE will be kept.

root@debian-arm64-buster:/sys/kernel/debug/sched/domains/cpu1# grep . */*
domain0/busy_factor:16
domain0/cache_nice_tries:1
domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain0/imbalance_pct:117
domain0/max_interval:4
domain0/max_newidle_lb_cost:14907
domain0/min_interval:2
domain0/name:CLS
domain1/busy_factor:16
domain1/cache_nice_tries:1
domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_PREFER_SIBLING
domain1/imbalance_pct:117
domain1/max_interval:12
domain1/max_newidle_lb_cost:11858
domain1/min_interval:6
domain1/name:DIE

To be noted that we also get a new flag SD_PREFER_SIBLING for the CLS
level that is not appropriate. We usually remove it for the child of a
SD_ASYM_CPUCAPACITY domain, but we don't currently redo this after some
levels are degenerated. This is a fixable issue.

But looking at the bigger picture, a good question is what is the best
thing to do when cluster domains and llc domains span the same CPUs?

Possibly it would be best to restrict clusters (which are almost an
arbitrary concept) to always span a subset of CPUs of the llc domain,
if llc siblings can be obtained? If those clusters are not properly set
up in DT to respect this condition, cluster_siblings would need to be
cleared (or set to the current CPU) so the CLS domain is not created at
all.
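
A minimal sketch of that clearing idea (hypothetical code, not a posted
patch; the names follow the existing cpu_topology[] fields) could be:

	struct cpu_topology *topo = &cpu_topology[cpu];

	/*
	 * If the DT cluster is not a strict subset of the LLC domain,
	 * collapse cluster_sibling to just this CPU so that no CLS
	 * sched domain gets built on top of MC.
	 */
	if (!cpumask_subset(&topo->cluster_sibling, &topo->llc_sibling) ||
	    cpumask_equal(&topo->cluster_sibling, &topo->llc_sibling))
		cpumask_copy(&topo->cluster_sibling, cpumask_of(cpu));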

Additionally, should we use cluster information from DT (cluster_id) to
create an MC level if we don't have llc information, even if
CONFIG_SCHED_CLUSTER=n?

I currently don't have a very clear picture of how cluster domains and
llc domains would "live" together in a variety of topologies. I'll try
other DT topologies to see if there are others that can lead to trouble.

Thanks,
Ionela.

On Wednesday 18 May 2022 at 10:33:20 (+0100), Sudeep Holla wrote:
> Let us set the cluster identifier as parsed from the device tree
> cluster nodes within /cpu-map.
>
> We don't support nesting of clusters yet as there is no real hardware
> supporting clusters of clusters.
>
> Signed-off-by: Sudeep Holla <[email protected]>
> ---
> drivers/base/arch_topology.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 7f5aa655c1f4..bdb6f2a17df0 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -491,7 +491,7 @@ static int __init get_cpu_for_node(struct device_node *node)
> }
>
> static int __init parse_core(struct device_node *core, int package_id,
> - int core_id)
> + int cluster_id, int core_id)
> {
> char name[20];
> bool leaf = true;
> @@ -507,6 +507,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> cpu = get_cpu_for_node(t);
> if (cpu >= 0) {
> cpu_topology[cpu].package_id = package_id;
> + cpu_topology[cpu].cluster_id = cluster_id;
> cpu_topology[cpu].core_id = core_id;
> cpu_topology[cpu].thread_id = i;
> } else if (cpu != -ENODEV) {
> @@ -528,6 +529,7 @@ static int __init parse_core(struct device_node *core, int package_id,
> }
>
> cpu_topology[cpu].package_id = package_id;
> + cpu_topology[cpu].cluster_id = cluster_id;
> cpu_topology[cpu].core_id = core_id;
> } else if (leaf && cpu != -ENODEV) {
> pr_err("%pOF: Can't get CPU for leaf core\n", core);
> @@ -537,7 +539,8 @@ static int __init parse_core(struct device_node *core, int package_id,
> return 0;
> }
>
> -static int __init parse_cluster(struct device_node *cluster, int depth)
> +static int __init
> +parse_cluster(struct device_node *cluster, int cluster_id, int depth)
> {
> char name[20];
> bool leaf = true;
> @@ -557,7 +560,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
> c = of_get_child_by_name(cluster, name);
> if (c) {
> leaf = false;
> - ret = parse_cluster(c, depth + 1);
> + ret = parse_cluster(c, i, depth + 1);
> of_node_put(c);
> if (ret != 0)
> return ret;
> @@ -581,7 +584,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
> }
>
> if (leaf) {
> - ret = parse_core(c, 0, core_id++);
> + ret = parse_core(c, 0, cluster_id, core_id++);
> } else {
> pr_err("%pOF: Non-leaf cluster with core %s\n",
> cluster, name);
> @@ -621,7 +624,7 @@ static int __init parse_dt_topology(void)
> if (!map)
> goto out;
>
> - ret = parse_cluster(map, 0);
> + ret = parse_cluster(map, -1, 0);
> if (ret != 0)
> goto out_map;
>
> --
> 2.36.1
>
>

2022-05-20 22:32:09

by Rob Herring (Arm)

Subject: Re: [PATCH v2 8/8] arch_topology: Add support to build llc_sibling on DT platforms

On Fri, May 20, 2022 at 01:59:59PM +0100, Sudeep Holla wrote:
> On Thu, May 19, 2022 at 01:10:51PM -0500, Rob Herring wrote:
> > On Wed, May 18, 2022 at 4:34 AM Sudeep Holla <[email protected]> wrote:
> > >
> > > ACPI PPTT provides cache identifiers and especially the last level cache
> > > identifier is used in obtaining last level cache siblings amongst CPUs.
> > >
> > > While we have the cpu map representing all the CPUs sharing last level
> > > cache in the cacheinfo driver, it is populated quite late in the boot
> > > while the information is needed to build scheduler domains quite early.
> >
> > Late is because it's a device_initcall() rather than late in the cpu
> > hotplug state machine, right?
>
> Right. The expectation is to run it on each online CPU in the CPU hotplug
> state machine for some architectures. We may not need that on arm64
> especially since we get all the info from DT or ACPI, but e.g. x86 uses
> cpuid which needs to be executed on that CPU.

That's a separate issue. I'm not suggesting changing that part (that
would just be an optimization).

> > The late aspect is for sysfs presumably, but I think we could decouple that.
>
> OK, not sure when this sched_domain info is actually needed. I think it
> could be decoupled if we can wait until all the cpus are online.

No need to wait for all cpus to be online. I think you keep doing
it as part of cpu hotplug. The device_initcall() is used because you
cannot have struct device or sysfs calls before the driver core is
initialized. If we run the cacheinfo code earlier (I think the arch code
will have to call it) just like the topology code and skip the sysfs
parts, then you can use it.

> > Do all the firmware cache parsing early and then populate the sysfs parts
> > later.
>
> Yes that may work on DT/ACPI based systems, as I said x86 relies on cpuid.

I'd assume using cpuid works at any time?

> > It's not a unique problem as the DT unflattening and init code has to
> > do the same thing. I'd assume the hotplug and cpu sysfs devices have
> > to deal with the same thing.
> >
>
> OK, I will take a look at how to do that.
>
> --
> Regards,
> Sudeep

2022-05-21 07:33:12

by Ionela Voinescu

Subject: Re: [PATCH v2 0/8] arch_topology: Updates to add socket support and fix cluster ids

Hi Sudeep,

On Wednesday 18 May 2022 at 10:33:17 (+0100), Sudeep Holla wrote:
> Hi All,
>
> This series intends to fix some discrepancies we have in the CPU topology
> parsing from the device tree /cpu-map node. This also diverges from the
> behaviour on an ACPI enabled platform. The expectation is that both DT
> and ACPI enabled systems must present a consistent view of the CPU topology.
>
> Currently we assign the generated cluster count as the physical package
> identifier for each CPU, which is wrong. The device tree bindings for CPU
> topology support sockets to infer the socket or physical package identifier
> for a given CPU. Also, we don't check whether all the cores/threads belong
> to the same cluster before updating their sibling masks, which is fine only
> because we don't set the cluster id yet.
>
> These changes also assign the cluster identifier as parsed from the device
> tree cluster nodes within /cpu-map, without support for nesting of the
> clusters. Finally, the series also adds support for socket nodes in
> /cpu-map. With this, the parsing of the exact same information from ACPI
> PPTT and the /cpu-map DT node aligns well.
>
> The only exception is that the last level cache id information can be
> inferred directly from the same ACPI PPTT, while on DT we need to parse
> the CPU cache nodes for it.
>
> P.S: I have not cc-ed Greg and Rafael so that all the users of arch_topology
> can agree on the changes first before we include them.
>
> v1[1]->v2:
> - Updated the ID validity check to include all non-negative values
> - Added support to get the device node for the CPU's last level cache
> - Added support to build llc_sibling on DT platforms
>
> [1] https://lore.kernel.org/lkml/[email protected]
>
> Sudeep Holla (8):
> arch_topology: Don't set cluster identifier as physical package identifier
> arch_topology: Set thread sibling cpumask only within the cluster
> arch_topology: Set cluster identifier in each core/thread from /cpu-map
> arch_topology: Add support for parsing sockets in /cpu-map
> arch_topology: Check for non-negative value rather than -1 for IDs validity
> arch_topology: Avoid parsing through all the CPUs once an outlier CPU is found
> of: base: add support to get the device node for the CPU's last level cache
> arch_topology: Add support to build llc_sibling on DT platforms
>

Just a recommendation for patch-set structure: it would be best to have
the following sequence to maintain the same scheduler topology and
behaviour when partially applying the set (currently testing this on Juno,
but should be the case for other platforms as well):

2/8 arch_topology: Set thread sibling cpumask only within the cluster
5/8 arch_topology: Check for non-negative value rather than -1 for IDs validity
6/8 arch_topology: Avoid parsing through all the CPUs once an outlier CPU is found

--> these are only preparation/cleanup patches and don't affect current
functionality

7/8 of: base: add support to get the device node for the CPU's last level cache
8/8 arch_topology: Add support to build llc_sibling on DT platforms

--> these will populate llc siblings but this list will be equal to
core siblings (based on package_id) so nothing changes in the scheduler
topology. Even if CONFIG_SCHED_CLUSTER=y, we still have cluster_id=-1 so
nothing will change in that case either, for the patches so far.

1/8 arch_topology: Don't set cluster identifier as physical package identifier

--> 1/8 is the trouble maker if it's the first patch as it will result
in having all CPUs in core_siblings so the topology will be flattened to
just an MC level for a typical b.L system like Juno. But if you add it after
all of the above patches, the llc_siblings will contribute to create the same
MC and DIE levels we expect.

3/8 arch_topology: Set cluster identifier in each core/thread from /cpu-map
4/8 arch_topology: Add support for parsing sockets in /cpu-map

--> Here 3/8 will start creating complications when having clusters in
DT and we have CONFIG_SCHED_CLUSTER=y. But I'll detail this in a reply
to that patch. For CONFIG_SCHED_CLUSTER=n the topology and scheduler
behaviour should be the same as before this set.

Hope it helps,
Ionela.


> drivers/base/arch_topology.c | 75 +++++++++++++++++++++++++++--------
> drivers/of/base.c | 33 +++++++++++----
> include/linux/arch_topology.h | 1 +
> include/linux/of.h | 1 +
> 4 files changed, 85 insertions(+), 25 deletions(-)
>
> --
> 2.36.1
>
>

2022-05-21 07:48:01

by Dietmar Eggemann

Subject: Re: [PATCH v2 1/8] arch_topology: Don't set cluster identifier as physical package identifier

On 18/05/2022 11:33, Sudeep Holla wrote:

You say `cluster identifier` which to me refers to
`cpu_topology[cpu].cluster_id`. But I guess you meant `cluster node`
from cpu-map DT node?

Otherwise you say we link (1. level) cpu-map cluster nodes to
`cluster_id/_sibling`? But then (1) we will never support nested
clusters and (2) why would we need llc support then?

[...]

2022-05-21 10:03:50

by Sudeep Holla

Subject: Re: [PATCH v2 1/8] arch_topology: Don't set cluster identifier as physical package identifier

On Fri, May 20, 2022 at 02:31:24PM +0200, Dietmar Eggemann wrote:
> On 18/05/2022 11:33, Sudeep Holla wrote:
>
> You say `cluster identifier` which to me refers to
> `cpu_topology[cpu].cluster_id`. But I guess you meant `cluster node`
> from cpu-map DT node?
>

Correct, I am referring to the leaf cluster node identifier in terms of
the cpu-map DT node, which we now store in cpu_topology[cpu].cluster_id as
part of this series instead of the previous cpu_topology[cpu].package_id.

> Otherwise you say we link (1. level) cpu-map cluster nodes to
> `cluster_id/_sibling`? But then (1) we will never support nested
> clusters and (2) why would we need llc support then?
>

(1) Do we have any platforms with nested clusters today ? No phantom
clusters please, as this is info that gets to the userspace and must
reflect the real hardware. If one exists, then we need to add nested
cluster support for that hardware. I am not aware of any such platform,
in particular a DT based one.
(2) LLC was added to support chiplets. IIRC, cpu_coregroup_mask was changed
to select the smallest of the LLC, socket siblings, and NUMA node siblings
to ensure that the sched domain we build for the MC layer isn't larger
than the DIE above it, or it is shrunk to the socket or NUMA node if the
LLC extends across NUMA nodes/chiplets.

But over time, we have patched cpu_coregroup_mask to work around things,
which I think is now about to break :).

--
Regards,
Sudeep

2022-05-21 11:30:12

by Sudeep Holla

Subject: Re: [PATCH v2 8/8] arch_topology: Add support to build llc_sibling on DT platforms

On Thu, May 19, 2022 at 01:10:51PM -0500, Rob Herring wrote:
> On Wed, May 18, 2022 at 4:34 AM Sudeep Holla <[email protected]> wrote:
> >
> > ACPI PPTT provides cache identifiers and especially the last level cache
> > identifier is used in obtaining last level cache siblings amongst CPUs.
> >
> > While we have the cpu map representing all the CPUs sharing last level
> > cache in the cacheinfo driver, it is populated quite late in the boot
> > while the information is needed to build scheduler domains quite early.
>
> Late is because it's a device_initcall() rather than late in the cpu
> hotplug state machine, right?

Right. The expectation is to run it on each online CPU in the CPU hotplug
state machine for some architectures. We may not need that on arm64
especially since we get all the info from DT or ACPI, but e.g. x86 uses
cpuid which needs to be executed on that CPU.

> The late aspect is for sysfs presumably, but I think we could decouple that.

OK, not sure when this sched_domain info is actually needed. I think it
could be decoupled if we can wait until all the cpus are online.

> Do all the firmware cache parsing early and then populate the sysfs parts
> later.

Yes that may work on DT/ACPI based systems, as I said x86 relies on cpuid.

> It's not a unique problem as the DT unflattening and init code has to
> do the same thing. I'd assume the hotplug and cpu sysfs devices have
> to deal with the same thing.
>

OK, I will take a look at how to do that.

--
Regards,
Sudeep

2022-05-22 08:25:28

by Sudeep Holla

Subject: Re: [PATCH v2 2/8] arch_topology: Set thread sibling cpumask only within the cluster

On Fri, May 20, 2022 at 02:32:19PM +0200, Dietmar Eggemann wrote:
> On 18/05/2022 11:33, Sudeep Holla wrote:
> > Currently the cluster identifier is not set on the DT based platforms.
> > The reset or default value is -1 for all the CPUs. Once we assign the
> > cluster identifier values correctly, that may result in getting the
> > thread siblings wrong, as the core identifiers can be the same for 2
> > different CPUs belonging to 2 different clusters.
> >
> > So, in order to get the thread sibling cpumasks correct, we need to
> > update them only if the cores they belong to are in the same cluster
> > within the socket. Let us skip updating the thread sibling cpumasks if
> > the cluster identifier doesn't match.
>
> So this issue should be there on ACPI systems as well then. Kunpeng920
> and Ampere Altra don't have SMT, so very likely nobody has noticed this
> so far.

No, I think it is handled correctly. Maybe the wording of my commit log
needs changing. Previously the core ids were generated and unique across
the platform. But with this change I am assigning the id as read from the
DT node using core@<id>. That results in the problem and needs to be taken
care of. This doesn't change the end result (the actual mask), only the
way we arrive at it. In short, this is not fixing any issue that was
broken before.

--
Regards,
Sudeep

2022-05-22 19:35:43

by Dietmar Eggemann

Subject: Re: [PATCH v2 2/8] arch_topology: Set thread sibling cpumask only within the cluster

On 18/05/2022 11:33, Sudeep Holla wrote:
> Currently the cluster identifier is not set on the DT based platforms.
> The reset or default value is -1 for all the CPUs. Once we assign the
> cluster identifier values correctly, that may result in getting the
> thread siblings wrong, as the core identifiers can be the same for 2
> different CPUs belonging to 2 different clusters.
>
> So, in order to get the thread sibling cpumasks correct, we need to
> update them only if the cores they belong to are in the same cluster
> within the socket. Let us skip updating the thread sibling cpumasks if
> the cluster identifier doesn't match.

So this issue should be there on ACPI systems as well then. Kunpeng920
and Ampere Altra don't have SMT, so very likely nobody has noticed this
so far.

[...]

2022-05-23 06:29:25

by Sudeep Holla

Subject: Re: [PATCH v2 8/8] arch_topology: Add support to build llc_sibling on DT platforms

On Fri, May 20, 2022 at 09:36:20AM -0500, Rob Herring wrote:
> On Fri, May 20, 2022 at 01:59:59PM +0100, Sudeep Holla wrote:
> > On Thu, May 19, 2022 at 01:10:51PM -0500, Rob Herring wrote:
> > > On Wed, May 18, 2022 at 4:34 AM Sudeep Holla <[email protected]> wrote:
> > > >
> > > > ACPI PPTT provides cache identifiers and especially the last level cache
> > > > identifier is used in obtaining last level cache siblings amongst CPUs.
> > > >
> > > > While we have the cpu map representing all the CPUs sharing last level
> > > > cache in the cacheinfo driver, it is populated quite late in the boot
> > > > while the information is needed to build scheduler domains quite early.
> > >
> > > Late is because it's a device_initcall() rather than late in the cpu
> > > hotplug state machine, right?
> >
> > Right. The expectation is to run it on each online CPU in the CPU hotplug
> > state machine for some architectures. We may not need that on arm64
> > especially since we get all the info from DT or ACPI, but e.g. x86 uses
> > cpuid which needs to be executed on that CPU.
>
> That's a separate issue. I'm not suggesting changing that part (that
> would just be an optimization).
>
> > > The late aspect is for sysfs presumably, but I think we could decouple that.
> >
> > OK, not sure when this sched_domain info is actually needed. I think it
> > could be decoupled if we can wait until all the cpus are online.
>
> No need to wait for all cpus to be online. I think you keep doing
> it as part of cpu hotplug. The device_initcall() is used because you
> cannot have struct device or sysfs calls before the driver core is
> initialized. If we run the cacheinfo code earlier (I think the arch code
> will have to call it) just like the topology code and skip the sysfs
> parts, then you can use it.
>

Yes, I was thinking along similar lines, though I didn't think of pushing
the code to the arch side. Let me check; it must be possible.

> > > Do all the firmware cache parsing early and then populate the sysfs parts
> > > later.
> >
> > Yes that may work on DT/ACPI based systems, as I said x86 relies on cpuid.
>
> I'd assume using cpuid works at any time?
>

I think so, but I can't recall the details since I looked at that about 5
years ago. I have to check again anyway to explore early initialisation.

--
Regards,
Sudeep

2022-05-23 06:57:51

by Sudeep Holla

Subject: Re: [PATCH v2 3/8] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Fri, May 20, 2022 at 02:33:19PM +0200, Dietmar Eggemann wrote:
> On 19/05/2022 18:55, Ionela Voinescu wrote:
> > Hi,
> >
> > As said before, this creates trouble for CONFIG_SCHED_CLUSTER=y.
> > The output below is obtained from Juno.
> >
> > When cluster_id is populated, a new CLS level is created by the scheduler
> > topology code. In this case the clusters in DT determine that the cluster
> > siblings and llc siblings are the same so the MC scheduler domain will
> > be removed and, for Juno, only CLS and DIE will be kept.
>
> [...]
>
> > To be noted that we also get a new flag SD_PREFER_SIBLING for the CLS
> > level that is not appropriate. We usually remove it for the child of a
> > SD_ASYM_CPUCAPACITY domain, but we don't currently redo this after some
> > levels are degenerated. This is a fixable issue.
> >
> > But looking at the bigger picture, a good question is what is the best
> > thing to do when cluster domains and llc domains span the same CPUs?
> >
> > Possibly it would be best to restrict clusters (which are almost an
> > arbitrary concept) to always span a subset of CPUs of the llc domain,
> > if llc siblings can be obtained? If those clusters are not properly set
> > up in DT to respect this condition, cluster_siblings would need to be
> > cleared (or set to the current CPU) so the CLS domain is not created at
> > all.
> >
> > Additionally, should we use cluster information from DT (cluster_id) to
> > create an MC level if we don't have llc information, even if
> > CONFIG_SCHED_CLUSTER=n?
> >
> > I currently don't have a very clear picture of how cluster domains and
> > llc domains would "live" together in a variety of topologies. I'll try
> > other DT topologies to see if there are others that can lead to trouble.
>
> This would be an issue. Depending on CONFIG_SCHED_CLUSTER we would get
> two different systems from the viewpoint of the scheduler.
>

Agreed, but that is the issue with the change that updates
cpu_coregroup_mask based on CONFIG_SCHED_CLUSTER, the one that we added
recently for Ampere systems. Sorry, but I admit I was OK with the
workaround then; all these discussions have made me disagree with that
change now.

> To me `cluster_id/_sibling` don't describe a certain level of CPU
> grouping (e.g. one level above core or one level below package).
>

True, but based on how it is extracted from the firmware today (ACPI, and
DT with this series), it is one level above the cores.

> They were introduced to describe one level below LLC (e.g. Kunpeng920 L3
> (24 CPUs LLC) -> L3 tag (4 CPUs) or x86 Jacobsville L3 -> L2), (Commit
> c5e22feffdd7 ("topology: Represent clusters of CPUs within a die")).
>

Again correct, but the sysfs description is what we need to rely on from
now on. If that is not clear, we need to make it clear. But I don't really
bother much about how it is related to the LLC, as that is known to vary
on different systems.

> The Ampere Altra issue already gave us a taste of the possible issues of
> this definition, commit db1e59483dfd ("topology: make core_mask include
> at least cluster_siblings").
>

Yes, this is the one I am referring to above. I tend to disagree with that now.

> If we link `cluster_id/_sibling` against (1. level) cpu-map cluster
> nodes plus using llc and `cluster_sibling >= llc_sibling` we will run
> into these issues.

As I said, we can't change the topology masks just because they cause
issues in the way we build sched_domains. You need to figure out how to
build the sched domains for such systems. If the current set of domains,
or how the masks for each of the domains are derived, is insufficient or
incorrect, we need to fix that. We are not changing the topology masks for
that, whatever the reason might be, sorry, as these are userspace visible.

--
Regards,
Sudeep

2022-05-23 07:12:33

by Dietmar Eggemann

Subject: Re: [PATCH v2 3/8] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On 19/05/2022 18:55, Ionela Voinescu wrote:
> Hi,
>
> As said before, this creates trouble for CONFIG_SCHED_CLUSTER=y.
> The output below is obtained from Juno.
>
> When cluster_id is populated, a new CLS level is created by the scheduler
> topology code. In this case the clusters in DT determine that the cluster
> siblings and llc siblings are the same so the MC scheduler domain will
> be removed and, for Juno, only CLS and DIE will be kept.

[...]

> To be noted that we also get a new flag SD_PREFER_SIBLING for the CLS
> level that is not appropriate. We usually remove it for the child of a
> SD_ASYM_CPUCAPACITY domain, but we don't currently redo this after some
> levels are degenerated. This is a fixable issue.
>
> But looking at the bigger picture, a good question is what is the best
> thing to do when cluster domains and llc domains span the same CPUs?
>
> Possibly it would be best to restrict clusters (which are almost an
> arbitrary concept) to always span a subset of CPUs of the llc domain,
> if llc siblings can be obtained? If those clusters are not properly set
> up in DT to respect this condition, cluster_siblings would need to be
> cleared (or set to the current CPU) so the CLS domain is not created at
> all.
>
> Additionally, should we use cluster information from DT (cluster_id) to
> create an MC level if we don't have llc information, even if
> CONFIG_SCHED_CLUSTER=n?
>
> I currently don't have a very clear picture of how cluster domains and
> llc domains would "live" together in a variety of topologies. I'll try
> other DT topologies to see if there are others that can lead to trouble.

This would be an issue. Depending on CONFIG_SCHED_CLUSTER we would get
two different systems from the viewpoint of the scheduler.

To me `cluster_id/_sibling` don't describe a certain level of CPU
grouping (e.g. one level above core or one level below package).

They were introduced to describe one level below LLC (e.g. Kunpeng920 L3
(24 CPUs LLC) -> L3 tag (4 CPUs) or x86 Jacobsville L3 -> L2), (Commit
c5e22feffdd7 ("topology: Represent clusters of CPUs within a die")).

The Ampere Altra issue already gave us a taste of the possible issues of
this definition, commit db1e59483dfd ("topology: make core_mask include
at least cluster_siblings").

If we link `cluster_id/_sibling` against (1. level) cpu-map cluster
nodes plus using llc and `cluster_sibling >= llc_sibling` we will run
into these issues.

2022-05-23 07:16:43

by Rob Herring

Subject: Re: [PATCH v2 8/8] arch_topology: Add support to build llc_sibling on DT platforms

On Wed, May 18, 2022 at 4:34 AM Sudeep Holla <[email protected]> wrote:
>
> ACPI PPTT provides cache identifiers and especially the last level cache
> identifier is used in obtaining last level cache siblings amongst CPUs.
>
> While we have the cpu map representing all the CPUs sharing last level
> cache in the cacheinfo driver, it is populated quite late in the boot
> while the information is needed to build scheduler domains quite early.

Late is because it's a device_initcall() rather than late in the cpu
hotplug state machine, right? The late aspect is for sysfs presumably,
but I think we could decouple that. Do all the firmware cache parsing
early and then populate the sysfs parts later. It's not a unique
problem as the DT unflattening and init code has to do the same thing.
I'd assume the hotplug and cpu sysfs devices have to deal with the
same thing.

Rob

2022-05-23 07:16:55

by Sudeep Holla

Subject: Re: [PATCH v2 0/8] arch_topology: Updates to add socket support and fix cluster ids

On Thu, May 19, 2022 at 05:32:49PM +0100, Ionela Voinescu wrote:
> Hi Sudeep,
>
> On Wednesday 18 May 2022 at 10:33:17 (+0100), Sudeep Holla wrote:
> > Hi All,
> >
> > This series intends to fix some discrepancies we have in the CPU topology
> > parsing from the device tree /cpu-map node. This also diverges from the
> > behaviour on an ACPI enabled platform. The expectation is that both DT
> > and ACPI enabled systems must present a consistent view of the CPU topology.
> >
> > Currently we assign the generated cluster count as the physical package
> > identifier for each CPU, which is wrong. The device tree bindings for CPU
> > topology support sockets to infer the socket or physical package identifier
> > for a given CPU. Also, we don't check whether all the cores/threads belong
> > to the same cluster before updating their sibling masks, which is fine only
> > because we don't set the cluster id yet.
> >
> > These changes also assign the cluster identifier as parsed from the device
> > tree cluster nodes within /cpu-map, without support for nesting of the
> > clusters. Finally, the series also adds support for socket nodes in
> > /cpu-map. With this, the parsing of the exact same information from ACPI
> > PPTT and the /cpu-map DT node aligns well.
> >
> > The only exception is that the last level cache id information can be
> > inferred directly from the same ACPI PPTT, while on DT we need to parse
> > the CPU cache nodes for it.
> >
> > P.S: I have not cc-ed Greg and Rafael so that all the users of arch_topology
> > can agree on the changes first before we include them.
> >
> > v1[1]->v2:
> > - Updated the ID validity check to include all non-negative values
> > - Added support to get the device node for the CPU's last level cache
> > - Added support to build llc_sibling on DT platforms
> >
> > [1] https://lore.kernel.org/lkml/[email protected]
> >
> > Sudeep Holla (8):
> > arch_topology: Don't set cluster identifier as physical package identifier
> > arch_topology: Set thread sibling cpumask only within the cluster
> > arch_topology: Set cluster identifier in each core/thread from /cpu-map
> > arch_topology: Add support for parsing sockets in /cpu-map
> > arch_topology: Check for non-negative value rather than -1 for IDs validity
> > arch_topology: Avoid parsing through all the CPUs once an outlier CPU is found
> > of: base: add support to get the device node for the CPU's last level cache
> > arch_topology: Add support to build llc_sibling on DT platforms
> >
>
> Just a recommendation for patch-set structure: it would be best to have
> the following sequence to maintain the same scheduler topology and
> behaviour when partially applying the set (currently testing this on Juno,
> but should be the case for other platforms as well):
>
> 2/8 arch_topology: Set thread sibling cpumask only within the cluster
> 5/8 arch_topology: Check for non-negative value rather than -1 for IDs validity
> 6/8 arch_topology: Avoid parsing through all the CPUs once an outlier CPU is found
>
> --> these are only preparation/cleanup patches and don't affect current
> functionality
>

Agreed. It was on my TODO list, but I just went ahead and posted this v2
anyway. I knew the patches were more in the order of my thoughts on what
all needs to be done and the order in which I added them, and that it
could result in issues with kernel bisection.


> 7/8 of: base: add support to get the device node for the CPU's last level cache
> 8/8 arch_topology: Add support to build llc_sibling on DT platforms
>
> --> these will populate llc siblings but this list will be equal to
> core siblings (based on package_id) so nothing changes in the scheduler
> topology. Even if CONFIG_SCHED_CLUSTER=y, we still have cluster_id=-1 so
> nothing will change in that case either, for the patches so far.
>

Correct, I had worked out that much of the detail.

> 1/8 arch_topology: Don't set cluster identifier as physical package identifier
>
> --> 1/8 is the trouble maker if it's the first patch as it will result
> in having all CPUs in core_siblings so the topology will be flattened to
> just an MC level for a typical b.L system like Juno. But if you add it after
> all of the above patches, the llc_siblings will contribute to create the same
> MC and DIE levels we expect.
>
> 3/8 arch_topology: Set cluster identifier in each core/thread from /cpu-map
> 4/8 arch_topology: Add support for parsing sockets in /cpu-map
>
> --> Here 3/8 will start creating complications when having clusters in
> DT and we have CONFIG_SCHED_CLUSTER=y. But I'll detail this in a reply
> to that patch. For CONFIG_SCHED_CLUSTER=n the topology and scheduler
> behaviour should be the same as before this set.
>

Thanks for the ordering of the last 3. I wanted to work that out myself,
but you have clearly done more work than me on that; thanks for saving my
time. Much appreciated.

> Hope it helps,

Yes definitely, thanks again.

--
Regards,
Sudeep

2022-05-23 07:37:55

by Sudeep Holla

Subject: Re: [PATCH v2 8/8] arch_topology: Add support to build llc_sibling on DT platforms

On Fri, May 20, 2022 at 02:33:46PM +0200, Dietmar Eggemann wrote:
> On 18/05/2022 11:33, Sudeep Holla wrote:
> > ACPI PPTT provides cache identifiers and especially the last level cache
> > identifier is used in obtaining last level cache siblings amongst CPUs.
> >
> > While we have the cpu map representing all the CPUs sharing last level
> > cache in the cacheinfo driver, it is populated quite late in the boot
> > while the information is needed to build scheduler domains quite early.
> >
> > On DT platforms we can use the pointer to the last level cache as the
> > firmware identifier for the last level cache and build the cpumap sharing
> > the last level cache based on the same.
>
> [...]
>
> > diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> > index 58cbe18d825c..d8a36b0e27c9 100644
> > --- a/include/linux/arch_topology.h
> > +++ b/include/linux/arch_topology.h
> > @@ -69,6 +69,7 @@ struct cpu_topology {
> > int cluster_id;
> > int package_id;
> > int llc_id;
> > + void *llc_fw_node;
>
> Would be nicer if you could set llc_id directly to avoid all this
> addition sync `llc_id and llc_fw_node` code. ACPI PPTT has this
> ACPI_PTR_DIFF() macro which IMHO lets it create distinct ids.
>

Indeed, I initially thought so, but they are 64-bit pointers, and choosing
a reference point to take some sort of diff against is difficult, while in
ACPI it is a static table and we can use the reference to the top of the
table.

--
Regards,
Sudeep

2022-05-23 08:06:50

by Dietmar Eggemann

Subject: Re: [PATCH v2 8/8] arch_topology: Add support to build llc_sibling on DT platforms

On 18/05/2022 11:33, Sudeep Holla wrote:
> ACPI PPTT provides cache identifiers and especially the last level cache
> identifier is used in obtaining last level cache siblings amongst CPUs.
>
> While we have the cpu map representing all the CPUs sharing last level
> cache in the cacheinfo driver, it is populated quite late in the boot
> while the information is needed to build scheduler domains quite early.
>
> On DT platforms we can use the pointer to the last level cache as the
> firmware identifier for the last level cache and build the cpumap sharing
> the last level cache based on the same.

[...]

> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index 58cbe18d825c..d8a36b0e27c9 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -69,6 +69,7 @@ struct cpu_topology {
> int cluster_id;
> int package_id;
> int llc_id;
> + void *llc_fw_node;

Would be nicer if you could set llc_id directly to avoid all this
addition sync `llc_id and llc_fw_node` code. ACPI PPTT has this
ACPI_PTR_DIFF() macro which IMHO lets it create distinct ids.

> cpumask_t thread_sibling;
> cpumask_t core_sibling;
> cpumask_t cluster_sibling;


2022-05-23 08:12:18

by Sudeep Holla

Subject: Re: [PATCH v2 3/8] arch_topology: Set cluster identifier in each core/thread from /cpu-map

On Thu, May 19, 2022 at 05:55:30PM +0100, Ionela Voinescu wrote:
> Hi,
>
> As said before, this creates trouble for CONFIG_SCHED_CLUSTER=y.
> The output below is obtained from Juno.
>
> When cluster_id is populated, a new CLS level is created by the scheduler
> topology code. In this case the clusters in DT determine that the cluster
> siblings and llc siblings are the same so the MC scheduler domain will
> be removed and, for Juno, only CLS and DIE will be kept.
>

Yes, I have seen that.

1. Will that differ with ACPI on Juno ?

2. Is that a problem ? I mean, we are fixing those user visible masks with
this series, and if using them as is in the sched domains is incorrect or
insufficient, we need to fix that. We can't change these masks.

> root@debian-arm64-buster:/sys/kernel/debug/sched/domains/cpu1# grep . */*
> domain0/busy_factor:16
> domain0/cache_nice_tries:1
> domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
> domain0/imbalance_pct:117
> domain0/max_interval:4
> domain0/max_newidle_lb_cost:14907
> domain0/min_interval:2
> domain0/name:CLS
> domain1/busy_factor:16
> domain1/cache_nice_tries:1
> domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_PREFER_SIBLING
> domain1/imbalance_pct:117
> domain1/max_interval:12
> domain1/max_newidle_lb_cost:11858
> domain1/min_interval:6
> domain1/name:DIE
>
> To be noted that we also get a new flag SD_PREFER_SIBLING for the CLS
> level that is not appropriate. We usually remove it for the child of a
> SD_ASYM_CPUCAPACITY domain, but we don't currently redo this after some
> levels are degenerated. This is a fixable issue.
>

OK good.

> But looking at the bigger picture, a good question is what is the best
> thing to do when cluster domains and llc domains span the same CPUs?
>

Indeed, I will leave that to the scheduler experts :). My main goal is to
get the user visible topology masks right while keeping the current
behaviour intact.

> Possibly it would be best to restrict clusters (which are almost an
> arbitrary concept) to always span a subset of CPUs of the llc domain,
> if llc siblings can be obtained? If those clusters are not properly set
> up in DT to respect this condition, cluster_siblings would need to be
> cleared (or set to the current CPU) so the CLS domain is not created at
> all.
>

Yes, we already have all these complex conditions in cpu_coregroup_mask.
Why not do something similar in cpu_clustergroup_mask ?
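
For instance, a guard along these lines in cpu_clustergroup_mask() (an
illustrative sketch only, not part of this series) would degenerate the
CLS level whenever it cannot be smaller than MC:

	const struct cpumask *cpu_clustergroup_mask(int cpu)
	{
		/*
		 * Sketch: if the cluster mask would cover the whole core
		 * group (or more), fall back to a single-CPU mask so the
		 * CLS sched domain degenerates.
		 */
		if (cpumask_subset(cpu_coregroup_mask(cpu),
				   &cpu_topology[cpu].cluster_sibling))
			return get_cpu_mask(cpu);

		return &cpu_topology[cpu].cluster_sibling;
	}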

> Additionally, should we use cluster information from DT (cluster_id) to
> create an MC level if we don't have llc information, even if
> CONFIG_SCHED_CLUSTER=n?
>

I don't know. We have all the topology and llc info now; we need to decide
how to present them to the scheduler.

> I currently don't have a very clear picture of how cluster domains and
> llc domains would "live" together in a variety of topologies. I'll try
> other DT topologies to see if there are others that can lead to trouble.
>

Me neither. I avoided these changes for years because of such complex
questions, but that is the wrong reason, as the user visible topology has
now diverged from the ACPI one (the people who complained about user space
breakage cared only about ACPI) and we need to bring them to parity soon
IMO.

--
Regards,
Sudeep