2022-01-29 11:01:09

by Huang, Ying

Subject: [RFC PATCH 1/2] NUMA balancing: fix NUMA topology type for memory tiering system

With the advent of various new memory types, such as Intel Optane
DCPMM, some machines will have multiple types of memory, e.g. DRAM and
PMEM (persistent memory). The memory subsystem of such a machine can
be called a memory tiering system, because the performance of the
different types of memory differs.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), PMEM can be used as cost-effective
volatile memory in separate NUMA nodes. In a typical memory tiering
system, there are CPUs, DRAM, and PMEM in each physical NUMA node.
The CPUs and the DRAM will be put in one logical node, while the PMEM
will be put in another (faked) logical node.

For example, the ACPI SLIT of one such system is as follows,

[000h 0000  4]               Signature : "SLIT" [System Locality Information Table]
[004h 0004  4]            Table Length : 0000042C
[008h 0008  1]                Revision : 01
[009h 0009  1]                Checksum : 59
[00Ah 0010  6]                  Oem ID : "XXXX"
[010h 0016  8]            Oem Table ID : "XXXXXXX"
[018h 0024  4]            Oem Revision : 00000001
[01Ch 0028  4]         Asl Compiler ID : "INTL"
[020h 0032  4]   Asl Compiler Revision : 20091013

[024h 0036  8]              Localities : 0000000000000004
[02Ch 0044  4]              Locality 0 : 0A 15 11 1C
[030h 0048  4]              Locality 1 : 15 0A 1C 11
[034h 0052  4]              Locality 2 : 11 1C 0A 1C
[038h 0056  4]              Locality 3 : 1C 11 1C 0A

While the `numactl -H` output is as follows,

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10

In this system, there are only 2 sockets. In each memory controller,
both DRAM and PMEM DIMMs are installed. Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex. Because both distance(0, 1) and distance(1, 3) are less
than distance(0, 3), it appears that node 1 sits between node 0 and
node 3, and the whole system appears to have a glueless-mesh NUMA
topology type. But it definitely does not: there is not even a CPU
in node 3.
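
To make the misclassification concrete, the classification logic of
init_numa_topology_type() can be replayed in a small standalone sketch
(userspace C, not the kernel code; the distance matrix, the has_cpu[]
mask, and the distinct-distance count standing in for
sched_domains_numa_levels are all hard-coded here from the example
above):

	/* topo_type.c (hypothetical file name): cc -o topo_type topo_type.c */
	#include <stdio.h>

	#define NR_NODES	4

	/* Node distances from the example SLIT/numactl output above. */
	static const int dist[NR_NODES][NR_NODES] = {
		{ 10, 21, 17, 28 },
		{ 21, 10, 28, 17 },
		{ 17, 28, 10, 28 },
		{ 28, 17, 28, 10 },
	};

	/* Nodes 0 and 1 have CPUs; the PMEM nodes 2 and 3 do not. */
	static const int has_cpu[NR_NODES] = { 1, 1, 0, 0 };

	static int skip(int n, int cpu_only)
	{
		return cpu_only && !has_cpu[n];
	}

	/*
	 * Replay init_numa_topology_type(): count the distinct distance
	 * values (a stand-in for sched_domains_numa_levels), then look
	 * for an intermediary node between the two furthest-removed nodes.
	 */
	static const char *topology_type(int cpu_only)
	{
		int seen[256] = { 0 };
		int a, b, c, levels = 0, n = 0;

		for (a = 0; a < NR_NODES; a++) {
			for (b = 0; b < NR_NODES; b++) {
				if (skip(a, cpu_only) || skip(b, cpu_only))
					continue;
				if (!seen[dist[a][b]]++)
					levels++;
				if (dist[a][b] > n)
					n = dist[a][b];	/* max distance */
			}
		}

		/* Two distance values (local, remote) mean a direct topology. */
		if (levels <= 2)
			return "NUMA_DIRECT";

		for (a = 0; a < NR_NODES; a++) {
			for (b = 0; b < NR_NODES; b++) {
				if (skip(a, cpu_only) || skip(b, cpu_only))
					continue;
				/* Find two nodes furthest removed from each other. */
				if (dist[a][b] < n)
					continue;
				/* Is there an intermediary node between a and b? */
				for (c = 0; c < NR_NODES; c++) {
					if (skip(c, cpu_only))
						continue;
					if (dist[a][c] < n && dist[b][c] < n)
						return "NUMA_GLUELESS_MESH";
				}
				return "NUMA_BACKPLANE";
			}
		}
		return "NUMA_DIRECT";
	}

	int main(void)
	{
		printf("all online nodes: %s\n", topology_type(0));
		printf("CPU nodes only:   %s\n", topology_type(1));
		return 0;
	}

Considering all 4 nodes, the furthest pair is (0, 3) at distance 28
and node 1 qualifies as an intermediary (21 and 17 are both below 28),
so the sketch prints NUMA_GLUELESS_MESH; restricted to the CPU nodes
it prints NUMA_DIRECT, matching the physical topology.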

This isn't a practical problem yet, because the PMEM nodes (node 2
and node 3 in the example system) are offline by default during
system boot, so init_numa_topology_type(), called during boot, will
ignore them and set sched_numa_topology_type to NUMA_DIRECT. At
runtime, init_numa_topology_type() is only called when a CPU of a
never-onlined-before node gets plugged in, and there's no CPU in the
PMEM nodes. But it appears better to fix this to make the code more
robust.

To test the potential problem, we used a debug patch that calls
init_numa_topology_type() when a PMEM node is onlined (in
__set_migration_target_nodes()). With the debug patch, it's verified
that sched_numa_topology_type is set to NUMA_GLUELESS_MESH.

One possible fix is to ignore CPU-less nodes when detecting NUMA
topology type in init_numa_topology_type(). That works well for the
example system. Is it good in general for any system with CPU-less
nodes?

Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
---
kernel/sched/topology.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9f26e6b651fe..ba975a29d444 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1738,7 +1738,13 @@ void init_numa_topology_type(void)
 	}
 
 	for_each_online_node(a) {
+		if (!node_state(a, N_CPU))
+			continue;
+
 		for_each_online_node(b) {
+			if (!node_state(b, N_CPU))
+				continue;
+
 			/* Find two nodes furthest removed from each other. */
 			if (node_distance(a, b) < n)
 				continue;
--
2.30.2


2022-01-29 11:02:28

by Huang, Ying

Subject: [RFC PATCH 2/2] NUMA balancing: avoid to migrate task to CPU-less node

In a typical memory tiering system, there's no CPU in the slow (PMEM)
NUMA nodes. But if the number of hint page faults on a PMEM node is
the max for a task, the current NUMA balancing policy may try to
place the task on the PMEM node instead of a DRAM node. This is
unreasonable, because there's no CPU in PMEM NUMA nodes. To fix
this, this patch ignores CPU-less nodes when searching for the
migration target node for a task.

To test the patch, we ran a workload that accesses more memory on the
PMEM node than on the DRAM node. Without the patch, the PMEM node is
chosen as the preferred node in task_numa_placement(); with the
patch, the DRAM node is chosen instead.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54e1aad1c5d7..e462ac5c1e48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2393,6 +2393,10 @@ static void task_numa_placement(struct task_struct *p)
 			}
 		}
 
+		/* Cannot migrate task to CPU-less node */
+		if (!node_state(nid, N_CPU))
+			continue;
+
 		if (!ng) {
 			if (faults > max_faults) {
 				max_faults = faults;
--
2.30.2

2022-01-29 16:28:18

by Srikar Dronamraju

Subject: Re: [RFC PATCH 1/2] NUMA balancing: fix NUMA topology type for memory tiering system

* Huang Ying <[email protected]> [2022-01-28 10:38:41]:

>
> One possible fix is to ignore CPU-less nodes when detecting NUMA
> topology type in init_numa_topology_type(). That works well for the
> example system. Is it good in general for any system with CPU-less
> nodes?
>

A CPU-less node at online time doesn't necessarily mean a CPU-less
node for the entire boot. For example: on PowerVM LPARs, aka powerpc
systems, some of the nodes may start as CPU-less nodes and CPUs may
later get populated/hotplugged onto them.

Hence I am not sure if adding a check for CPU-less nodes at node
online will work for such systems.

> Signed-off-by: "Huang, Ying" <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Srikar Dronamraju <[email protected]>
> ---
> kernel/sched/topology.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>

--
Thanks and Regards
Srikar Dronamraju

2022-01-29 17:56:54

by Srikar Dronamraju

Subject: Re: [RFC PATCH 2/2] NUMA balancing: avoid to migrate task to CPU-less node

* Huang Ying <[email protected]> [2022-01-28 10:38:42]:

> In a typical memory tiering system, there's no CPU in the slow (PMEM)
> NUMA nodes. But if the number of hint page faults on a PMEM node is
> the max for a task, the current NUMA balancing policy may try to
> place the task on the PMEM node instead of a DRAM node. This is
> unreasonable, because there's no CPU in PMEM NUMA nodes. To fix
> this, this patch ignores CPU-less nodes when searching for the
> migration target node for a task.
>
> To test the patch, we ran a workload that accesses more memory on the
> PMEM node than on the DRAM node. Without the patch, the PMEM node is
> chosen as the preferred node in task_numa_placement(); with the
> patch, the DRAM node is chosen instead.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Srikar Dronamraju <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 54e1aad1c5d7..e462ac5c1e48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2393,6 +2393,10 @@ static void task_numa_placement(struct task_struct *p)
> }
> }
>
> + /* Cannot migrate task to CPU-less node */
> + if (!node_state(nid, N_CPU))
> + continue;
> +

Let's take the example that you quoted: a 2-socket machine with 1 DRAM
node and 1 PMEM node per socket. Now let's say the task is placed on a
CPU in node 1 but most of its memory faults are coming from node 2,
which is the PMEM node attached to node 0. Now without the hunk, there
is a chance that the task gets moved to node 0. However, with the
change, are we inhibiting such a move?

> if (!ng) {
> if (faults > max_faults) {
> max_faults = faults;
> --
> 2.30.2
>

--
Thanks and Regards
Srikar Dronamraju

2022-01-30 03:53:10

by Huang, Ying

Subject: Re: [RFC PATCH 1/2] NUMA balancing: fix NUMA topology type for memory tiering system

Srikar Dronamraju <[email protected]> writes:

> * Huang Ying <[email protected]> [2022-01-28 10:38:41]:
>
>>
>> One possible fix is to ignore CPU-less nodes when detecting NUMA
>> topology type in init_numa_topology_type(). That works well for the
>> example system. Is it good in general for any system with CPU-less
>> nodes?
>>
>
> A CPU-less node at online time doesn't necessarily mean a CPU-less
> node for the entire boot. For example: on PowerVM LPARs, aka powerpc
> systems, some of the nodes may start as CPU-less nodes and CPUs may
> later get populated/hotplugged onto them.

Got it!

> Hence I am not sure if adding a check for CPU-less nodes at node
> online will work for such systems.

How about something as below?

Best Regards,
Huang, Ying

-----------------------8<-----------------------------

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..733e8bd930b4 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1737,7 +1737,13 @@ static void init_numa_topology_type(void)
 	}
 
 	for_each_online_node(a) {
+		if (!node_state(a, N_CPU))
+			continue;
+
 		for_each_online_node(b) {
+			if (!node_state(b, N_CPU))
+				continue;
+
 			/* Find two nodes furthest removed from each other. */
 			if (node_distance(a, b) < n)
 				continue;
@@ -1849,6 +1855,13 @@ void sched_init_numa(void)
 
 			sched_domains_numa_masks[i][j] = mask;
 
+			/*
+			 * The mask will be initialized when the first CPU of
+			 * the node is onlined.
+			 */
+			if (!node_state(j, N_CPU))
+				continue;
+
 			for_each_node(k) {
 				/*
 				 * Distance information can be unreliable for
@@ -1919,8 +1932,10 @@ void sched_init_numa(void)
 		return;
 
 	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
-	for_each_online_node(i)
-		bitmap_set(sched_numa_onlined_nodes, i, 1);
+	for_each_online_node(i) {
+		if (node_state(i, N_CPU))
+			bitmap_set(sched_numa_onlined_nodes, i, 1);
+	}
 }
 
 static void __sched_domains_numa_masks_set(unsigned int node)
@@ -1928,7 +1943,7 @@ static void __sched_domains_numa_masks_set(unsigned int node)
 	int i, j;
 
 	/*
-	 * NUMA masks are not built for offline nodes in sched_init_numa().
+	 * NUMA masks are not built for offline/CPU-less nodes in sched_init_numa().
 	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
 	 * adding that new CPU to the right NUMA masks is not sufficient: the
 	 * masks of that CPU's node must also be updated.

2022-01-30 12:58:15

by Huang, Ying

Subject: Re: [RFC PATCH 2/2] NUMA balancing: avoid to migrate task to CPU-less node

Srikar Dronamraju <[email protected]> writes:

> * Huang Ying <[email protected]> [2022-01-28 10:38:42]:
>
>> In a typical memory tiering system, there's no CPU in the slow (PMEM)
>> NUMA nodes. But if the number of hint page faults on a PMEM node is
>> the max for a task, the current NUMA balancing policy may try to
>> place the task on the PMEM node instead of a DRAM node. This is
>> unreasonable, because there's no CPU in PMEM NUMA nodes. To fix
>> this, this patch ignores CPU-less nodes when searching for the
>> migration target node for a task.
>>
>> To test the patch, we ran a workload that accesses more memory on the
>> PMEM node than on the DRAM node. Without the patch, the PMEM node is
>> chosen as the preferred node in task_numa_placement(); with the
>> patch, the DRAM node is chosen instead.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: Srikar Dronamraju <[email protected]>
>> ---
>> kernel/sched/fair.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 54e1aad1c5d7..e462ac5c1e48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2393,6 +2393,10 @@ static void task_numa_placement(struct task_struct *p)
>> }
>> }
>>
>> + /* Cannot migrate task to CPU-less node */
>> + if (!node_state(nid, N_CPU))
>> + continue;
>> +
>
> Let's take the example that you quoted: a 2-socket machine with 1 DRAM
> node and 1 PMEM node per socket. Now let's say the task is placed on a
> CPU in node 1 but most of its memory faults are coming from node 2,
> which is the PMEM node attached to node 0. Now without the hunk, there
> is a chance that the task gets moved to node 0. However, with the
> change, are we inhibiting such a move?

This sounds reasonable. How about the following solution? If a
CPU-less node is selected as the migration target, we select the
nearest node with CPUs instead. That is, something like the below
patch.

Best Regards,
Huang, Ying

------------------------------8<---------------------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5146163bfabb..52d926d8cbdb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2401,6 +2401,23 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (!node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_online_node(nid) {
+			if (!node_state(nid, N_CPU))
+				continue;
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
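
To sanity-check the fallback against the example topology from patch
1/2, here is a minimal standalone sketch (userspace C, not the kernel
patch; the distance matrix and has_cpu[] mask are hard-coded stand-ins
for node_distance() and node_state(nid, N_CPU)):

	#include <stdio.h>
	#include <limits.h>

	#define NR_NODES	4

	/* Node distances from the example system in patch 1/2. */
	static const int dist[NR_NODES][NR_NODES] = {
		{ 10, 21, 17, 28 },
		{ 21, 10, 28, 17 },
		{ 17, 28, 10, 28 },
		{ 28, 17, 28, 10 },
	};

	/* Nodes 0 and 1 have CPUs; the PMEM nodes 2 and 3 do not. */
	static const int has_cpu[NR_NODES] = { 1, 1, 0, 0 };

	/* Mirror the fallback: redirect a CPU-less target to the nearest CPU node. */
	static int resolve_target(int max_nid)
	{
		int nid, near_nid = max_nid;
		int distance, near_distance = INT_MAX;

		if (has_cpu[max_nid])
			return max_nid;

		for (nid = 0; nid < NR_NODES; nid++) {
			if (!has_cpu[nid])
				continue;
			distance = dist[max_nid][nid];
			if (distance < near_distance) {
				near_nid = nid;
				near_distance = distance;
			}
		}
		return near_nid;
	}

	int main(void)
	{
		int nid;

		for (nid = 0; nid < NR_NODES; nid++)
			printf("preferred node %d -> migration target %d\n",
			       nid, resolve_target(nid));
		return 0;
	}

For this topology, PMEM node 2 resolves to node 0 (distance 17 vs. 28)
and PMEM node 3 resolves to node 1. So in the scenario you describe, a
task on node 1 faulting mostly on node 2 would still be steered towards
node 0 rather than staying put.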

2022-01-31 11:11:13

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/2] NUMA balancing: fix NUMA topology type for memory tiering system

On Fri, Jan 28, 2022 at 10:38:41AM +0800, Huang Ying wrote:
> kernel/sched/topology.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9f26e6b651fe..ba975a29d444 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1738,7 +1738,13 @@ void init_numa_topology_type(void)
> }
>
> for_each_online_node(a) {
> + if (!node_state(a, N_CPU))
> + continue;
> +
> for_each_online_node(b) {
> + if (!node_state(b, N_CPU))
> + continue;
> +
> /* Find two nodes furthest removed from each other. */
> if (node_distance(a, b) < n)
> continue;

I think you forgot some.. by not skipping CPU-less nodes in
sched_init_numa() the whole premise of init_numa_topology_type() goes
out the window as well, by virtue of getting sched_domains_numa_levels
and sched_max_numa_distance wrong.

Did I get them all?

Do we want something like:

#define for_each_possible_cpu_node(n) for (n = 0; n < nr_node_ids; n++) if (!node_state(n, N_CPU)) continue; else
#define for_each_online_cpu_node(n) for_each_online_node(n) if (!node_state(n, N_CPU)) continue; else

To clean that up?
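
For illustration, a quick userspace mock-up of the proposed shape
(has_cpu[] standing in for node_state(n, N_CPU)) shows why the
"continue; else" form composes with a trailing statement or block:

	#include <stdio.h>

	#define NR_NODES 4

	static const int has_cpu[NR_NODES] = { 1, 1, 0, 0 };

	/* Userspace stand-ins for the proposed helpers; the kernel versions
	 * would test node_state(n, N_CPU) inside for_each_online_node(). */
	#define for_each_node(n) \
		for ((n) = 0; (n) < NR_NODES; (n)++)
	#define for_each_cpu_node(n) \
		for_each_node(n) if (!has_cpu[n]) continue; else

	int main(void)
	{
		int a, b;

		/* The double loop from init_numa_topology_type(), CPU nodes only. */
		for_each_cpu_node(a)
			for_each_cpu_node(b)
				printf("pair (%d, %d)\n", a, b); /* only nodes 0 and 1 */
		return 0;
	}

The statement following the macro becomes the else branch, so the
continue silently skips CPU-less nodes while the loop body reads just
like the plain for_each_online_node() loops above.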

---
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1684,8 +1684,12 @@ static void sched_numa_warn(const char *
 
 	for (i = 0; i < nr_node_ids; i++) {
 		printk(KERN_WARNING "  ");
-		for (j = 0; j < nr_node_ids; j++)
-			printk(KERN_CONT "%02d ", node_distance(i,j));
+		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_state(i, N_CPU) || !node_state(j, N_CPU))
+				printk(KERN_CONT "(%02d) ", node_distance(i,j));
+			else
+				printk(KERN_CONT " %02d ", node_distance(i,j));
+		}
 		printk(KERN_CONT "\n");
 	}
 	printk(KERN_WARNING "\n");
@@ -1737,7 +1741,13 @@ static void init_numa_topology_type(void
 	}
 
 	for_each_online_node(a) {
+		if (!node_state(a, N_CPU))
+			continue;
+
 		for_each_online_node(b) {
+			if (!node_state(b, N_CPU))
+				continue;
+
 			/* Find two nodes furthest removed from each other. */
 			if (node_distance(a, b) < n)
 				continue;
@@ -1756,6 +1766,9 @@ static void init_numa_topology_type(void
 			return;
 		}
 	}
+
+	pr_err("Failed to find a NUMA topology type, defaulting to DIRECT\n");
+	sched_numa_topology_type = NUMA_DIRECT;
 }
 
 
@@ -1778,9 +1791,15 @@ void sched_init_numa(void)
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
 	for (i = 0; i < nr_node_ids; i++) {
+		if (!node_state(i, N_CPU))
+			continue;
+
 		for (j = 0; j < nr_node_ids; j++) {
 			int distance = node_distance(i, j);
 
+			if (!node_state(j, N_CPU))
+				continue;
+
 			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
 				sched_numa_warn("Invalid distance value range");
 				return;
@@ -1863,6 +1882,12 @@ void sched_init_numa(void)
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
+				if (!node_state(j, N_CPU))
+					continue;
+
+				if (!node_state(k, N_CPU))
+					continue;
+
 				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;

@@ -1943,6 +1968,9 @@ static void __sched_domains_numa_masks_s
 			if (!node_online(j) || node == j)
 				continue;
 
+			if (!node_state(j, N_CPU))
+				continue;
+
 			if (node_distance(j, node) > sched_domains_numa_distance[i])
 				continue;
 
@@ -1974,6 +2002,9 @@ void sched_domains_numa_masks_set(unsign
 			if (!node_online(j))
 				continue;
 
+			if (!node_state(j, N_CPU))
+				continue;
+
 			/* Set ourselves in the remote node's masks */
 			if (node_distance(j, node) <= sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);

2022-01-31 11:11:44

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/2] NUMA balancing: fix NUMA topology type for memory tiering system

On Fri, Jan 28, 2022 at 03:30:50PM +0800, Huang, Ying wrote:
> Srikar Dronamraju <[email protected]> writes:
>
> > * Huang Ying <[email protected]> [2022-01-28 10:38:41]:
> >
> >>
> >> One possible fix is to ignore CPU-less nodes when detecting NUMA
> >> topology type in init_numa_topology_type(). That works well for the
> >> example system. Is it good in general for any system with CPU-less
> >> nodes?
> >>
> >
> > A CPU-less node at online time doesn't necessarily mean a CPU-less
> > node for the entire boot. For example: on PowerVM LPARs, aka powerpc
> > systems, some of the nodes may start as CPU-less nodes and CPUs may
> > later get populated/hotplugged onto them.
>
> Got it!
>
> > Hence I am not sure if adding a check for CPU-less nodes at node
> > online will work for such systems.
>
> How about something as below?

I'm thinking that might not be enough in that scenario; if we're going
to consistently skip CPU-less nodes (as I really think we should) then
__sched_domains_numa_masks_set() is not sufficient for the hotplug case
since sched_domains_numa_levels and sched_max_numa_distance can also
change.

This means we need to re-do more of sched_init_numa() and possibly
re-alloc some of those arrays etc..

Same for offline ofc.

2022-01-31 11:22:24

by Valentin Schneider

Subject: Re: [RFC PATCH 1/2] NUMA balancing: fix NUMA topology type for memory tiering system

On 28/01/22 16:06, Peter Zijlstra wrote:
> On Fri, Jan 28, 2022 at 10:38:41AM +0800, Huang Ying wrote:
>> kernel/sched/topology.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 9f26e6b651fe..ba975a29d444 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1738,7 +1738,13 @@ void init_numa_topology_type(void)
>> }
>>
>> for_each_online_node(a) {
>> + if (!node_state(a, N_CPU))
>> + continue;
>> +
>> for_each_online_node(b) {
>> + if (!node_state(b, N_CPU))
>> + continue;
>> +
>> /* Find two nodes furthest removed from each other. */
>> if (node_distance(a, b) < n)
>> continue;
>
> I think you forgot some.. by not skipping CPU-less nodes in
> sched_init_numa() the whole premise of init_numa_topology_type() goes
> out the window as well, by virtue of getting sched_domains_numa_levels
> and sched_max_numa_distance wrong.
>
> Did I get them all?
>

On the topology.c side of things I think so, but I'm thinking
score_nearby_nodes() and preferred_group_nid() would need some updating as
well.

> Do we want something like:
>
> #define for_each_possible_cpu_node(n) for (n = 0; n < nr_node_ids; n++) if (!node_state(n, N_CPU)) continue; else
> #define for_each_online_cpu_node(n) for_each_online_node(n) if (!node_state(n, N_CPU)) continue; else
>
> To clean that up?
>

Looks reasonable!

> ---
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1684,8 +1684,12 @@ static void sched_numa_warn(const char *
>
> for (i = 0; i < nr_node_ids; i++) {
> printk(KERN_WARNING " ");
> - for (j = 0; j < nr_node_ids; j++)
> - printk(KERN_CONT "%02d ", node_distance(i,j));
> + for (j = 0; j < nr_node_ids; j++) {
> + if (!node_state(i, N_CPU) || !node_state(j, N_CPU))
> + printk(KERN_CONT "(%02d) ", node_distance(i,j));
> + else
> + printk(KERN_CONT " %02d ", node_distance(i,j));
> + }
> printk(KERN_CONT "\n");
> }
> printk(KERN_WARNING "\n");
> @@ -1737,7 +1741,13 @@ static void init_numa_topology_type(void
> }
>
> for_each_online_node(a) {
> + if (!node_state(a, N_CPU))
> + continue;
> +
> for_each_online_node(b) {
> + if (!node_state(b, N_CPU))
> + continue;
> +
> /* Find two nodes furthest removed from each other. */
> if (node_distance(a, b) < n)
> continue;
> @@ -1756,6 +1766,9 @@ static void init_numa_topology_type(void
> return;
> }
> }
> +
> + pr_err("Failed to find a NUMA topology type, defaulting to DIRECT\n");
> + sched_numa_topology_type = NUMA_DIRECT;
> }
>
>
> @@ -1778,9 +1791,15 @@ void sched_init_numa(void)
>
> bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> for (i = 0; i < nr_node_ids; i++) {
> + if (!node_state(i, N_CPU))
> + continue;
> +
> for (j = 0; j < nr_node_ids; j++) {
> int distance = node_distance(i, j);
>
> + if (!node_state(j, N_CPU))
> + continue;
> +
> if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> sched_numa_warn("Invalid distance value range");
> return;
> @@ -1863,6 +1882,12 @@ void sched_init_numa(void)
> if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
> sched_numa_warn("Node-distance not symmetric");
>
> + if (!node_state(j, N_CPU))
> + continue;
> +
> + if (!node_state(k, N_CPU))
> + continue;
> +
> if (node_distance(j, k) > sched_domains_numa_distance[i])
> continue;
>
> @@ -1943,6 +1968,9 @@ static void __sched_domains_numa_masks_s
> if (!node_online(j) || node == j)
> continue;
>
> + if (!node_state(j, N_CPU))
> + continue;
> +
> if (node_distance(j, node) > sched_domains_numa_distance[i])
> continue;
>
> @@ -1974,6 +2002,9 @@ void sched_domains_numa_masks_set(unsign
> if (!node_online(j))
> continue;
>
> + if (!node_state(j, N_CPU))
> + continue;
> +
> /* Set ourselves in the remote node's masks */
> if (node_distance(j, node) <= sched_domains_numa_distance[i])
> cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);

2022-02-01 09:16:04

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] NUMA balancing: fix NUMA topology type for memory tiering system

Peter Zijlstra <[email protected]> writes:

> On Fri, Jan 28, 2022 at 03:30:50PM +0800, Huang, Ying wrote:
>> Srikar Dronamraju <[email protected]> writes:
>>
>> > * Huang Ying <[email protected]> [2022-01-28 10:38:41]:
>> >
>> >>
>> >> One possible fix is to ignore CPU-less nodes when detecting NUMA
>> >> topology type in init_numa_topology_type(). That works well for the
>> >> example system. Is it good in general for any system with CPU-less
>> >> nodes?
>> >>
>> >
>> > A CPU-less node at online time doesn't necessarily mean a CPU-less
>> > node for the entire boot. For example: on PowerVM LPARs, aka powerpc
>> > systems, some of the nodes may start as CPU-less nodes and CPUs may
>> > later get populated/hotplugged onto them.
>>
>> Got it!
>>
>> > Hence I am not sure if adding a check for CPU-less nodes at node
>> > online will work for such systems.
>>
>> How about something as below?
>
> I'm thinking that might not be enough in that scenario; if we're going
> to consistently skip CPU-less nodes (as I really think we should) then
> __sched_domains_numa_masks_set() is not sufficient for the hotplug case
> since sched_domains_numa_levels and sched_max_numa_distance can also
> change.
>
> This means we need to re-do more of sched_init_numa() and possibly
> re-alloc some of those arrays etc..
>
> Same for offline ofc.

Got it! It doesn't make sense to create scheduling domains for
CPU-less nodes. I can work on this after the Chinese New Year holiday
week (the whole next week). But if anyone wants to work on this, feel
free to do so.

Best Regards,
Huang, Ying

2022-02-07 20:29:10

by Srikar Dronamraju

Subject: Re: [RFC PATCH 2/2] NUMA balancing: avoid to migrate task to CPU-less node

* Huang, Ying <[email protected]> [2022-01-28 15:51:36]:

> Srikar Dronamraju <[email protected]> writes:
>
> > * Huang Ying <[email protected]> [2022-01-28 10:38:42]:
> >
> This sounds reasonable. How about the following solution? If a
> CPU-less node is selected as the migration target, we select the
> nearest node with CPUs instead. That is, something like the below
> patch.
>
> Best Regards,
> Huang, Ying
>
> ------------------------------8<---------------------------------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5146163bfabb..52d926d8cbdb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2401,6 +2401,23 @@ static void task_numa_placement(struct task_struct *p)
> }
> }
>
> + /* Cannot migrate task to CPU-less node */
> + if (!node_state(max_nid, N_CPU)) {
> + int near_nid = max_nid;
> + int distance, near_distance = INT_MAX;
> +
> + for_each_online_node(nid) {
> + if (!node_state(nid, N_CPU))
> + continue;
> + distance = node_distance(max_nid, nid);
> + if (distance < near_distance) {
> + near_nid = nid;
> + near_distance = distance;
> + }
> + }
> + max_nid = near_nid;
> + }
> +


This looks good, but should we move this into preferred_group_nid()?
I.e., should we care for the !ng case, since that would mean only
private faults?

> if (ng) {
> numa_group_count_active_nodes(ng);
> spin_unlock_irq(group_lock);

--
Thanks and Regards
Srikar Dronamraju

2022-02-09 05:43:39

by Huang, Ying

Subject: Re: [RFC PATCH 2/2] NUMA balancing: avoid to migrate task to CPU-less node

Srikar Dronamraju <[email protected]> writes:

> * Huang, Ying <[email protected]> [2022-01-28 15:51:36]:
>
>> Srikar Dronamraju <[email protected]> writes:
>>
>> > * Huang Ying <[email protected]> [2022-01-28 10:38:42]:
>> >
>> This sounds reasonable. How about the following solution? If a
>> CPU-less node is selected as the migration target, we select the
>> nearest node with CPUs instead. That is, something like the below
>> patch.
>>
>> Best Regards,
>> Huang, Ying
>>
>> ------------------------------8<---------------------------------
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5146163bfabb..52d926d8cbdb 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2401,6 +2401,23 @@ static void task_numa_placement(struct task_struct *p)
>> }
>> }
>>
>> + /* Cannot migrate task to CPU-less node */
>> + if (!node_state(max_nid, N_CPU)) {
>> + int near_nid = max_nid;
>> + int distance, near_distance = INT_MAX;
>> +
>> + for_each_online_node(nid) {
>> + if (!node_state(nid, N_CPU))
>> + continue;
>> + distance = node_distance(max_nid, nid);
>> + if (distance < near_distance) {
>> + near_nid = nid;
>> + near_distance = distance;
>> + }
>> + }
>> + max_nid = near_nid;
>> + }
>> +
>
>
> This looks good, but should we move this into preferred_group_nid()?

Yes. We need to take care of preferred_group_nid() too. Will do that
in the next version.

> I.e., should we care for the !ng case, since that would mean only private faults?

IMO we need to care for the !ng case. If the fault count of a
CPU-less node is the max for a task, we need to migrate the task to
the nearest node with CPUs instead.

Best Regards,
Huang, Ying

>> if (ng) {
>> numa_group_count_active_nodes(ng);
>> spin_unlock_irq(group_lock);