2020-05-01 03:13:56

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 0/3] Offline memoryless cpuless node 0

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u

Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with only one online node.
2. Existence of dummy (cpuless and memoryless) node which can confuse
users/scripts looking at output of lscpu / numactl.

This patchset wants to correct this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently there are only 2 architectures ia64 and powerpc that have this
config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.7-rc3
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node 0 2
0: 10 20
2: 20 10

proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

v5.7-rc3 + patches
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node 2
2: 10

proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Gautham R Shenoy <[email protected]>

Srikar Dronamraju (3):
powerpc/numa: Set numa_node for all possible cpus
powerpc/numa: Prefer node id queried from vphn
mm/page_alloc: Keep memoryless cpuless node 0 offline

arch/powerpc/mm/numa.c | 35 ++++++++++++++++++++++++-----------
mm/page_alloc.c | 4 +++-
2 files changed, 27 insertions(+), 12 deletions(-)

--
2.20.1


2020-05-01 03:14:18

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus

A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v1:->v2:
- Rebased to v5.7-rc3

arch/powerpc/mm/numa.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..b3615b7fdbdf 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@ void __init mem_topology_setup(void)

reset_numa_cpu_lookup_table();

- for_each_present_cpu(cpu)
- numa_setup_cpu(cpu);
+ for_each_possible_cpu(cpu) {
+ /*
+ * Powerpc with CONFIG_NUMA always used to have a node 0,
+ * even if it was memoryless or cpuless. For all cpus that
+ * are possible but not present, cpu_to_node() would point
+ * to node 0. To remove a cpuless, memoryless dummy node,
+ * powerpc need to make sure all possible but not present
+ * cpu_to_node are set to a proper node.
+ */
+ if (cpu_present(cpu))
+ numa_setup_cpu(cpu);
+ else
+ set_cpu_numa_node(cpu, first_online_node);
+ }
}

void __init initmem_init(void)
--
2.20.1

2020-05-01 03:14:46

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.

By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
always online.

v5.7-rc3
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node 0 2
0: 10 20
2: 20 10

proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

v5.7-rc3 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node 2
2: 10

proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v1:->v2:
- Rebased to v5.7-rc3
Link v2: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u

mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4fa052..03b89592af04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
*/
nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+ [N_ONLINE] = NODE_MASK_NONE,
+#else
[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
[N_NORMAL_MEMORY] = { { [0] = 1UL } },
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
--
2.20.1

Subject: Re: [PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus

On Fri, 1 May 2020, Srikar Dronamraju wrote:

> - for_each_present_cpu(cpu)
> - numa_setup_cpu(cpu);
> + for_each_possible_cpu(cpu) {
> + /*
> + * Powerpc with CONFIG_NUMA always used to have a node 0,
> + * even if it was memoryless or cpuless. For all cpus that
> + * are possible but not present, cpu_to_node() would point
> + * to node 0. To remove a cpuless, memoryless dummy node,
> + * powerpc need to make sure all possible but not present
> + * cpu_to_node are set to a proper node.
> + */
> + if (cpu_present(cpu))
> + numa_setup_cpu(cpu);
> + else
> + set_cpu_numa_node(cpu, first_online_node);
> + }
> }


Can this be folded into numa_setup_cpu?

This looks more like numa_setup_cpu needs to change?

Subject: Re: [PATCH v3 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Fri, 1 May 2020, Srikar Dronamraju wrote:

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> */
> nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> [N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> + [N_ONLINE] = NODE_MASK_NONE,

Hmmm.... I would have expected that you would have added something early
in boot that would mark the current node (whatever is is) online instead?

2020-05-08 13:10:28

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* Christopher Lameter <[email protected]> [2020-05-02 23:05:28]:

> On Fri, 1 May 2020, Srikar Dronamraju wrote:
>
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > */
> > nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > [N_POSSIBLE] = NODE_MASK_ALL,
> > +#ifdef CONFIG_NUMA
> > + [N_ONLINE] = NODE_MASK_NONE,
>
> Hmmm.... I would have expected that you would have added something early
> in boot that would mark the current node (whatever is is) online instead?

Do correct me, but these are structure initialization in page_alloc.c
Wouldn't these happen much before the numa initialization happens?
I think we are already marking nodes as online as soon as we detect the
nodes.

--
Thanks and Regards
Srikar Dronamraju

2020-05-08 13:23:43

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus

* Christopher Lameter <[email protected]> [2020-05-02 22:55:16]:

> On Fri, 1 May 2020, Srikar Dronamraju wrote:
>
> > - for_each_present_cpu(cpu)
> > - numa_setup_cpu(cpu);
> > + for_each_possible_cpu(cpu) {
> > + /*
> > + * Powerpc with CONFIG_NUMA always used to have a node 0,
> > + * even if it was memoryless or cpuless. For all cpus that
> > + * are possible but not present, cpu_to_node() would point
> > + * to node 0. To remove a cpuless, memoryless dummy node,
> > + * powerpc need to make sure all possible but not present
> > + * cpu_to_node are set to a proper node.
> > + */
> > + if (cpu_present(cpu))
> > + numa_setup_cpu(cpu);
> > + else
> > + set_cpu_numa_node(cpu, first_online_node);
> > + }
> > }
>
>
> Can this be folded into numa_setup_cpu?
>
> This looks more like numa_setup_cpu needs to change?
>

We can fold this into numa_setup_cpu().

However till now we were sure that numa_setup_cpu() would be called only for
a present cpu. That assumption will change.
+ (non-consequential) an additional check everytime cpu is hotplugged in.

If Michael Ellerman is okay with the change, I can fold it in.

--
Thanks and Regards
Srikar Dronamraju

2020-05-11 11:29:38

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus

Srikar Dronamraju <[email protected]> writes:
> * Christopher Lameter <[email protected]> [2020-05-02 22:55:16]:
>
>> On Fri, 1 May 2020, Srikar Dronamraju wrote:
>>
>> > - for_each_present_cpu(cpu)
>> > - numa_setup_cpu(cpu);
>> > + for_each_possible_cpu(cpu) {
>> > + /*
>> > + * Powerpc with CONFIG_NUMA always used to have a node 0,
>> > + * even if it was memoryless or cpuless. For all cpus that
>> > + * are possible but not present, cpu_to_node() would point
>> > + * to node 0. To remove a cpuless, memoryless dummy node,
>> > + * powerpc need to make sure all possible but not present
>> > + * cpu_to_node are set to a proper node.
>> > + */
>> > + if (cpu_present(cpu))
>> > + numa_setup_cpu(cpu);
>> > + else
>> > + set_cpu_numa_node(cpu, first_online_node);
>> > + }
>> > }
>>
>> Can this be folded into numa_setup_cpu?
>>
>> This looks more like numa_setup_cpu needs to change?
>
> We can fold this into numa_setup_cpu().
>
> However till now we were sure that numa_setup_cpu() would be called only for
> a present cpu. That assumption will change.
> + (non-consequential) an additional check everytime cpu is hotplugged in.
>
> If Michael Ellerman is okay with the change, I can fold it in.

Yes I agree it would be better in numa_setup_cpu().

cheers