2020-06-24 09:30:17

by Srikar Dronamraju

Subject: [PATCH v5 0/3] Offline memoryless cpuless node 0

Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/[email protected]/t/#u

Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/[email protected]/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u

A Linux kernel configured with CONFIG_NUMA, on a system with multiple
possible nodes, marks node 0 as online at boot. However, in practice there
are systems which have node 0 as memoryless and cpuless.

This can cause:
1. numa_balancing to be enabled on systems with only one node with memory
and cpus.
2. The existence of a dummy (cpuless and memoryless) node which can
confuse users/scripts looking at the output of lscpu / numactl.

This patchset corrects this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently only two architectures, ia64 and powerpc, have this config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.8-rc2
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node   0   2
  0:  10  20
  2:  20  10

proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

v5.8-rc2 + patches
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node   2
  2:  10

proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

1. User space applications like numactl and lscpu that parse sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that the system
was not able to use all its resources (i.e. resources are missing) or that
the system was not set up correctly.

2. The existence of the dummy node also leads to inconsistent information:
the number of online nodes is inconsistent with the information in the
device tree and resource dump.

3. When the dummy node is present, single-node non-NUMA systems end up
showing up as NUMA systems and numa_balancing gets enabled. This means we
take the hit from unnecessary numa hinting faults.

On a machine with just one node, whose node number is not 0, the current
setup ends up showing 2 online nodes. And when more than one node is
online, numa_balancing gets enabled.
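
For reference, the boot-time auto-enable decision boils down to a check on
the number of online nodes. Below is a simplified sketch, modelled loosely
on check_numabalancing_enable() in mm/mempolicy.c (the numa_balancing=
command line override and the informational printout are elided):

static void __init check_numabalancing_enable(void)
{
        /*
         * A dummy online node 0 makes num_online_nodes() return 2 on
         * what is effectively a single-node system, so balancing gets
         * switched on and we pay for numa hinting faults for no gain.
         */
        if (num_online_nodes() > 1)
                set_numabalancing_state(true);
}

With the dummy node kept offline, num_online_nodes() returns 1 here and
balancing stays off.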

Without patch
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973 <----------
numa_huge_pte_updates 4654 <----------
numa_hint_faults 19560 <----------
numa_hint_faults_local 19560 <----------
numa_pages_migrated 0

With patch
$ grep numa /proc/vmstat
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0 <----------
numa_huge_pte_updates 0 <----------
numa_hint_faults 0 <----------
numa_hint_faults_local 0 <----------
numa_pages_migrated 0

Here are 2 sample numa programs.

numa01.sh runs a set of 2 processes, each with as many threads as there
are cpus; each thread does 50 loops of operations on 3GB of process-shared
memory.

numa02.sh runs a single process with as many threads as there are cpus;
each thread does 800 loops of operations on 32MB of thread-local memory.

Without patch
-------------
Testcase       Time:      Min        Max        Avg     StdDev
./numa01.sh    Real:   149.62     149.66     149.64       0.02
./numa01.sh    Sys:      3.21       3.71       3.46       0.25
./numa01.sh    User:  4755.13    4758.15    4756.64       1.51
./numa02.sh    Real:    24.98      25.02      25.00       0.02
./numa02.sh    Sys:      0.51       0.59       0.55       0.04
./numa02.sh    User:   790.28     790.88     790.58       0.30

With patch
-----------
Testcase       Time:      Min        Max        Avg     StdDev     %Change
./numa01.sh    Real:   149.44     149.46     149.45       0.01   0.127133%
./numa01.sh    Sys:      0.71       0.89       0.80       0.09      332.5%
./numa01.sh    User:  4754.19    4754.48    4754.33       0.15  0.0485873%
./numa02.sh    Real:    24.97      24.98      24.98       0.00  0.0800641%
./numa02.sh    Sys:      0.26       0.41       0.33       0.08    66.6667%
./numa02.sh    User:   789.75     790.28     790.01       0.27   0.072151%

numa01.sh
param                     no_patch    with_patch     %Change
-----                     --------    ----------     -------
numa_hint_faults           1131164             0       -100%
numa_hint_faults_local     1131164             0       -100%
numa_hit                    213696        214244   0.256439%
numa_local                  213696        214244   0.256439%
numa_pte_updates           1131294             0       -100%
pgfault                    1380845        241424   -82.5162%
pgmajfault                      75            60        -20%

numa02.sh
param                     no_patch    with_patch     %Change
-----                     --------    ----------     -------
numa_hint_faults            111878             0       -100%
numa_hint_faults_local      111878             0       -100%
numa_hit                     41854         43220     3.26373%
numa_local                   41854         43220     3.26373%
numa_pte_updates            113926             0       -100%
pgfault                     163662         51210    -68.7099%
pgmajfault                      56            52    -7.14286%

Observations:
Real time and user time do not change much, but system time drops
noticeably. The reason is the number of numa hinting faults: with the
patch we no longer take any numa hinting faults.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Satheesh Rajendran <[email protected]>
Cc: David Hildenbrand <[email protected]>

Srikar Dronamraju (3):
powerpc/numa: Set numa_node for all possible cpus
powerpc/numa: Prefer node id queried from vphn
mm/page_alloc: Keep memoryless cpuless node 0 offline

arch/powerpc/mm/numa.c | 35 +++++++++++++++++++++++++----------
mm/page_alloc.c | 4 +++-
2 files changed, 28 insertions(+), 11 deletions(-)

--
2.18.1


2020-06-24 09:30:33

by Srikar Dronamraju

Subject: [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus

A powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 has no cpus or
memory attached to it. As per PAPR, the node affinity of a cpu is only
available once it is present/online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not onlined, powerpc needs
to make sure the cpu_to_node() mapping of all possible but not present
cpus is set to a proper node.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Satheesh Rajendran <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/[email protected]/t/#u

Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/[email protected]/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

arch/powerpc/mm/numa.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..5b7918c132f5 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -506,6 +506,11 @@ static int numa_setup_cpu(unsigned long lcpu)
 	int fcpu = cpu_first_thread_sibling(lcpu);
 	int nid = NUMA_NO_NODE;
 
+	if (!cpu_present(lcpu)) {
+		set_cpu_numa_node(lcpu, first_online_node);
+		return first_online_node;
+	}
+
 	/*
 	 * If a valid cpu-to-node mapping is already available, use it
 	 * directly instead of querying the firmware, since it represents
@@ -931,8 +936,17 @@ void __init mem_topology_setup(void)
 
 	reset_numa_cpu_lookup_table();
 
-	for_each_present_cpu(cpu)
+	for_each_possible_cpu(cpu) {
+		/*
+		 * Powerpc with CONFIG_NUMA always used to have a node 0,
+		 * even if it was memoryless or cpuless. For all cpus that
+		 * are possible but not present, cpu_to_node() would point
+		 * to node 0. To remove a cpuless, memoryless dummy node,
+		 * powerpc needs to make sure all possible but not present
+		 * cpu_to_node() mappings are set to a proper node.
+		 */
 		numa_setup_cpu(cpu);
+	}
 }
 
 void __init initmem_init(void)
--
2.18.1

2020-06-24 09:30:46

by Srikar Dronamraju

Subject: [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn

The node id queried from the static device tree may not be correct. For
example, it may always show 0 on a shared processor. Hence prefer the node
id queried from VPHN, and fall back on the device tree based node id if
the VPHN query fails.
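
Condensed, the lookup order this patch establishes looks as follows (a
sketch of the pattern; the real change is in the diff below):

        int nid = vphn_get_nid(cpu);    /* hypervisor-provided affinity */

        if (nid == NUMA_NO_NODE) {      /* VPHN unsupported or query failed */
                struct device_node *dn = of_get_cpu_node(cpu, NULL);

                BUG_ON(!dn);
                nid = of_node_to_nid_single(dn);        /* static device tree */
                of_node_put(dn);
        }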

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Satheesh Rajendran <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/[email protected]/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

arch/powerpc/mm/numa.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5b7918c132f5..163f5dc8fd46 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -724,21 +724,22 @@ static int __init parse_numa_properties(void)
 	 */
 	for_each_present_cpu(i) {
 		struct device_node *cpu;
-		int nid;
-
-		cpu = of_get_cpu_node(i, NULL);
-		BUG_ON(!cpu);
-		nid = of_node_to_nid_single(cpu);
-		of_node_put(cpu);
+		int nid = vphn_get_nid(i);
 
 		/*
 		 * Don't fall back to default_nid yet -- we will plug
 		 * cpus into nodes once the memory scan has discovered
 		 * the topology.
 		 */
-		if (nid < 0)
-			continue;
-		node_set_online(nid);
+		if (nid == NUMA_NO_NODE) {
+			cpu = of_get_cpu_node(i, NULL);
+			BUG_ON(!cpu);
+			nid = of_node_to_nid_single(cpu);
+			of_node_put(cpu);
+		}
+
+		if (likely(nid > 0))
+			node_set_online(nid);
 	}
 
 	get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
--
2.18.1

2020-06-24 09:31:22

by Srikar Dronamraju

Subject: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

Currently a Linux kernel with CONFIG_NUMA, on a system with multiple
possible nodes, marks node 0 as online at boot. However, in practice there
are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and cpus. The existence of this cpuless, memoryless dummy node
can confuse users/scripts looking at the output of lscpu / numactl.

By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0 is
always online.
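
For context, node_states[N_ONLINE] is the nodemask behind node_online()
and node_set_online(). With N_ONLINE initialized to NODE_MASK_NONE under
CONFIG_NUMA, no node (node 0 included) is assumed online at boot; a node
only shows up as online once the arch code actually finds cpus or memory
on it. A hedged sketch (the helper name is made up for illustration):

        static void __init online_populated_node(int nid)
        {
                if (!node_online(nid))          /* offline until proven populated */
                        node_set_online(nid);   /* sets nid in node_states[N_ONLINE] */
        }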

v5.8-rc2
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node   0   2
  0:  10  20
  2:  20  10

proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

v5.8-rc2 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node   2
  2:  10

proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

Note: On powerpc, cpu_to_node() of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without those 2 commits, a powerpc system might crash.

1. User space applications like numactl and lscpu that parse sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that the system
was not able to use all its resources (i.e. resources are missing) or that
the system was not set up correctly.

2. The existence of the dummy node also leads to inconsistent information:
the number of online nodes is inconsistent with the information in the
device tree and resource dump.

3. When the dummy node is present, single-node non-NUMA systems end up
showing up as NUMA systems and numa_balancing gets enabled. This means we
take the hit from unnecessary numa hinting faults.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Satheesh Rajendran <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/[email protected]/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
Link v2: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u

mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..5187664558e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
--
2.18.1

2020-06-24 09:51:18

by Gautham R Shenoy

Subject: Re: [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus

On Wed, Jun 24, 2020 at 02:58:44PM +0530, Srikar Dronamraju wrote:
> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> enabled always used to have a node 0, even if node 0 does not any cpus
> or memory attached to it. As per PAPR, node affinity of a cpu is only
> available once its present / online. For all cpus that are possible but
> not present, cpu_to_node() would point to node 0.
>
> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> to make sure all possible but not present cpu_to_node are set to a
> proper node.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Michal Hocko <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Gautham R Shenoy <[email protected]>
> Cc: Satheesh Rajendran <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Signed-off-by: Srikar Dronamraju <[email protected]>

This looks good to me.

Reviewed-by: Gautham R. Shenoy <[email protected]>

--
Thanks and Regards
gautham.

2020-06-24 10:33:43

by Gautham R Shenoy

Subject: Re: [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn

Hello Srikar,


On Wed, Jun 24, 2020 at 02:58:45PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Michal Hocko <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Gautham R Shenoy <[email protected]>
> Cc: Satheesh Rajendran <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Signed-off-by: Srikar Dronamraju <[email protected]>


This patch looks good to me.

Reviewed-by: Gautham R. Shenoy <[email protected]>


--
Thanks and Regards
gautham.

2020-06-29 14:58:40

by Christopher Lameter

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Wed, 24 Jun 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot. However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Maybe add something to explain why you are not simply mapping the
existing memory to NUMA node 0, which is after all just a numbering scheme
used by the kernel and can be used arbitrarily?

This could be seen more as a bug in the arch code during the setup of NUMA
nodes. The two nodes are created by the firmware / bootstrap code after
all. Just do not do it?

2020-06-30 04:02:44

by Srikar Dronamraju

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* Christopher Lameter <[email protected]> [2020-06-29 14:58:40]:

> On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
>
> > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > possible nodes, marks node 0 as online at boot. However in practice,
> > there are systems which have node 0 as memoryless and cpuless.
>
> Maybe add something to explain why you are not simply mapping the
> existing memory to NUMA node 0 which is after all just a numbering scheme
> used by the kernel and can be used arbitrarily?
>

I thought Michal Hocko already gave a clear picture on why mapping is a bad
idea. https://lore.kernel.org/lkml/[email protected]/t/#u
Are you suggesting that we add that as part of the changelog?

> This could be seen more as a bug in the arch code during the setup of NUMA
> nodes. The two nodes are created by the firmwware / bootstrap code after
> all. Just do not do it?
>

- The arch/setup code in powerpc is not onlining these nodes.
- Later on, cpus/memory in node 0 can be onlined.
- Firmware, in this case PHYP, is an independent code base by itself.

--
Thanks and Regards
Srikar Dronamraju

2020-07-01 08:42:53

by Michal Hocko

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline


On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote:
> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot. However in practice,
> there are systems which have node 0 as memoryless and cpuless.
>
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
>
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
>
> v5.8-rc2
> available: 2 nodes (0,2)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1 2 3 4 5 6 7
> node 2 size: 32625 MB
> node 2 free: 31490 MB
> node distances:
> node 0 2
> 0: 10 20
> 2: 20 10
>
> proc and sys files
> ------------------
> /sys/devices/system/node/online: 0,2
> /proc/sys/kernel/numa_balancing: 1
> /sys/devices/system/node/has_cpu: 2
> /sys/devices/system/node/has_memory: 2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible: 0-31
>
> v5.8-rc2 + patch
> ------------------
> available: 1 nodes (2)
> node 2 cpus: 0 1 2 3 4 5 6 7
> node 2 size: 32625 MB
> node 2 free: 31487 MB
> node distances:
> node 2
> 2: 10
>
> proc and sys files
> ------------------
> /sys/devices/system/node/online: 2
> /proc/sys/kernel/numa_balancing: 0
> /sys/devices/system/node/has_cpu: 2
> /sys/devices/system/node/has_memory: 2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible: 0-31
>
> Note: On Powerpc, cpu_to_node of possible but not present cpus would
> previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
> numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
> queried from vphn"). Without the 2 commits, Powerpc system might crash.
>
> 1. User space applications like Numactl, lscpu, that parse the sysfs tend to
> believe there is an extra online node. This tends to confuse users and
> applications. Other user space applications start believing that system was
> not able to use all the resources (i.e missing resources) or the system was
> not setup correctly.
>
> 2. Also existence of dummy node also leads to inconsistent information. The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
>
> 3. When the dummy node is present, single node non-Numa systems end up showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

I have to say that I dislike the node online/offline state and directly
exporting that to userspace. Users should only care whether the node has
memory/cpus. NUMA nodes can be online without any memory. Just offline all
the present memory blocks but do not physically hot remove them and you
are in the same situation. If users are confused by the output of tools
like numactl -H, then those tools could be updated to hide nodes without
any memory & cpus.

The autonuma problem sounds interesting but again this patch doesn't
really solve the underlying problem because I strongly suspect that the
problem is still there when a numa node gets all its memory offline as
mentioned above.

While I completely agree that making node 0 special is wrong, I still have
a hard time reviewing this very simple-looking patch because all the numa
initialization is so spread around that this might just blow up at
unexpected places. IIRC we have discussed testing in the previous version
and David has provided a way to emulate these configurations on x86. Did
you manage to use those instructions for additional testing on
architectures other than ppc?

> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Michal Hocko <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Gautham R Shenoy <[email protected]>
> Cc: Satheesh Rajendran <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> Changelog v4:->v5:
> - rebased to v5.8-rc2
> link v4: http://lore.kernel.org/lkml/[email protected]/t/#u
>
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> Link v2: https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u
>
> mm/page_alloc.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..5187664558e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
> */
> nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> [N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> + [N_ONLINE] = NODE_MASK_NONE,
> +#else
> [N_ONLINE] = { { [0] = 1UL } },
> -#ifndef CONFIG_NUMA
> [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> #ifdef CONFIG_HIGHMEM
> [N_HIGH_MEMORY] = { { [0] = 1UL } },
> --
> 2.18.1
>

--
Michal Hocko
SUSE Labs

2020-07-01 10:06:15

by Srikar Dronamraju

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* Michal Hocko <[email protected]> [2020-07-01 10:42:00]:

>
> >
> > 2. Also existence of dummy node also leads to inconsistent information. The
> > number of online nodes is inconsistent with the information in the
> > device-tree and resource-dump
> >
> > 3. When the dummy node is present, single node non-Numa systems end up showing
> > up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > the hit from the unnecessary numa hinting faults.
>
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory&cpus.
>
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
>
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
>

I have tried all the steps that David mentioned and reported back at
https://lore.kernel.org/lkml/[email protected]/t/#u

As a summary, David's steps are still not creating a memoryless/cpuless
node on an x86 VM. I have tried booting with NUMA/non-NUMA on all the x86
machines that I could get access to.

--
Thanks and Regards
Srikar Dronamraju

2020-07-01 10:16:41

by David Hildenbrand

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On 01.07.20 12:04, Srikar Dronamraju wrote:
> * Michal Hocko <[email protected]> [2020-07-01 10:42:00]:
>
>>
>>>
>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>> number of online nodes is inconsistent with the information in the
>>> device-tree and resource-dump
>>>
>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>> the hit from the unnecessary numa hinting faults.
>>
>> I have to say that I dislike the node online/offline state and directly
>> exporting that to the userspace. Users should only care whether the node
>> has memory/cpus. Numa nodes can be online without any memory. Just
>> offline all the present memory blocks but do not physically hot remove
>> them and you are in the same situation. If users are confused by an
>> output of tools like numactl -H then those could be updated and hide
>> nodes without any memory&cpus.
>>
>> The autonuma problem sounds interesting but again this patch doesn't
>> really solve the underlying problem because I strongly suspect that the
>> problem is still there when a numa node gets all its memory offline as
>> mentioned above.
>>
>> While I completely agree that making node 0 special is wrong, I have
>> still hard time to review this very simply looking patch because all the
>> numa initialization is so spread around that this might just blow up
>> at unexpected places. IIRC we have discussed testing in the previous
>> version and David has provided a way to emulate these configurations
>> on x86. Did you manage to use those instruction for additional testing
>> on other than ppc architectures?
>>
>
> I have tried all the steps that David mentioned and reported back at
> https://lore.kernel.org/lkml/[email protected]/t/#u
>
> As a summary, David's steps are still not creating a memoryless/cpuless on
> x86 VM.

Now, that is wrong. You get a memoryless/cpuless node, which is *not
online*. Once you hotplug some memory, it will switch online. Once you
remove memory, it will switch back offline.

--
Thanks,

David / dhildenb

2020-07-01 11:03:23

by Srikar Dronamraju

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* David Hildenbrand <[email protected]> [2020-07-01 12:15:54]:

> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > * Michal Hocko <[email protected]> [2020-07-01 10:42:00]:
> >
> >>
> >>>
> >>> 2. Also existence of dummy node also leads to inconsistent information. The
> >>> number of online nodes is inconsistent with the information in the
> >>> device-tree and resource-dump
> >>>
> >>> 3. When the dummy node is present, single node non-Numa systems end up showing
> >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>> the hit from the unnecessary numa hinting faults.
> >>
> >> I have to say that I dislike the node online/offline state and directly
> >> exporting that to the userspace. Users should only care whether the node
> >> has memory/cpus. Numa nodes can be online without any memory. Just
> >> offline all the present memory blocks but do not physically hot remove
> >> them and you are in the same situation. If users are confused by an
> >> output of tools like numactl -H then those could be updated and hide
> >> nodes without any memory&cpus.
> >>
> >> The autonuma problem sounds interesting but again this patch doesn't
> >> really solve the underlying problem because I strongly suspect that the
> >> problem is still there when a numa node gets all its memory offline as
> >> mentioned above.
> >>
> >> While I completely agree that making node 0 special is wrong, I have
> >> still hard time to review this very simply looking patch because all the
> >> numa initialization is so spread around that this might just blow up
> >> at unexpected places. IIRC we have discussed testing in the previous
> >> version and David has provided a way to emulate these configurations
> >> on x86. Did you manage to use those instruction for additional testing
> >> on other than ppc architectures?
> >>
> >
> > I have tried all the steps that David mentioned and reported back at
> > https://lore.kernel.org/lkml/[email protected]/t/#u
> >
> > As a summary, David's steps are still not creating a memoryless/cpuless on
> > x86 VM.
>
> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> online*. Once you hotplug some memory, it will switch online. Once you
> remove memory, it will switch back offline.
>

Let me clarify: we are looking for a node 0 which is cpuless/memoryless at
boot. The code in question tries to handle a cpuless/memoryless node 0 at
boot.

With the steps that you gave, node 0 was always populated; node 1 or some
other node would be memoryless/cpuless and offline. But that case is not
affected by this patch.

I don't see how adding/hotplugging/removing memory to a node after boot is
going to affect the changes that I have made. Please do correct me if I
have misunderstood.


--
Thanks and Regards
Srikar Dronamraju

2020-07-01 11:06:56

by David Hildenbrand

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On 01.07.20 13:01, Srikar Dronamraju wrote:
> * David Hildenbrand <[email protected]> [2020-07-01 12:15:54]:
>
>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>> * Michal Hocko <[email protected]> [2020-07-01 10:42:00]:
>>>
>>>>
>>>>>
>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>>>> number of online nodes is inconsistent with the information in the
>>>>> device-tree and resource-dump
>>>>>
>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>>>> the hit from the unnecessary numa hinting faults.
>>>>
>>>> I have to say that I dislike the node online/offline state and directly
>>>> exporting that to the userspace. Users should only care whether the node
>>>> has memory/cpus. Numa nodes can be online without any memory. Just
>>>> offline all the present memory blocks but do not physically hot remove
>>>> them and you are in the same situation. If users are confused by an
>>>> output of tools like numactl -H then those could be updated and hide
>>>> nodes without any memory&cpus.
>>>>
>>>> The autonuma problem sounds interesting but again this patch doesn't
>>>> really solve the underlying problem because I strongly suspect that the
>>>> problem is still there when a numa node gets all its memory offline as
>>>> mentioned above.
>>>>
>>>> While I completely agree that making node 0 special is wrong, I have
>>>> still hard time to review this very simply looking patch because all the
>>>> numa initialization is so spread around that this might just blow up
>>>> at unexpected places. IIRC we have discussed testing in the previous
>>>> version and David has provided a way to emulate these configurations
>>>> on x86. Did you manage to use those instruction for additional testing
>>>> on other than ppc architectures?
>>>>
>>>
>>> I have tried all the steps that David mentioned and reported back at
>>> https://lore.kernel.org/lkml/[email protected]/t/#u
>>>
>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>> x86 VM.
>>
>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>> online*. Once you hotplug some memory, it will switch online. Once you
>> remove memory, it will switch back offline.
>>
>
> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> boot. The code in question tries to handle a cpuless/memoryless node 0 at
> boot.

I was just correcting your statement, because it was wrong.

Could be that the x86 code maps PXM 1 to node 0 because PXM 0 has neither
CPUs nor memory. That would imply that we can, in fact, never have node 0
offline during boot.

--
Thanks,

David / dhildenb

2020-07-01 11:34:51

by David Hildenbrand

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On 01.07.20 13:06, David Hildenbrand wrote:
> On 01.07.20 13:01, Srikar Dronamraju wrote:
>> * David Hildenbrand <[email protected]> [2020-07-01 12:15:54]:
>>
>>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>>> * Michal Hocko <[email protected]> [2020-07-01 10:42:00]:
>>>>
>>>>>
>>>>>>
>>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>>>>> number of online nodes is inconsistent with the information in the
>>>>>> device-tree and resource-dump
>>>>>>
>>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>>>>> the hit from the unnecessary numa hinting faults.
>>>>>
>>>>> I have to say that I dislike the node online/offline state and directly
>>>>> exporting that to the userspace. Users should only care whether the node
>>>>> has memory/cpus. Numa nodes can be online without any memory. Just
>>>>> offline all the present memory blocks but do not physically hot remove
>>>>> them and you are in the same situation. If users are confused by an
>>>>> output of tools like numactl -H then those could be updated and hide
>>>>> nodes without any memory&cpus.
>>>>>
>>>>> The autonuma problem sounds interesting but again this patch doesn't
>>>>> really solve the underlying problem because I strongly suspect that the
>>>>> problem is still there when a numa node gets all its memory offline as
>>>>> mentioned above.
>>>>>
>>>>> While I completely agree that making node 0 special is wrong, I have
>>>>> still hard time to review this very simply looking patch because all the
>>>>> numa initialization is so spread around that this might just blow up
>>>>> at unexpected places. IIRC we have discussed testing in the previous
>>>>> version and David has provided a way to emulate these configurations
>>>>> on x86. Did you manage to use those instruction for additional testing
>>>>> on other than ppc architectures?
>>>>>
>>>>
>>>> I have tried all the steps that David mentioned and reported back at
>>>> https://lore.kernel.org/lkml/[email protected]/t/#u
>>>>
>>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>>> x86 VM.
>>>
>>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>>> online*. Once you hotplug some memory, it will switch online. Once you
>>> remove memory, it will switch back offline.
>>>
>>
>> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
>> boot. The code in question tries to handle a cpuless/memoryless node 0 at
>> boot.
>
> I was just correcting your statement, because it was wrong.
>
> Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> have CPUs nor memory. That would imply that we can, in fact, never have
> node 0 offline during boot.
>

Yep, looks like it.

[ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
[ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
[ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
[ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
[ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
[ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
[ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]



--
Thanks,

David / dhildenb

2020-07-01 12:22:04

by Michal Hocko

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> On 01.07.20 13:06, David Hildenbrand wrote:
> > On 01.07.20 13:01, Srikar Dronamraju wrote:
> >> * David Hildenbrand <[email protected]> [2020-07-01 12:15:54]:
> >>
> >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> >>>> * Michal Hocko <[email protected]> [2020-07-01 10:42:00]:
> >>>>
> >>>>>
> >>>>>>
> >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> >>>>>> number of online nodes is inconsistent with the information in the
> >>>>>> device-tree and resource-dump
> >>>>>>
> >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>>>>> the hit from the unnecessary numa hinting faults.
> >>>>>
> >>>>> I have to say that I dislike the node online/offline state and directly
> >>>>> exporting that to the userspace. Users should only care whether the node
> >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> >>>>> offline all the present memory blocks but do not physically hot remove
> >>>>> them and you are in the same situation. If users are confused by an
> >>>>> output of tools like numactl -H then those could be updated and hide
> >>>>> nodes without any memory&cpus.
> >>>>>
> >>>>> The autonuma problem sounds interesting but again this patch doesn't
> >>>>> really solve the underlying problem because I strongly suspect that the
> >>>>> problem is still there when a numa node gets all its memory offline as
> >>>>> mentioned above.

I would really appreciate feedback on these two as well.

> >>>>> While I completely agree that making node 0 special is wrong, I have
> >>>>> still hard time to review this very simply looking patch because all the
> >>>>> numa initialization is so spread around that this might just blow up
> >>>>> at unexpected places. IIRC we have discussed testing in the previous
> >>>>> version and David has provided a way to emulate these configurations
> >>>>> on x86. Did you manage to use those instruction for additional testing
> >>>>> on other than ppc architectures?
> >>>>>
> >>>>
> >>>> I have tried all the steps that David mentioned and reported back at
> >>>> https://lore.kernel.org/lkml/[email protected]/t/#u
> >>>>
> >>>> As a summary, David's steps are still not creating a memoryless/cpuless on
> >>>> x86 VM.
> >>>
> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> >>> online*. Once you hotplug some memory, it will switch online. Once you
> >>> remove memory, it will switch back offline.
> >>>
> >>
> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> >> boot. The code in question tries to handle a cpuless/memoryless node 0 at
> >> boot.
> >
> > I was just correcting your statement, because it was wrong.
> >
> > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > have CPUs nor memory. That would imply that we can, in fact, never have
> > node 0 offline during boot.
> >
>
> Yep, looks like it.
>
> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]

This begs the question: can ppc do the same thing?

I would swear that we've had an x86 system with node 0 offline, but I
cannot really find it, and it is possible that it was not x86 after all...
--
Michal Hocko
SUSE Labs

2020-07-01 12:26:46

by Michal Hocko

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote:
> * Christopher Lameter <[email protected]> [2020-06-29 14:58:40]:
>
> > On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> >
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot. However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> >
> > Maybe add something to explain why you are not simply mapping the
> > existing memory to NUMA node 0 which is after all just a numbering scheme
> > used by the kernel and can be used arbitrarily?
> >
>
> I thought Michal Hocko already gave a clear picture on why mapping is a bad
> idea. https://lore.kernel.org/lkml/[email protected]/t/#u
> Are you suggesting that we add that as part of the changelog?

Well, I was not aware that x86 already does renumbering. So there is a
certain precedent. As I've said, I do not really like that, but it is
what is already happening. If renumbering is not an option then just
handle that in the ppc code explicitly. A generic solution would be
preferable of course, but as I've said it is really hard to check for
correctness and potential subtle issues.

--
Michal Hocko
SUSE Labs

2020-07-02 06:46:11

by Srikar Dronamraju

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* Michal Hocko <[email protected]> [2020-07-01 14:21:10]:

> > >>>>>>
> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > >>>>>> number of online nodes is inconsistent with the information in the
> > >>>>>> device-tree and resource-dump
> > >>>>>>
> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > >>>>>> the hit from the unnecessary numa hinting faults.
> > >>>>>
> > >>>>> I have to say that I dislike the node online/offline state and directly
> > >>>>> exporting that to the userspace. Users should only care whether the node
> > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > >>>>> offline all the present memory blocks but do not physically hot remove
> > >>>>> them and you are in the same situation. If users are confused by an
> > >>>>> output of tools like numactl -H then those could be updated and hide
> > >>>>> nodes without any memory&cpus.
> > >>>>>
> > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > >>>>> really solve the underlying problem because I strongly suspect that the
> > >>>>> problem is still there when a numa node gets all its memory offline as
> > >>>>> mentioned above.
>
> I would really appreciate a feedback to these two as well.

1. It's not just numactl that has to be fixed but all tools/utilities that
depend on /sys/devices/system/node/online. Are we saying not to rely on or
believe the output given by the kernel but to do further verification?

Also, how would user space differentiate the case where the kernel missed
marking a node as offline from the case where the memory was offlined on a
cpuless node but the node wasn't marked offline?

2. Regarding autonuma, the case of offline memory is user/admin driven, so
if there is a performance hit, it is something driven by user/admin
actions. Also, how often do we see users offline the complete memory of a
cpuless node on a 2-node system?

>
> > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>
> This begs a question whether ppc can do the same thing?

Certainly ppc can be made to adapt to this situation, but that would be a
workaround. Do we have a reason why we think node 0 is unique and special?
If yes, can we document it so that in the future people also know why we
consider node 0 to be special? I do understand the *fear of the unknown*,
but when we are unable to theoretically or practically come up with a
case, then it may be better to hit the situation and understand what that
unknown is.

> I would swear that we've had x86 system with node 0 but I cannot really
> find it and it is possible that it was not x86 after all...

--
Thanks and Regards
Srikar Dronamraju

2020-07-02 08:42:59

by Michal Hocko

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> * Michal Hocko <[email protected]> [2020-07-01 14:21:10]:
>
> > > >>>>>>
> > > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > > >>>>>> number of online nodes is inconsistent with the information in the
> > > >>>>>> device-tree and resource-dump
> > > >>>>>>
> > > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > > >>>>>> the hit from the unnecessary numa hinting faults.
> > > >>>>>
> > > >>>>> I have to say that I dislike the node online/offline state and directly
> > > >>>>> exporting that to the userspace. Users should only care whether the node
> > > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > > >>>>> offline all the present memory blocks but do not physically hot remove
> > > >>>>> them and you are in the same situation. If users are confused by an
> > > >>>>> output of tools like numactl -H then those could be updated and hide
> > > >>>>> nodes without any memory&cpus.
> > > >>>>>
> > > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > > >>>>> really solve the underlying problem because I strongly suspect that the
> > > >>>>> problem is still there when a numa node gets all its memory offline as
> > > >>>>> mentioned above.
> >
> > I would really appreciate a feedback to these two as well.
>
> 1. Its not just numactl that's to be fixed but all tools/utilities that
> depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> in the output given by the kernel but do further verification?

No, what we are saying is that even an online node might have zero online
pages/cpus. So the online status is not really something that matters. If
people are confused by that output then user space tools can make their
confusion go away. I really do not understand why the kernel should do any
logic there.

> Also how would the user space differentiate between the case where the
> Kernel missed marking a node as offline to the case where the memory was
> offlined on a cpuless node but node wasn't offline?.

What I am arguing is that those two shouldn't be any different. Really!

> 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> so if there is a performance hit, its something that's driven by his
> user/admin actions. Also how often do we see users offline complete memory
> of cpuless node on a 2 node system?

How often do we see crippled HW configurations like that? Really, if
autonuma should be made more clever for one case, it should recognize the
other as well.

> > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >
> > This begs a question whether ppc can do the same thing?
>
> Certainly ppc can be made to adapt to this situation but that would be a
> workaround. Do we have a reason why we think node 0 is unique and special?

It is not, as replied in another email in this thread. I would hope for
fewer hacks in the numa initialization. Cleaning up the mess would be a
lot of work and testing on all NUMA-capable architectures. This is a
heritage from the past, I am afraid. All that I am arguing here is that
touching the generic code with a very simple-looking patch might have side
effects which are pretty much impossible to review. Moreover, it seems
that nothing but ppc really needs this treatment, so fixing it in
ppc-specific code sounds much safer.

Normally I would really push for a generic solution, but after getting
burned several times in this area I do not dare anymore. The problem is
not the code complexity but how spread out it is, in places where you do
not expect side effects.

--
Michal Hocko
SUSE Labs

2020-07-02 14:37:29

by Srikar Dronamraju

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* Michal Hocko <[email protected]> [2020-07-02 10:41:23]:

> On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> > * Michal Hocko <[email protected]> [2020-07-01 14:21:10]:
> >
> > > > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > > > >>>>> really solve the underlying problem because I strongly suspect that the
> > > > >>>>> problem is still there when a numa node gets all its memory offline as
> > > > >>>>> mentioned above.
> > >
> > > I would really appreciate a feedback to these two as well.
> >
> > 1. Its not just numactl that's to be fixed but all tools/utilities that
> > depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> > in the output given by the kernel but do further verification?
>
> No, what we are saying is that even an online node might have zero
> number of online pages/cpus. So the online status is not really
> something that matters. If people are confused by that output then user
> space tools can make their confusion go away. I really do not understand
> why the kernel should do any logic there.

The user-facing teams are saying they are getting queries from users who
are unable to understand, from the tools/sysfs files, why a node is online
but has no attached resources. It's the amount of time being spent on
these issues that triggered the patch. Initially even I was skeptical and
considered this a non-issue.

>
> > Also how would the user space differentiate between the case where the
> > Kernel missed marking a node as offline to the case where the memory was
> > offlined on a cpuless node but node wasn't offline?.
>
> What I am arguing is that those two shouldn't be any different. Really!
>
> > 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> > so if there is a performance hit, its something that's driven by his
> > user/admin actions. Also how often do we see users offline complete memory
> > of cpuless node on a 2 node system?
>
> How often do we see crippled HW configurations like that? Really if
> autonuma should be made more clever for one case it should recognize the
> other as well.
>

Let's take a 16-socket PowerVM system and assume that 32 lpars are created
on that system, i.e. 2 lpars for each socket. (PowerVM has the final say
on how the lpars are created.) In such a case, we can expect 30 out of the
32 lpars to face this problem, with only the 2 lpars that actually run on
socket 0 having the correct configuration.

> > >
> > > This begs a question whether ppc can do the same thing?
> >
> > Certainly ppc can be made to adapt to this situation but that would be a
> > workaround. Do we have a reason why we think node 0 is unique and special?
>
> It is not. As replied in other email in this thread. I would hope for
> having less hacks in the numa initialization. Cleaning up the mess is
> would be a lot of work and testing on all NUMA capable architectures.
> This is a heritage from the past I am afraid. All that I am arguing here
> is that your touch to the generic code with a very simple looking patch
> might have side effects which are pretty much impossible to review.
> Moreover it seems that nothing but ppc really needs this treatment.
> So fixing it in ppc specific code sounds much more safe.
>
> Normally I would really push for a generic solution but after getting
> burned several times in this area I do not dare anymore. The problem is
> not in the code complexity but in how spread it is in places where you
> do not expect side effects.
>

I do understand and respect your viewpoint.

> --
> Michal Hocko
> SUSE Labs

--
Thanks and Regards
Srikar Dronamraju

2020-07-03 09:16:40

by Michal Suchánek

Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> > On 01.07.20 13:06, David Hildenbrand wrote:
> > > On 01.07.20 13:01, Srikar Dronamraju wrote:
> > >> * David Hildenbrand <[email protected]> [2020-07-01 12:15:54]:
> > >>
> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > >>>> * Michal Hocko <[email protected]> [2020-07-01 10:42:00]:
> > >>>>
> > >>>>>
> > >>>>>>
> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > >>>>>> number of online nodes is inconsistent with the information in the
> > >>>>>> device-tree and resource-dump
> > >>>>>>
> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > >>>>>> the hit from the unnecessary numa hinting faults.
> > >>>>>
> > >>>>> I have to say that I dislike the node online/offline state and directly
> > >>>>> exporting that to the userspace. Users should only care whether the node
> > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > >>>>> offline all the present memory blocks but do not physically hot remove
> > >>>>> them and you are in the same situation. If users are confused by an
> > >>>>> output of tools like numactl -H then those could be updated and hide
> > >>>>> nodes without any memory&cpus.
> > >>>>>
> > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > >>>>> really solve the underlying problem because I strongly suspect that the
> > >>>>> problem is still there when a numa node gets all its memory offline as
> > >>>>> mentioned above.
>
> I would really appreciate feedback on these two as well.
>
> > >>>>> While I completely agree that making node 0 special is wrong, I still
> > >>>>> have a hard time reviewing this very simple-looking patch because all the
> > >>>>> numa initialization is so spread around that this might just blow up
> > >>>>> at unexpected places. IIRC we have discussed testing in the previous
> > >>>>> version and David has provided a way to emulate these configurations
> > >>>>> on x86. Did you manage to use those instructions for additional testing
> > >>>>> on architectures other than ppc?
> > >>>>>
> > >>>>
> > >>>> I have tried all the steps that David mentioned and reported back at
> > >>>> https://lore.kernel.org/lkml/[email protected]/t/#u
> > >>>>
> > >>>> As a summary, David's steps are still not creating a memoryless/cpuless
> > >>>> node on an x86 VM.
> > >>>
> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> > >>> online*. Once you hotplug some memory, it will switch online. Once you
> > >>> remove memory, it will switch back offline.
> > >>>
> > >>
> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> > >> boot. The code in question tries to handle a cpuless/memoryless node 0 at
> > >> boot.
> > >
> > > I was just correcting your statement, because it was wrong.
> > >
> > > Could be that x86 code maps PXM 1 to node 0 because PXM 0 has neither
> > > CPUs nor memory. That would imply that we can, in fact, never have
> > > node 0 offline during boot.
> > >
> >
> > Yep, looks like it.
> >
> > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>
> This begs a question whether ppc can do the same thing?
Or x86 stop doing it so that you can see on what node you are running?

What's the point of this indirection other than another way of avoiding
empty node 0?

Thanks

Michal

2020-07-03 09:25:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

[Cc Andi]

On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
[...]
> > > Yep, looks like it.
> > >
> > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >
> > This begs a question whether ppc can do the same thing?
> Or x86 stop doing it so that you can see on what node you are running?
>
> What's the point of this indirection other than another way of avoiding
> empty node 0?

Honestly, I do not have any idea. I've traced it down to
Author: Andi Kleen <[email protected]>
Date: Tue Jan 11 15:35:48 2005 -0800

[PATCH] x86_64: Fix ACPI SRAT NUMA parsing

Fix fallout from the recent nodemask_t changes. The node ids assigned
in the SRAT parser were off by one.

I added a new first_unset_node() function to nodemask.h to allocate
IDs sanely.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

which doesn't really tell all that much. The historical baggage and a
long term behavior which is not really trivial to fix I suspect.
--
Michal Hocko
SUSE Labs

2020-07-03 11:00:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> [Cc Andi]
>
> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> [...]
> > > > Yep, looks like it.
> > > >
> > > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> > >
> > > This begs a question whether ppc can do the same thing?
> > Or x86 stop doing it so that you can see on what node you are running?
> >
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
>
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen <[email protected]>
> Date: Tue Jan 11 15:35:48 2005 -0800
>
> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>
> Fix fallout from the recent nodemask_t changes. The node ids assigned
> in the SRAT parser were off by one.
>
> I added a new first_unset_node() function to nodemask.h to allocate
> IDs sanely.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> which doesn't really tell all that much. The historical baggage and a
> long term behavior which is not really trivial to fix I suspect.

Thinking about this some more, this logic makes some sense after all.
Especially in a world without memory hotplug, which was very likely the
case back then. It is much better to have a compact node mask rather than
a sparse one. After all node numbers shouldn't really matter as long as
you have a clear mapping to the HW. I am not sure we export that
information (except for the kernel ring buffer) though.
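
Scraping it out of the ring buffer is at least possible; a rough sketch
against the SRAT lines quoted above, assuming they are still in dmesg and
the format matches:

$ dmesg | sed -n 's/.*SRAT: Node \([0-9]*\) PXM \([0-9]*\).*/node \1 <- PXM \2/p' | sort -u
# emits one "node N <- PXM P" line per SRAT memory affinity entry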

The memory hotplug changes that somehow because you can hotremove numa
nodes and therefore make the nodemask sparse but that is not a common
case. I am not sure what would happen if a completely new node was added
and its corresponding node was already used by the renumbered one
though. It would likely conflate the two I am afraid. But I am not sure
this is really possible with x86 and a lack of a bug report would
suggest that nobody is doing that at least.

--
Michal Hocko
SUSE Labs

2020-07-03 11:35:45

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
>>>>> Yep, looks like it.
>>>>>
>>>>> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
>>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
>>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
>>>>> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
>>>>> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
>>>>> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
>>>>> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>>>>
>>>> This begs a question whether ppc can do the same thing?
>>> Or x86 stop doing it so that you can see on what node you are running?
>>>
>>> What's the point of this indirection other than another way of avoiding
>>> empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>> Author: Andi Kleen <[email protected]>
>> Date: Tue Jan 11 15:35:48 2005 -0800
>>
>> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>> Fix fallout from the recent nodemask_t changes. The node ids assigned
>> in the SRAT parser were off by one.
>>
>> I added a new first_unset_node() function to nodemask.h to allocate
>> IDs sanely.
>>
>> Signed-off-by: Andi Kleen <[email protected]>
>> Signed-off-by: Linus Torvalds <[email protected]>
>>
>> which doesn't really tell all that much. The historical baggage and a
>> long term behavior which is not really trivial to fix I suspect.
>
> Thinking about this some more, this logic makes some sense after all.
> Especially in a world without memory hotplug, which was very likely the
> case back then. It is much better to have a compact node mask rather than
> a sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
>
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
>

I think the ACPI code takes care of properly mapping PXM to nodes.

So if I start with PXM 0 empty and PXM 1 populated, I will get
PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[ 50.174435] ------------[ cut here ]------------
[ 50.175436] node 1 was absent from the node_possible_map
[ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290
[ 50.176844] Modules linked in:
[ 50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[ 50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[ 50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[ 50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[ 50.176849] RSP: 0018:ffffa7a1c0043d48 EFLAGS: 00010296
[ 50.176850] RAX: 000000000000002c RBX: ffff8bc633e63b80 RCX: 0000000000000000
[ 50.176851] RDX: ffff8bc63bc27060 RSI: ffff8bc63bc18d00 RDI: ffff8bc63bc18d00
[ 50.176851] RBP: 0000000000000001 R08: 00000000000001e1 R09: ffffa7a1c0043bd8
[ 50.176852] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000140000000
[ 50.176852] R13: 000000017fffffff R14: 0000000040000000 R15: 0000000180000000
[ 50.176853] FS: 0000000000000000(0000) GS:ffff8bc63bc00000(0000) knlGS:0000000000000000
[ 50.176853] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.176855] CR2: 000055dfcbfc5ee8 CR3: 00000000aca0a000 CR4: 00000000000006f0
[ 50.176855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.176856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 50.176856] Call Trace:
[ 50.176856] __add_memory+0x33/0x70
[ 50.176857] acpi_memory_device_add+0x132/0x2f2
[ 50.176857] acpi_bus_attach+0xd2/0x200
[ 50.176858] acpi_bus_scan+0x33/0x70
[ 50.176858] acpi_device_hotplug+0x298/0x390
[ 50.176858] acpi_hotplug_work_fn+0x3d/0x50
[ 50.176859] process_one_work+0x1b4/0x370
[ 50.176859] worker_thread+0x53/0x3e0
[ 50.176860] ? process_one_work+0x370/0x370
[ 50.176860] kthread+0x119/0x140
[ 50.176860] ? __kthread_bind_mask+0x60/0x60
[ 50.176861] ret_from_fork+0x22/0x30
[ 50.176861] ---[ end trace 9a2a837c1e0164f1 ]---
[ 50.209816] acpi PNP0C80:00: add_memory failed
[ 50.210510] acpi PNP0C80:00: acpi_memory_enable_device() error
[ 50.211445] acpi PNP0C80:00: Enumeration failure


I remember that we added that check just recently (due to powerpc if I am not wrong).
Not sure why that triggers here.

But it properly maps PXM 0 to node 1.
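
For anyone trying to reproduce this, a quick sanity check from inside the
guest before plugging the DIMM (just a sketch using the standard sysfs
node masks):

$ cat /sys/devices/system/node/possible
$ cat /sys/devices/system/node/online
# if the node that PXM 0 gets remapped to is missing from "possible",
# add_memory_resource() trips the warning shown above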

--
Thanks,

David / dhildenb

2020-07-03 11:46:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> >>>>> Yep, looks like it.
> >>>>>
> >>>>> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> >>>>> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> >>>>> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> >>>>> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> >>>>> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >>>>
> >>>> This begs a question whether ppc can do the same thing?
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen <[email protected]>
> >> Date: Tue Jan 11 15:35:48 2005 -0800
> >>
> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >> Fix fallout from the recent nodemask_t changes. The node ids assigned
> >> in the SRAT parser were off by one.
> >>
> >> I added a new first_unset_node() function to nodemask.h to allocate
> >> IDs sanely.
> >>
> >> Signed-off-by: Andi Kleen <[email protected]>
> >> Signed-off-by: Linus Torvalds <[email protected]>
> >>
> >> which doesn't really tell all that much. The historical baggage and a
> >> long term behavior which is not really trivial to fix I suspect.
> >
> > Thinking about this some more, this logic makes some sense after all.
> > Especially in a world without memory hotplug, which was very likely the
> > case back then. It is much better to have a compact node mask rather than
> > a sparse one. After all node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> >
> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> >
>
> I think the ACPI code takes care of properly mapping PXM to nodes.
>
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
>
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor
>
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
>
> [ 50.174435] ------------[ cut here ]------------
> [ 50.175436] node 1 was absent from the node_possible_map
> [ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290

This would mean that the ACPI code, or whoever does the remapping, is not
adding the new node into the possible nodes.

[...]
> I remember that we added that check just recently (due to powerpc if I am not wrong).
> Not sure why that triggers here.

IIRC this was a misbehaving QEMU providing a garbage map.

--
Michal Hocko
SUSE Labs

2020-07-03 13:00:13

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* Michal Hocko <[email protected]> [2020-07-03 12:59:44]:

> > Honestly, I do not have any idea. I've traced it down to
> > Author: Andi Kleen <[email protected]>
> > Date: Tue Jan 11 15:35:48 2005 -0800
> >
> > [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >
> > Fix fallout from the recent nodemask_t changes. The node ids assigned
> > in the SRAT parser were off by one.
> >
> > I added a new first_unset_node() function to nodemask.h to allocate
> > IDs sanely.
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> > Signed-off-by: Linus Torvalds <[email protected]>
> >
> > which doesn't really tell all that much. The historical baggage and a
> > long term behavior which is not really trivial to fix I suspect.
>
> Thinking about this some more, this logic makes some sense after all.
> Especially in a world without memory hotplug, which was very likely the
> case back then. It is much better to have a compact node mask rather than
> a sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
>
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
>

JFYI,
Satheesh, copied on this mail chain, had opened a bug a year ago on a crash
with vcpu hotplug on a memoryless node.

https://bugzilla.kernel.org/show_bug.cgi?id=202187

--
Thanks and Regards
Srikar Dronamraju

2020-07-06 16:09:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
>
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen <[email protected]>
> Date: Tue Jan 11 15:35:48 2005 -0800

I don't remember all the details, and I can't even find the commit
(is it in linux-historic?).

But AFAIK there's no guarantee PXMs are small and contiguous, so it
seemed better to have a clean zero-based space.
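
To illustrate the idea, a toy sketch in plain bash, with made-up PXM
values and nothing to do with the real SRAT parser:

declare -A node_of; next=0
for pxm in 7 1 7 12; do
    # hand out the next free zero-based id the first time a PXM shows up,
    # in the spirit of first_unset_node()
    [ -z "${node_of[$pxm]}" ] && node_of[$pxm]=$((next++))
    echo "PXM $pxm -> node ${node_of[$pxm]}"
done
# prints: 7 -> 0, 1 -> 1, 7 -> 0 (again), 12 -> 2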

Back then we had a lot of problems with buggy SRAT tables in BIOS,
so we really tried to avoid trusting the BIOS as much as possible.

-Andi

2020-08-07 04:33:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <[email protected]> wrote:

> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> >
>
> JFYI,
> Satheesh, copied on this mail chain, had opened a bug a year ago on a crash
> with vcpu hotplug on a memoryless node.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=202187

So... do we merge this patch or not? Seems that the overall view is
"risky but nobody is likely to do anything better any time soon"?

2020-08-07 06:59:33

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On 07.08.20 06:32, Andrew Morton wrote:
> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <[email protected]> wrote:
>
>>> The memory hotplug changes that somehow because you can hotremove numa
>>> nodes and therefore make the nodemask sparse but that is not a common
>>> case. I am not sure what would happen if a completely new node was added
>>> and its corresponding node was already used by the renumbered one
>>> though. It would likely conflate the two I am afraid. But I am not sure
>>> this is really possible with x86 and a lack of a bug report would
>>> suggest that nobody is doing that at least.
>>>
>>
>> JFYI,
>> Satheesh, copied on this mail chain, had opened a bug a year ago on a crash
>> with vcpu hotplug on a memoryless node.
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202187
>
> So... do we merge this patch or not? Seems that the overall view is
> "risky but nobody is likely to do anything better any time soon"?

I recall the issue Michal saw was "fix powerpc" vs. "break other
architectures". @Michal how should we proceed? At least x86-64 won't be
affected IIUC.

--
Thanks,

David / dhildenb

2020-08-07 10:06:59

by Michal Suchánek

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Fri, Aug 07, 2020 at 08:58:09AM +0200, David Hildenbrand wrote:
> On 07.08.20 06:32, Andrew Morton wrote:
> > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <[email protected]> wrote:
> >
> >>> The memory hotplug changes that somehow because you can hotremove numa
> >>> nodes and therefore make the nodemask sparse but that is not a common
> >>> case. I am not sure what would happen if a completely new node was added
> >>> and its corresponding node was already used by the renumbered one
> >>> though. It would likely conflate the two I am afraid. But I am not sure
> >>> this is really possible with x86 and a lack of a bug report would
> >>> suggest that nobody is doing that at least.
> >>>
> >>
> >> JFYI,
> >> Satheesh, copied on this mail chain, had opened a bug a year ago on a crash
> >> with vcpu hotplug on a memoryless node.
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=202187
> >
> > So... do we merge this patch or not? Seems that the overall view is
> > "risky but nobody is likely to do anything better any time soon"?
>
> I recall the issue Michal saw was "fix powerpc" vs. "break other
> architectures". @Michal how should we proceed? At least x86-64 won't be
> affected IIUC.
There is a patch to introduce the node remapping on ppc as well which
should eliminate the empty node 0.

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/

Thanks

Michal

2020-08-12 06:03:13

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

Hi Andrew, Michal, David

* Andrew Morton <[email protected]> [2020-08-06 21:32:11]:

> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <[email protected]> wrote:
>
> > > The memory hotplug changes that somehow because you can hotremove numa
> > > nodes and therefore make the nodemask sparse but that is not a common
> > > case. I am not sure what would happen if a completely new node was added
> > > and its corresponding node was already used by the renumbered one
> > > though. It would likely conflate the two I am afraid. But I am not sure
> > > this is really possible with x86 and a lack of a bug report would
> > > suggest that nobody is doing that at least.
> > >
> >
> > JFYI,
> > Satheesh, copied on this mail chain, had opened a bug a year ago on a crash
> > with vcpu hotplug on a memoryless node.
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=202187
>
> So... do we merge this patch or not? Seems that the overall view is
> "risky but nobody is likely to do anything better any time soon"?

Can we decide on this one way or the other?

--
Thanks and Regards
Srikar Dronamraju

2020-08-18 07:34:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On 12.08.20 08:01, Srikar Dronamraju wrote:
> Hi Andrew, Michal, David
>
> * Andrew Morton <[email protected]> [2020-08-06 21:32:11]:
>
>> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <[email protected]> wrote:
>>
>>>> The memory hotplug changes that somehow because you can hotremove numa
>>>> nodes and therefore make the nodemask sparse but that is not a common
>>>> case. I am not sure what would happen if a completely new node was added
>>>> and its corresponding node was already used by the renumbered one
>>>> though. It would likely conflate the two I am afraid. But I am not sure
>>>> this is really possible with x86 and a lack of a bug report would
>>>> suggest that nobody is doing that at least.
>>>>
>>>
>>> JFYI,
>>> Satheesh, copied on this mail chain, had opened a bug a year ago on a crash
>>> with vcpu hotplug on a memoryless node.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202187
>>
>> So... do we merge this patch or not? Seems that the overall view is
>> "risky but nobody is likely to do anything better any time soon"?
>
> Can we decide on this one way or the other?

Hmm, not sure who's the person to decide. I tend to prefer doing the
node renaming, handling this in ppc code; looking at the review of v2
there are still some concerns regarding numa distances.

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/

--
Thanks,

David / dhildenb

2020-08-18 07:40:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

On Tue 18-08-20 09:32:52, David Hildenbrand wrote:
> On 12.08.20 08:01, Srikar Dronamraju wrote:
> > Hi Andrew, Michal, David
> >
> > * Andrew Morton <[email protected]> [2020-08-06 21:32:11]:
> >
> >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <[email protected]> wrote:
> >>
> >>>> The memory hotplug changes that somehow because you can hotremove numa
> >>>> nodes and therefore make the nodemask sparse but that is not a common
> >>>> case. I am not sure what would happen if a completely new node was added
> >>>> and its corresponding node was already used by the renumbered one
> >>>> though. It would likely conflate the two I am afraid. But I am not sure
> >>>> this is really possible with x86 and a lack of a bug report would
> >>>> suggest that nobody is doing that at least.
> >>>>
> >>>
> >>> JFYI,
> > >>> Satheesh, copied on this mail chain, had opened a bug a year ago on a crash
> > >>> with vcpu hotplug on a memoryless node.
> >>>
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202187
> >>
> >> So... do we merge this patch or not? Seems that the overall view is
> >> "risky but nobody is likely to do anything better any time soon"?
> >
> > Can we decide on this one way or the other?
>
> Hmm, not sure who's the person to decide. I tend to prefer doing the
> node renaming, handling this in ppc code;

Agreed. That would be a safer option.
--
Michal Hocko
SUSE Labs

2020-08-18 07:50:53

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

* Michal Hocko <[email protected]> [2020-08-18 09:37:12]:

> On Tue 18-08-20 09:32:52, David Hildenbrand wrote:
> > On 12.08.20 08:01, Srikar Dronamraju wrote:
> > > Hi Andrew, Michal, David
> > >
> > > * Andrew Morton <[email protected]> [2020-08-06 21:32:11]:
> > >
> > >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <[email protected]> wrote:
> > >>
> > >>>> The memory hotplug changes that somehow because you can hotremove numa
> > >>>> nodes and therefore make the nodemask sparse but that is not a common
> > >>>> case. I am not sure what would happen if a completely new node was added
> > >>>> and its corresponding node was already used by the renumbered one
> > >>>> though. It would likely conflate the two I am afraid. But I am not sure
> > >>>> this is really possible with x86 and a lack of a bug report would
> > >>>> suggest that nobody is doing that at least.
> > >>>>
> > >> So... do we merge this patch or not? Seems that the overall view is
> > >> "risky but nobody is likely to do anything better any time soon"?
> > >
> > > Can we decide on this one way or the other?
> >
> > Hmm, not sure who's the person to decide. I tend to prefer doing the
> > node renaming, handling this in ppc code;
>
> Agreed. That would be a safer option.

Okay, will send an arch-specific v6 version.

> --
> Michal Hocko
> SUSE Labs

--
Thanks and Regards
Srikar Dronamraju