Hi,
I was about to send this patch as part of a cleanup patchset that goes on
top of Michal's latest work on properly initializing memoryless nodes [1],
but I thought I would send it as a standalone one, as it is the only real
fix of that series.
While discussing the fix with Michal, another concern came up: whether
memoryless nodes that do have CPUs should be marked online at all.
Here Michal and I have different opinions: while I think that
memoryless nodes whose CPUs are online should be marked online,
he thinks that the 'online' property of a node has more to do
with memory, and I have to confess that after having a look at
some of the usages of for_each_online_node, most of them go on
to do something memory-related afterwards, so I guess he has a point
there.
My main concern is that if memoryless nodes do not get onlined, their
sysfs entries showing the cpu<->numa topology will not be created, and
I am not sure we can afford to lose that information.
E.g. I guess 'numactl -H' would not be able to show the right
topology, as it could not walk sysfs to get it (in case that is how it does it).
Anyway, as a quick test, I wanted to see what happens if init_cpu_to_node()
and init_gi_nodes() do not online the nodes
(those nodes get onlined because of their CPU and Generic Initiator
affinity).
The outcome is that we blow up badly down the chain, and we do so because
of a nasty side-effect (more information can be found at the end of
patch #1).
Long story short: by the time bringup_nonboot_cpus() gets called to
bring up the CPUs and their respective nodes, we have not even
registered the node_subsys bus yet, so
bringup_nonboot_cpus()->..->__try_online_node()->register_one_node()->bus_add_device()
will crash when dereferencing some of the bus' struct fields.
We happen not to crash now because init_cpu_to_node() and
init_gi_nodes() mark the node online, which has the effect of
__try_online_node() backing off and not calling register_one_node().
I think the whole thing is kinda broken, or at the very least subtle and
fragile.
I am willing to have a look at how we can improve this, but I wanted
to share it with you first.
Anyway, it is getting late here, so I just wanted to 1) send this quick
fix, 2) expose some nasty behavior and 3) kick off a discussion about
how to improve the situation.
[1] https://lore.kernel.org/lkml/[email protected]/
Oscar Salvador (1):
arch/x86/mm/numa: Do not initialize nodes twice
arch/x86/mm/numa.c | 15 ++-------------
include/linux/mm.h | 1 -
mm/page_alloc.c | 2 +-
3 files changed, 3 insertions(+), 15 deletions(-)
--
2.34.1
On x86, prior to ("mm: handle uninitialized numa nodes gracefully"),
NUMA nodes could be allocated at three different places.
- numa_register_memblks
- init_cpu_to_node
- init_gi_nodes
All these calls happen at setup_arch, and have the following order:
setup_arch
...
x86_numa_init
numa_init
numa_register_memblks
...
init_cpu_to_node
init_memory_less_node
alloc_node_data
free_area_init_memoryless_node
init_gi_nodes
init_memory_less_node
alloc_node_data
free_area_init_memoryless_node
numa_register_memblks() is only interested in those nodes which have memory,
so it skips over any memoryless node it finds.
Later on, once we have read ACPI's SRAT table, we call init_cpu_to_node()
and init_gi_nodes(), which initialize any memoryless node we might have
that has either CPU or Generic Initiator affinity, meaning we allocate a
pg_data_t struct for them and mark them as ONLINE.
So far so good, but the thing is that after ("mm: handle uninitialized numa
nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(),
meaning we have a picture like the following:
setup_arch
x86_numa_init
numa_init
numa_register_memblks <-- allocate non-memoryless node
x86_init.paging.pagetable_init
...
free_area_init
free_area_init_memoryless <-- allocate memoryless node
init_cpu_to_node
alloc_node_data <-- allocate memoryless node with CPU
free_area_init_memoryless_node
init_gi_nodes
alloc_node_data <-- allocate memoryless node with Initiator
free_area_init_memoryless_node
free_area_init() already allocates all possible NUMA nodes, but
init_cpu_to_node() and init_gi_nodes() are clueless about that,
so they go ahead and allocate a new pg_data_t struct without
checking anything, meaning we end up allocating twice.
It should be made clear that this only happens in the case where a
memoryless NUMA node happens to have CPU/Initiator affinity.
So get rid of init_memory_less_node() and just set the node online.
Note that setting the node online is needed; otherwise we choke
down the chain when bringup_nonboot_cpus() ends up calling
__try_online_node()->register_one_node()->... and we blow up in
bus_add_device(), as can be seen here:
=========
[ 0.585060] BUG: kernel NULL pointer dereference, address: 0000000000000060
[ 0.586091] #PF: supervisor read access in kernel mode
[ 0.586831] #PF: error_code(0x0000) - not-present page
[ 0.586930] PGD 0 P4D 0
[ 0.586930] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 0.586930] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
[ 0.586930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
[ 0.586930] RIP: 0010:bus_add_device+0x5a/0x140
[ 0.586930] Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
[ 0.586930] RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
[ 0.586930] RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
[ 0.586930] RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
[ 0.586930] RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
[ 0.586930] R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
[ 0.586930] R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
[ 0.586930] FS: 0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
[ 0.586930] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.586930] CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
[ 0.586930] Call Trace:
[ 0.586930] <TASK>
[ 0.586930] device_add+0x4c0/0x910
[ 0.586930] __register_one_node+0x97/0x2d0
[ 0.586930] __try_online_node+0x85/0xc0
[ 0.586930] try_online_node+0x25/0x40
[ 0.586930] cpu_up+0x4f/0x100
[ 0.586930] bringup_nonboot_cpus+0x4f/0x60
[ 0.586930] smp_init+0x26/0x79
[ 0.586930] kernel_init_freeable+0x130/0x2f1
[ 0.586930] ? rest_init+0x100/0x100
[ 0.586930] kernel_init+0x17/0x150
[ 0.586930] ? rest_init+0x100/0x100
[ 0.586930] ret_from_fork+0x22/0x30
[ 0.586930] </TASK>
[ 0.586930] Modules linked in:
[ 0.586930] CR2: 0000000000000060
[ 0.586930] ---[ end trace 0000000000000000 ]---
=========
The reason is simple: by the time bringup_nonboot_cpus() gets called,
we have not registered the node_subsys bus yet, so we crash when
bus_add_device() tries to dereference bus->p.
The following shows the order of the calls:
kernel_init_freeable
smp_init
bringup_nonboot_cpus
...
bus_add_device() <- we did not register node_subsys yet
do_basic_setup
do_initcalls
postcore_initcall(register_node_type);
register_node_type
subsys_system_register
subsys_register
bus_register <- register node_subsys bus
Why does setting the node online save us then? Well, simply because
__try_online_node() backs off when the node is already online, meaning
we do not end up calling register_one_node() in the first place.
This is subtle, broken, and deserves deep analysis and thought about
how to put it into shape, but for now let us have this easy fix for
the memory leak.
Signed-off-by: Oscar Salvador <[email protected]>
Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
---
arch/x86/mm/numa.c | 15 ++-------------
include/linux/mm.h | 1 -
mm/page_alloc.c | 2 +-
3 files changed, 3 insertions(+), 15 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c6b1213086d6..37039a6af8ae 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -738,17 +738,6 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
}
-static void __init init_memory_less_node(int nid)
-{
- /* Allocate and initialize node data. Memory-less node is now online.*/
- alloc_node_data(nid);
- free_area_init_memoryless_node(nid);
-
- /*
- * All zonelists will be built later in start_kernel() after per cpu
- * areas are initialized.
- */
-}
/*
* A node may exist which has one or more Generic Initiators but no CPUs and no
@@ -768,7 +757,7 @@ void __init init_gi_nodes(void)
for_each_node_state(nid, N_GENERIC_INITIATOR)
if (!node_online(nid))
- init_memory_less_node(nid);
+ node_set_online(nid);
}
/*
@@ -799,7 +788,7 @@ void __init init_cpu_to_node(void)
continue;
if (!node_online(node))
- init_memory_less_node(node);
+ node_set_online(node);
numa_set_node(cpu, node);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..9ff1c4c8449e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2453,7 +2453,6 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
}
extern void __init pagecache_init(void);
-extern void __init free_area_init_memoryless_node(int nid);
extern void free_initmem(void);
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 83da2279be72..967085c1c78a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7698,7 +7698,7 @@ static void __init free_area_init_node(int nid)
free_area_init_core(pgdat);
}
-void __init free_area_init_memoryless_node(int nid)
+static void __init free_area_init_memoryless_node(int nid)
{
free_area_init_node(nid);
}
--
2.34.1
On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> On x86, prior to ("mm: handle uninitialized numa nodes gracefully"),
> NUMA nodes could be allocated at three different places.
>
> - numa_register_memblks
> - init_cpu_to_node
> - init_gi_nodes
>
> All these calls happen at setup_arch, and have the following order:
>
> setup_arch
> ...
> x86_numa_init
> numa_init
> numa_register_memblks
> ...
> init_cpu_to_node
> init_memory_less_node
> alloc_node_data
> free_area_init_memoryless_node
> init_gi_nodes
> init_memory_less_node
> alloc_node_data
> free_area_init_memoryless_node
>
> numa_register_memblks() is only interested in those nodes which have memory,
> so it skips over any memoryless node it finds.
> Later on, once we have read ACPI's SRAT table, we call init_cpu_to_node()
> and init_gi_nodes(), which initialize any memoryless node we might have
> that has either CPU or Generic Initiator affinity, meaning we allocate a
> pg_data_t struct for them and mark them as ONLINE.
>
> So far so good, but the thing is that after ("mm: handle uninitialized numa
> nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(),
> meaning we have a picture like the following:
>
> setup_arch
> x86_numa_init
> numa_init
> numa_register_memblks <-- allocate non-memoryless node
> x86_init.paging.pagetable_init
> ...
> free_area_init
> free_area_init_memoryless <-- allocate memoryless node
> init_cpu_to_node
> alloc_node_data <-- allocate memoryless node with CPU
> free_area_init_memoryless_node
> init_gi_nodes
> alloc_node_data <-- allocate memoryless node with Initiator
> free_area_init_memoryless_node
Thanks for diving into this and double checking after me. I misread the
ordering and thought free_area_init is the last one to be called.
> free_area_init() already allocates all possible NUMA nodes, but
> init_cpu_to_node() and init_gi_nodes() are clueless about that,
> so they go ahead and allocate a new pg_data_t struct without
> checking anything, meaning we end up allocating twice.
>
> It should be made clear that this only happens in the case where a
> memoryless NUMA node happens to have CPU/Initiator affinity.
>
> So get rid of init_memory_less_node() and just set the node online.
>
> Note that setting the node online is needed, otherwise we choke
> down the chain when bringup_nonboot_cpus() ends up calling
> __try_online_node()->register_one_node()->... and we blow up in
> bus_add_device(), as can be seen here:
>
> =========
> [ 0.585060] BUG: kernel NULL pointer dereference, address: 0000000000000060
> [ 0.586930] RIP: 0010:bus_add_device+0x5a/0x140
> [...]
> =========
>
> The reason is simple: by the time bringup_nonboot_cpus() gets called,
> we have not registered the node_subsys bus yet, so we crash when
> bus_add_device() tries to dereference bus->p.
>
> The following shows the order of the calls:
>
> kernel_init_freeable
> smp_init
> bringup_nonboot_cpus
> ...
> bus_add_device() <- we did not register node_subsys yet
> do_basic_setup
> do_initcalls
> postcore_initcall(register_node_type);
> register_node_type
> subsys_system_register
> subsys_register
> bus_register <- register node_subsys bus
>
> Why does setting the node online save us then? Well, simply because
> __try_online_node() backs off when the node is already online, meaning
> we do not end up calling register_one_node() in the first place.
This is really a mess and a house built on sand. Thanks for looking into
it and hopefully this can get cleaned up to a saner state.
> This is subtle, broken, and deserves deep analysis and thought about
> how to put it into shape, but for now let us have this easy fix for
> the memory leak.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
This sha1 is very likely from linux-next, so it won't be persistent.
Please drop it.
Other than that
Acked-by: Michal Hocko <[email protected]>
One nit below
Thanks!
> ---
> arch/x86/mm/numa.c | 15 ++-------------
> include/linux/mm.h | 1 -
> mm/page_alloc.c | 2 +-
> 3 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index c6b1213086d6..37039a6af8ae 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -738,17 +738,6 @@ void __init x86_numa_init(void)
> numa_init(dummy_numa_init);
> }
>
> -static void __init init_memory_less_node(int nid)
> -{
> - /* Allocate and initialize node data. Memory-less node is now online.*/
> - alloc_node_data(nid);
> - free_area_init_memoryless_node(nid);
> -
> - /*
> - * All zonelists will be built later in start_kernel() after per cpu
> - * areas are initialized.
> - */
> -}
>
> /*
> * A node may exist which has one or more Generic Initiators but no CPUs and no
> @@ -768,7 +757,7 @@ void __init init_gi_nodes(void)
>
> for_each_node_state(nid, N_GENERIC_INITIATOR)
> if (!node_online(nid))
> - init_memory_less_node(nid);
> + node_set_online(nid);
I would stick a TODO here.
/*
* Exclude this node from
* bringup_nonboot_cpus
* cpu_up
* __try_online_node
* register_one_node
* because node_subsys is not initialized yet
* TODO remove dependency on node_online()
*/
> }
>
> /*
> @@ -799,7 +788,7 @@ void __init init_cpu_to_node(void)
> continue;
>
> if (!node_online(node))
> - init_memory_less_node(node);
> + node_set_online(node);
and here as well.
--
Michal Hocko
SUSE Labs
On Mon, Feb 21, 2022 at 10:20:02AM +0100, Michal Hocko wrote:
> On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> > Why does setting the node online save us then? Well, simply because
> > __try_online_node() backs off when the node is already online, meaning
> > we do not end up calling register_one_node() in the first place.
>
> This is really a mess and a house built on sand. Thanks for looking into
> it and hopefully this can get cleaned up to a saner state.
Yes, I am willing to have a deep look into that and see how we can
improve the situation.
> This sha1 is from linux-next very likely so it won't be persistent.
> Please drop it.
Yes, it is. I guess it is fine to not have a "Fixes" tag here, so I will
remove it then.
> I would stick a TODO here.
> /*
> * Exclude this node from
> * bringup_nonboot_cpus
> * cpu_up
> * __try_online_node
> * register_one_node
> * because node_subsys is not initialized yet
> * TODO remove dependency on node_online()
> */
Sure, will do.
Thanks!
--
Oscar Salvador
SUSE Labs
On Mon 21-02-22 10:47:44, Oscar Salvador wrote:
> On Mon, Feb 21, 2022 at 10:20:02AM +0100, Michal Hocko wrote:
> > On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> > > Why does setting the node online save us then? Well, simply because
> > > __try_online_node() backs off when the node is already online, meaning
> > > we do not end up calling register_one_node() in the first place.
> >
> > This is really a mess and a house built on sand. Thanks for looking into
> > it and hopefully this can get cleaned up to a saner state.
>
> Yes, I am willing to have a deep look into that and see how we can
> improve the situation.
>
> > This sha1 is from linux-next very likely so it won't be persistent.
> > Please drop it.
>
> Yes, it is. I guess it is fine to not have a "Fixes" tag here, so I will
> remove it then.
Normally we use a sha1 in the Fixes tag, and I am not sure how many
scripts we would confuse if there was none, but I guess it is good
enough to mention the patch name in the description. Theoretically we
could have folded it into my patch, but I think it is better to have it
separate, because a) it gives a nice overview of the mess we are
dealing with and b) the original patch would likely have been more
convoluted than necessary.
--
Michal Hocko
SUSE Labs