2023-02-12 11:03:24

by Qi Zheng

Subject: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On x86, numa_register_memblks() is only interested in
nodes that have enough memory, so it skips over any
node with memory below NODE_MIN_SIZE (such a node is
treated as memoryless). Later on, we still initialize
these memoryless nodes (allocate a pgdat in
free_area_init(), build their zonelists, etc.) and
online them in init_cpu_to_node() and init_gi_nodes().

After boot, these memoryless nodes are in the N_ONLINE
state but not in the N_MEMORY state. However, pages can
still be allocated from them.

SLUB only sets up nodes that are in the N_MEMORY state,
e.g. by allocating their struct kmem_cache_node. So if
SLUB is given a page from one of the memoryless nodes
above, the struct kmem_cache_node of the corresponding
node is NULL, which causes a panic.

For example, if we use qemu to start a kernel with two
NUMA nodes, where one node has 2M of memory (less than
NODE_MIN_SIZE) and the other has 2G, we hit the
following panic:

[ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 0.150783] #PF: supervisor write access in kernel mode
[ 0.151488] #PF: error_code(0x0002) - not-present page
<...>
[ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
<...>
[ 0.169781] Call Trace:
[ 0.170159] <TASK>
[ 0.170448] deactivate_slab+0x187/0x3c0
[ 0.171031] ? bootstrap+0x1b/0x10e
[ 0.171559] ? preempt_count_sub+0x9/0xa0
[ 0.172145] ? kmem_cache_alloc+0x12c/0x440
[ 0.172735] ? bootstrap+0x1b/0x10e
[ 0.173236] bootstrap+0x6b/0x10e
[ 0.173720] kmem_cache_init+0x10a/0x188
[ 0.174240] start_kernel+0x415/0x6ac
[ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
[ 0.175417] </TASK>
[ 0.175713] Modules linked in:
[ 0.176117] CR2: 0000000000000000

In addition, we also hit this panic in an actual
production environment: we set up a 2c2g container
with two NUMA nodes and reserved 128M for kdump, and
then hit the above panic in the kdump kernel.

To fix it, we can filter out memoryless nodes when
allocating pages.

Signed-off-by: Qi Zheng <[email protected]>
Reported-by: Teng Hu <[email protected]>
---
mm/page_alloc.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 588555754601..b9cce56f4e21 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4188,6 +4188,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
(alloc_flags & ALLOC_CPUSET) &&
!__cpuset_zone_allowed(zone, gfp_mask))
continue;
+
+ /* Don't allocate page from memoryless nodes. */
+ if (!node_state((zone_to_nid(zone)), N_MEMORY))
+ continue;
+
/*
* When allocating a page cache page for writing, we
* want to get it from a node that is within its dirty
--
2.20.1



2023-02-13 08:47:53

by Vlastimil Babka

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 2/12/23 12:03, Qi Zheng wrote:
> In x86, numa_register_memblks() is only interested in
> those nodes which have enough memory, so it skips over
> all nodes with memory below NODE_MIN_SIZE (treated as
> a memoryless node). Later on, we will initialize these
> memoryless nodes (allocate pgdat in free_area_init()
> and build zonelist etc), and will online these nodes
> in init_cpu_to_node() and init_gi_nodes().
>
> After boot, these memoryless nodes are in N_ONLINE
> state but not in N_MEMORY state. But we can still allocate
> pages from these memoryless nodes.
>
> In SLUB, we only process nodes in the N_MEMORY state,
> such as allocating their struct kmem_cache_node. So if
> we allocate a page from the memoryless node above to
> SLUB, the struct kmem_cache_node of the node corresponding
> to this page is NULL, which will cause panic.
>
> For example, if we use qemu to start a two numa node kernel,
> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> and the other node has 2G, then we will encounter the
> following panic:
>
> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 0.150783] #PF: supervisor write access in kernel mode
> [ 0.151488] #PF: error_code(0x0002) - not-present page
> <...>
> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> <...>
> [ 0.169781] Call Trace:
> [ 0.170159] <TASK>
> [ 0.170448] deactivate_slab+0x187/0x3c0
> [ 0.171031] ? bootstrap+0x1b/0x10e
> [ 0.171559] ? preempt_count_sub+0x9/0xa0
> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> [ 0.172735] ? bootstrap+0x1b/0x10e
> [ 0.173236] bootstrap+0x6b/0x10e
> [ 0.173720] kmem_cache_init+0x10a/0x188
> [ 0.174240] start_kernel+0x415/0x6ac
> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> [ 0.175417] </TASK>
> [ 0.175713] Modules linked in:
> [ 0.176117] CR2: 0000000000000000
>
> In addition, we can also encountered this panic in the actual
> production environment. We set up a 2c2g container with two
> numa nodes, and then reserved 128M for kdump, and then we
> can encountered the above panic in the kdump kernel.
>
> To fix it, we can filter memoryless nodes when allocating
> pages.
>
> Signed-off-by: Qi Zheng <[email protected]>
> Reported-by: Teng Hu <[email protected]>

Well AFAIK the key mechanism to only allocate from "good" nodes is the
zonelist, we shouldn't need to start putting extra checks like this. So it
seems to me that the code building the zonelists should take the
NODE_MIN_SIZE constraint in mind.

> ---
> mm/page_alloc.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 588555754601..b9cce56f4e21 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4188,6 +4188,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> (alloc_flags & ALLOC_CPUSET) &&
> !__cpuset_zone_allowed(zone, gfp_mask))
> continue;
> +
> + /* Don't allocate page from memoryless nodes. */
> + if (!node_state((zone_to_nid(zone)), N_MEMORY))
> + continue;
> +
> /*
> * When allocating a page cache page for writing, we
> * want to get it from a node that is within its dirty


2023-02-13 11:01:01

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/13 16:47, Vlastimil Babka wrote:
> On 2/12/23 12:03, Qi Zheng wrote:
>> In x86, numa_register_memblks() is only interested in
>> those nodes which have enough memory, so it skips over
>> all nodes with memory below NODE_MIN_SIZE (treated as
>> a memoryless node). Later on, we will initialize these
>> memoryless nodes (allocate pgdat in free_area_init()
>> and build zonelist etc), and will online these nodes
>> in init_cpu_to_node() and init_gi_nodes().
>>
>> After boot, these memoryless nodes are in N_ONLINE
>> state but not in N_MEMORY state. But we can still allocate
>> pages from these memoryless nodes.
>>
>> In SLUB, we only process nodes in the N_MEMORY state,
>> such as allocating their struct kmem_cache_node. So if
>> we allocate a page from the memoryless node above to
>> SLUB, the struct kmem_cache_node of the node corresponding
>> to this page is NULL, which will cause panic.
>>
>> For example, if we use qemu to start a two numa node kernel,
>> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
>> and the other node has 2G, then we will encounter the
>> following panic:
>>
>> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> [ 0.150783] #PF: supervisor write access in kernel mode
>> [ 0.151488] #PF: error_code(0x0002) - not-present page
>> <...>
>> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
>> <...>
>> [ 0.169781] Call Trace:
>> [ 0.170159] <TASK>
>> [ 0.170448] deactivate_slab+0x187/0x3c0
>> [ 0.171031] ? bootstrap+0x1b/0x10e
>> [ 0.171559] ? preempt_count_sub+0x9/0xa0
>> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
>> [ 0.172735] ? bootstrap+0x1b/0x10e
>> [ 0.173236] bootstrap+0x6b/0x10e
>> [ 0.173720] kmem_cache_init+0x10a/0x188
>> [ 0.174240] start_kernel+0x415/0x6ac
>> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
>> [ 0.175417] </TASK>
>> [ 0.175713] Modules linked in:
>> [ 0.176117] CR2: 0000000000000000
>>
>> In addition, we can also encountered this panic in the actual
>> production environment. We set up a 2c2g container with two
>> numa nodes, and then reserved 128M for kdump, and then we
>> can encountered the above panic in the kdump kernel.
>>
>> To fix it, we can filter memoryless nodes when allocating
>> pages.
>>
>> Signed-off-by: Qi Zheng <[email protected]>
>> Reported-by: Teng Hu <[email protected]>
>
> Well AFAIK the key mechanism to only allocate from "good" nodes is the
> zonelist, we shouldn't need to start putting extra checks like this. So it
> seems to me that the code building the zonelists should take the
> NODE_MIN_SIZE constraint in mind.

Indeed. How about the following patch:

@@ -6382,8 +6378,11 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
 	int min_val = INT_MAX;
 	int best_node = NUMA_NO_NODE;

-	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
+	/*
+	 * Use the local node if we haven't already. But for memoryless local
+	 * node, we should skip it and fallback to other nodes.
+	 */
+	if (!node_isset(node, *used_node_mask) && node_state(node, N_MEMORY)) {
 		node_set(node, *used_node_mask);
 		return node;
 	}

For a memoryless node, we skip it and fall back to
other nodes when building its zonelist.

Say we have node0 and node1, and node0 is memoryless, then:

[ 0.102400] Fallback order for Node 0: 1
[ 0.102931] Fallback order for Node 1: 1

In this way, we will not allocate pages from memoryless node0.

>
>> ---
>> mm/page_alloc.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 588555754601..b9cce56f4e21 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4188,6 +4188,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>> (alloc_flags & ALLOC_CPUSET) &&
>> !__cpuset_zone_allowed(zone, gfp_mask))
>> continue;
>> +
>> + /* Don't allocate page from memoryless nodes. */
>> + if (!node_state((zone_to_nid(zone)), N_MEMORY))
>> + continue;
>> +
>> /*
>> * When allocating a page cache page for writing, we
>> * want to get it from a node that is within its dirty
>

--
Thanks,
Qi

2023-02-14 08:42:31

by Vlastimil Babka

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 2/13/23 12:00, Qi Zheng wrote:
>
>
> On 2023/2/13 16:47, Vlastimil Babka wrote:
>> On 2/12/23 12:03, Qi Zheng wrote:
>>> In x86, numa_register_memblks() is only interested in
>>> those nodes which have enough memory, so it skips over
>>> all nodes with memory below NODE_MIN_SIZE (treated as
>>> a memoryless node). Later on, we will initialize these
>>> memoryless nodes (allocate pgdat in free_area_init()
>>> and build zonelist etc), and will online these nodes
>>> in init_cpu_to_node() and init_gi_nodes().
>>>
>>> After boot, these memoryless nodes are in N_ONLINE
>>> state but not in N_MEMORY state. But we can still allocate
>>> pages from these memoryless nodes.
>>>
>>> In SLUB, we only process nodes in the N_MEMORY state,
>>> such as allocating their struct kmem_cache_node. So if
>>> we allocate a page from the memoryless node above to
>>> SLUB, the struct kmem_cache_node of the node corresponding
>>> to this page is NULL, which will cause panic.
>>>
>>> For example, if we use qemu to start a two numa node kernel,
>>> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
>>> and the other node has 2G, then we will encounter the
>>> following panic:
>>>
>>> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>> [ 0.150783] #PF: supervisor write access in kernel mode
>>> [ 0.151488] #PF: error_code(0x0002) - not-present page
>>> <...>
>>> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
>>> <...>
>>> [ 0.169781] Call Trace:
>>> [ 0.170159] <TASK>
>>> [ 0.170448] deactivate_slab+0x187/0x3c0
>>> [ 0.171031] ? bootstrap+0x1b/0x10e
>>> [ 0.171559] ? preempt_count_sub+0x9/0xa0
>>> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
>>> [ 0.172735] ? bootstrap+0x1b/0x10e
>>> [ 0.173236] bootstrap+0x6b/0x10e
>>> [ 0.173720] kmem_cache_init+0x10a/0x188
>>> [ 0.174240] start_kernel+0x415/0x6ac
>>> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
>>> [ 0.175417] </TASK>
>>> [ 0.175713] Modules linked in:
>>> [ 0.176117] CR2: 0000000000000000
>>>
>>> In addition, we can also encountered this panic in the actual
>>> production environment. We set up a 2c2g container with two
>>> numa nodes, and then reserved 128M for kdump, and then we
>>> can encountered the above panic in the kdump kernel.
>>>
>>> To fix it, we can filter memoryless nodes when allocating
>>> pages.
>>>
>>> Signed-off-by: Qi Zheng <[email protected]>
>>> Reported-by: Teng Hu <[email protected]>
>>
>> Well AFAIK the key mechanism to only allocate from "good" nodes is the
>> zonelist, we shouldn't need to start putting extra checks like this. So it
>> seems to me that the code building the zonelists should take the
>> NODE_MIN_SIZE constraint in mind.
>
> Indeed. How about the following patch:

+Cc also David, forgot earlier.

Looks good to me, at least.

> @@ -6382,8 +6378,11 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
> 	int min_val = INT_MAX;
> 	int best_node = NUMA_NO_NODE;
>
> -	/* Use the local node if we haven't already */
> -	if (!node_isset(node, *used_node_mask)) {
> +	/*
> +	 * Use the local node if we haven't already. But for memoryless local
> +	 * node, we should skip it and fallback to other nodes.
> +	 */
> +	if (!node_isset(node, *used_node_mask) && node_state(node, N_MEMORY)) {
> 		node_set(node, *used_node_mask);
> 		return node;
> 	}
>
> For memoryless node, we skip it and fallback to other nodes when
> build its zonelists.
>
> Say we have node0 and node1, and node0 is memoryless, then:
>
> [ 0.102400] Fallback order for Node 0: 1
> [ 0.102931] Fallback order for Node 1: 1
>
> In this way, we will not allocate pages from memoryless node0.
>
>>
>>> ---
>>> mm/page_alloc.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 588555754601..b9cce56f4e21 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4188,6 +4188,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>> (alloc_flags & ALLOC_CPUSET) &&
>>> !__cpuset_zone_allowed(zone, gfp_mask))
>>> continue;
>>> +
>>> + /* Don't allocate page from memoryless nodes. */
>>> + if (!node_state((zone_to_nid(zone)), N_MEMORY))
>>> + continue;
>>> +
>>> /*
>>> * When allocating a page cache page for writing, we
>>> * want to get it from a node that is within its dirty
>>
>


2023-02-14 09:13:17

by Mike Rapoport

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Mon, Feb 13, 2023 at 09:47:43AM +0100, Vlastimil Babka wrote:
> On 2/12/23 12:03, Qi Zheng wrote:
> > In x86, numa_register_memblks() is only interested in
> > those nodes which have enough memory, so it skips over
> > all nodes with memory below NODE_MIN_SIZE (treated as
> > a memoryless node). Later on, we will initialize these
> > memoryless nodes (allocate pgdat in free_area_init()
> > and build zonelist etc), and will online these nodes
> > in init_cpu_to_node() and init_gi_nodes().
> >
> > After boot, these memoryless nodes are in N_ONLINE
> > state but not in N_MEMORY state. But we can still allocate
> > pages from these memoryless nodes.
> >
> > In SLUB, we only process nodes in the N_MEMORY state,
> > such as allocating their struct kmem_cache_node. So if
> > we allocate a page from the memoryless node above to
> > SLUB, the struct kmem_cache_node of the node corresponding
> > to this page is NULL, which will cause panic.
> >
> > For example, if we use qemu to start a two numa node kernel,
> > one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> > and the other node has 2G, then we will encounter the
> > following panic:
> >
> > [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 0.150783] #PF: supervisor write access in kernel mode
> > [ 0.151488] #PF: error_code(0x0002) - not-present page
> > <...>
> > [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> > <...>
> > [ 0.169781] Call Trace:
> > [ 0.170159] <TASK>
> > [ 0.170448] deactivate_slab+0x187/0x3c0
> > [ 0.171031] ? bootstrap+0x1b/0x10e
> > [ 0.171559] ? preempt_count_sub+0x9/0xa0
> > [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> > [ 0.172735] ? bootstrap+0x1b/0x10e
> > [ 0.173236] bootstrap+0x6b/0x10e
> > [ 0.173720] kmem_cache_init+0x10a/0x188
> > [ 0.174240] start_kernel+0x415/0x6ac
> > [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> > [ 0.175417] </TASK>
> > [ 0.175713] Modules linked in:
> > [ 0.176117] CR2: 0000000000000000
> >
> > In addition, we can also encountered this panic in the actual
> > production environment. We set up a 2c2g container with two
> > numa nodes, and then reserved 128M for kdump, and then we
> > can encountered the above panic in the kdump kernel.
> >
> > To fix it, we can filter memoryless nodes when allocating
> > pages.
> >
> > Signed-off-by: Qi Zheng <[email protected]>
> > Reported-by: Teng Hu <[email protected]>
>
> Well AFAIK the key mechanism to only allocate from "good" nodes is the
> zonelist, we shouldn't need to start putting extra checks like this. So it
> seems to me that the code building the zonelists should take the
> NODE_MIN_SIZE constraint in mind.

Why not just drop the memory for nodes with size < NODE_MIN_SIZE
from memblock in the first place?
Then we won't need runtime checks at all.

> > ---
> > mm/page_alloc.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 588555754601..b9cce56f4e21 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4188,6 +4188,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> > (alloc_flags & ALLOC_CPUSET) &&
> > !__cpuset_zone_allowed(zone, gfp_mask))
> > continue;
> > +
> > + /* Don't allocate page from memoryless nodes. */
> > + if (!node_state((zone_to_nid(zone)), N_MEMORY))
> > + continue;
> > +
> > /*
> > * When allocating a page cache page for writing, we
> > * want to get it from a node that is within its dirty
>

--
Sincerely yours,
Mike.

2023-02-14 09:19:01

by David Hildenbrand

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 14.02.23 09:42, Vlastimil Babka wrote:
> On 2/13/23 12:00, Qi Zheng wrote:
>>
>>
>> On 2023/2/13 16:47, Vlastimil Babka wrote:
>>> On 2/12/23 12:03, Qi Zheng wrote:
>>>> In x86, numa_register_memblks() is only interested in
>>>> those nodes which have enough memory, so it skips over
>>>> all nodes with memory below NODE_MIN_SIZE (treated as
>>>> a memoryless node). Later on, we will initialize these
>>>> memoryless nodes (allocate pgdat in free_area_init()
>>>> and build zonelist etc), and will online these nodes
>>>> in init_cpu_to_node() and init_gi_nodes().
>>>>
>>>> After boot, these memoryless nodes are in N_ONLINE
>>>> state but not in N_MEMORY state. But we can still allocate
>>>> pages from these memoryless nodes.
>>>>
>>>> In SLUB, we only process nodes in the N_MEMORY state,
>>>> such as allocating their struct kmem_cache_node. So if
>>>> we allocate a page from the memoryless node above to
>>>> SLUB, the struct kmem_cache_node of the node corresponding
>>>> to this page is NULL, which will cause panic.
>>>>
>>>> For example, if we use qemu to start a two numa node kernel,
>>>> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
>>>> and the other node has 2G, then we will encounter the
>>>> following panic:
>>>>
>>>> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>> [ 0.150783] #PF: supervisor write access in kernel mode
>>>> [ 0.151488] #PF: error_code(0x0002) - not-present page
>>>> <...>
>>>> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
>>>> <...>
>>>> [ 0.169781] Call Trace:
>>>> [ 0.170159] <TASK>
>>>> [ 0.170448] deactivate_slab+0x187/0x3c0
>>>> [ 0.171031] ? bootstrap+0x1b/0x10e
>>>> [ 0.171559] ? preempt_count_sub+0x9/0xa0
>>>> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
>>>> [ 0.172735] ? bootstrap+0x1b/0x10e
>>>> [ 0.173236] bootstrap+0x6b/0x10e
>>>> [ 0.173720] kmem_cache_init+0x10a/0x188
>>>> [ 0.174240] start_kernel+0x415/0x6ac
>>>> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
>>>> [ 0.175417] </TASK>
>>>> [ 0.175713] Modules linked in:
>>>> [ 0.176117] CR2: 0000000000000000
>>>>
>>>> In addition, we can also encountered this panic in the actual
>>>> production environment. We set up a 2c2g container with two
>>>> numa nodes, and then reserved 128M for kdump, and then we
>>>> can encountered the above panic in the kdump kernel.
>>>>
>>>> To fix it, we can filter memoryless nodes when allocating
>>>> pages.
>>>>
>>>> Signed-off-by: Qi Zheng <[email protected]>
>>>> Reported-by: Teng Hu <[email protected]>
>>>
>>> Well AFAIK the key mechanism to only allocate from "good" nodes is the
>>> zonelist, we shouldn't need to start putting extra checks like this. So it
>>> seems to me that the code building the zonelists should take the
>>> NODE_MIN_SIZE constraint in mind.
>>
>> Indeed. How about the following patch:
>
> +Cc also David, forgot earlier.
>
> Looks good to me, at least.
>
>> @@ -6382,8 +6378,11 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
>> 	int min_val = INT_MAX;
>> 	int best_node = NUMA_NO_NODE;
>>
>> -	/* Use the local node if we haven't already */
>> -	if (!node_isset(node, *used_node_mask)) {
>> +	/*
>> +	 * Use the local node if we haven't already. But for memoryless local
>> +	 * node, we should skip it and fallback to other nodes.
>> +	 */
>> +	if (!node_isset(node, *used_node_mask) && node_state(node, N_MEMORY)) {
>> 		node_set(node, *used_node_mask);
>> 		return node;
>> 	}
>>
>> For memoryless node, we skip it and fallback to other nodes when
>> build its zonelists.
>>
>> Say we have node0 and node1, and node0 is memoryless, then:
>>
>> [ 0.102400] Fallback order for Node 0: 1
>> [ 0.102931] Fallback order for Node 1: 1
>>
>> In this way, we will not allocate pages from memoryless node0.
>>

In offline_pages(), we first call build_all_zonelists() and only then
node_states_clear_node()->node_clear_state(node, N_MEMORY);

So at least on the offlining path, I assume we wouldn't detect it
properly yet, and would build a zonelist that contains a
now-memoryless node?

--
Thanks,

David / dhildenb


2023-02-14 09:43:23

by Mike Rapoport

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Tue, Feb 14, 2023 at 10:17:03AM +0100, David Hildenbrand wrote:
> On 14.02.23 09:42, Vlastimil Babka wrote:
> > On 2/13/23 12:00, Qi Zheng wrote:
> > >
> > >
> > > On 2023/2/13 16:47, Vlastimil Babka wrote:
> > > > On 2/12/23 12:03, Qi Zheng wrote:
> > > > > In x86, numa_register_memblks() is only interested in
> > > > > those nodes which have enough memory, so it skips over
> > > > > all nodes with memory below NODE_MIN_SIZE (treated as
> > > > > a memoryless node). Later on, we will initialize these
> > > > > memoryless nodes (allocate pgdat in free_area_init()
> > > > > and build zonelist etc), and will online these nodes
> > > > > in init_cpu_to_node() and init_gi_nodes().
> > > > >
> > > > > After boot, these memoryless nodes are in N_ONLINE
> > > > > state but not in N_MEMORY state. But we can still allocate
> > > > > pages from these memoryless nodes.
> > > > >
> > > > > In SLUB, we only process nodes in the N_MEMORY state,
> > > > > such as allocating their struct kmem_cache_node. So if
> > > > > we allocate a page from the memoryless node above to
> > > > > SLUB, the struct kmem_cache_node of the node corresponding
> > > > > to this page is NULL, which will cause panic.
> > > > >
> > > > > For example, if we use qemu to start a two numa node kernel,
> > > > > one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> > > > > and the other node has 2G, then we will encounter the
> > > > > following panic:
> > > > >
> > > > > [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > > [ 0.150783] #PF: supervisor write access in kernel mode
> > > > > [ 0.151488] #PF: error_code(0x0002) - not-present page
> > > > > <...>
> > > > > [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> > > > > <...>
> > > > > [ 0.169781] Call Trace:
> > > > > [ 0.170159] <TASK>
> > > > > [ 0.170448] deactivate_slab+0x187/0x3c0
> > > > > [ 0.171031] ? bootstrap+0x1b/0x10e
> > > > > [ 0.171559] ? preempt_count_sub+0x9/0xa0
> > > > > [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> > > > > [ 0.172735] ? bootstrap+0x1b/0x10e
> > > > > [ 0.173236] bootstrap+0x6b/0x10e
> > > > > [ 0.173720] kmem_cache_init+0x10a/0x188
> > > > > [ 0.174240] start_kernel+0x415/0x6ac
> > > > > [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> > > > > [ 0.175417] </TASK>
> > > > > [ 0.175713] Modules linked in:
> > > > > [ 0.176117] CR2: 0000000000000000
> > > > >
> > > > > In addition, we can also encountered this panic in the actual
> > > > > production environment. We set up a 2c2g container with two
> > > > > numa nodes, and then reserved 128M for kdump, and then we
> > > > > can encountered the above panic in the kdump kernel.
> > > > >
> > > > > To fix it, we can filter memoryless nodes when allocating
> > > > > pages.
> > > > >
> > > > > Signed-off-by: Qi Zheng <[email protected]>
> > > > > Reported-by: Teng Hu <[email protected]>
> > > >
> > > > Well AFAIK the key mechanism to only allocate from "good" nodes is the
> > > > zonelist, we shouldn't need to start putting extra checks like this. So it
> > > > seems to me that the code building the zonelists should take the
> > > > NODE_MIN_SIZE constraint in mind.
> > >
> > > Indeed. How about the following patch:
> >
> > +Cc also David, forgot earlier.
> >
> > Looks good to me, at least.
> >
> > > @@ -6382,8 +6378,11 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
> > > 	int min_val = INT_MAX;
> > > 	int best_node = NUMA_NO_NODE;
> > >
> > > -	/* Use the local node if we haven't already */
> > > -	if (!node_isset(node, *used_node_mask)) {
> > > +	/*
> > > +	 * Use the local node if we haven't already. But for memoryless local
> > > +	 * node, we should skip it and fallback to other nodes.
> > > +	 */
> > > +	if (!node_isset(node, *used_node_mask) && node_state(node, N_MEMORY)) {
> > > 		node_set(node, *used_node_mask);
> > > 		return node;
> > > 	}
> > >
> > > For memoryless node, we skip it and fallback to other nodes when
> > > build its zonelists.
> > >
> > > Say we have node0 and node1, and node0 is memoryless, then:
> > >
> > > [ 0.102400] Fallback order for Node 0: 1
> > > [ 0.102931] Fallback order for Node 1: 1
> > >
> > > In this way, we will not allocate pages from memoryless node0.
> > >
>
> In offline_pages(), we'll first build_all_zonelists() to then
> node_states_clear_node()->node_clear_state(node, N_MEMORY);
>
> So at least on the offlining path, we wouldn't detect it properly yet I
> assume, and build a zonelist that contains a now-memory-less node?

Another question is what happens if new memory is plugged into a node
that had < NODE_MIN_SIZE of memory and that stops being "memoryless"
after the hotplug.

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2023-02-14 10:14:12

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/14 17:17, David Hildenbrand wrote:
> On 14.02.23 09:42, Vlastimil Babka wrote:
>> On 2/13/23 12:00, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/13 16:47, Vlastimil Babka wrote:
>>>> On 2/12/23 12:03, Qi Zheng wrote:
>>>>> In x86, numa_register_memblks() is only interested in
>>>>> those nodes which have enough memory, so it skips over
>>>>> all nodes with memory below NODE_MIN_SIZE (treated as
>>>>> a memoryless node). Later on, we will initialize these
>>>>> memoryless nodes (allocate pgdat in free_area_init()
>>>>> and build zonelist etc), and will online these nodes
>>>>> in init_cpu_to_node() and init_gi_nodes().
>>>>>
>>>>> After boot, these memoryless nodes are in N_ONLINE
>>>>> state but not in N_MEMORY state. But we can still allocate
>>>>> pages from these memoryless nodes.
>>>>>
>>>>> In SLUB, we only process nodes in the N_MEMORY state,
>>>>> such as allocating their struct kmem_cache_node. So if
>>>>> we allocate a page from the memoryless node above to
>>>>> SLUB, the struct kmem_cache_node of the node corresponding
>>>>> to this page is NULL, which will cause panic.
>>>>>
>>>>> For example, if we use qemu to start a two numa node kernel,
>>>>> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
>>>>> and the other node has 2G, then we will encounter the
>>>>> following panic:
>>>>>
>>>>> [    0.149844] BUG: kernel NULL pointer dereference, address:
>>>>> 0000000000000000
>>>>> [    0.150783] #PF: supervisor write access in kernel mode
>>>>> [    0.151488] #PF: error_code(0x0002) - not-present page
>>>>> <...>
>>>>> [    0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
>>>>> <...>
>>>>> [    0.169781] Call Trace:
>>>>> [    0.170159]  <TASK>
>>>>> [    0.170448]  deactivate_slab+0x187/0x3c0
>>>>> [    0.171031]  ? bootstrap+0x1b/0x10e
>>>>> [    0.171559]  ? preempt_count_sub+0x9/0xa0
>>>>> [    0.172145]  ? kmem_cache_alloc+0x12c/0x440
>>>>> [    0.172735]  ? bootstrap+0x1b/0x10e
>>>>> [    0.173236]  bootstrap+0x6b/0x10e
>>>>> [    0.173720]  kmem_cache_init+0x10a/0x188
>>>>> [    0.174240]  start_kernel+0x415/0x6ac
>>>>> [    0.174738]  secondary_startup_64_no_verify+0xe0/0xeb
>>>>> [    0.175417]  </TASK>
>>>>> [    0.175713] Modules linked in:
>>>>> [    0.176117] CR2: 0000000000000000
>>>>>
>>>>> In addition, we can also encountered this panic in the actual
>>>>> production environment. We set up a 2c2g container with two
>>>>> numa nodes, and then reserved 128M for kdump, and then we
>>>>> can encountered the above panic in the kdump kernel.
>>>>>
>>>>> To fix it, we can filter memoryless nodes when allocating
>>>>> pages.
>>>>>
>>>>> Signed-off-by: Qi Zheng <[email protected]>
>>>>> Reported-by: Teng Hu <[email protected]>
>>>>
>>>> Well AFAIK the key mechanism to only allocate from "good" nodes is the
>>>> zonelist, we shouldn't need to start putting extra checks like this.
>>>> So it
>>>> seems to me that the code building the zonelists should take the
>>>> NODE_MIN_SIZE constraint in mind.
>>>
>>> Indeed. How about the following patch:
>>
>> +Cc also David, forgot earlier.
>>
>> Looks good to me, at least.
>>
>>> @@ -6382,8 +6378,11 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
>>> 	int min_val = INT_MAX;
>>> 	int best_node = NUMA_NO_NODE;
>>>
>>> -	/* Use the local node if we haven't already */
>>> -	if (!node_isset(node, *used_node_mask)) {
>>> +	/*
>>> +	 * Use the local node if we haven't already. But for memoryless local
>>> +	 * node, we should skip it and fallback to other nodes.
>>> +	 */
>>> +	if (!node_isset(node, *used_node_mask) && node_state(node, N_MEMORY)) {
>>> 		node_set(node, *used_node_mask);
>>> 		return node;
>>> 	}
>>>
>>> For memoryless node, we skip it and fallback to other nodes when
>>> build its zonelists.
>>>
>>> Say we have node0 and node1, and node0 is memoryless, then:
>>>
>>> [    0.102400] Fallback order for Node 0: 1
>>> [    0.102931] Fallback order for Node 1: 1
>>>
>>> In this way, we will not allocate pages from memoryless node0.
>>>
>
> In offline_pages(), we'll first build_all_zonelists() to then
> node_states_clear_node()->node_clear_state(node, N_MEMORY);
>
> So at least on the offlining path, we wouldn't detect it properly yet I
> assume, and build a zonelist that contains a now-memory-less node?

Without this patch, it seems that node_states_clear_node() should have
been called before build_all_zonelists()? Otherwise, the node whose
N_MEMORY state is about to be cleared will still appear in the
fallback lists of other nodes.

>

--
Thanks,
Qi

2023-02-14 10:26:20

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/14 17:43, Mike Rapoport wrote:
> On Tue, Feb 14, 2023 at 10:17:03AM +0100, David Hildenbrand wrote:
>> On 14.02.23 09:42, Vlastimil Babka wrote:
>>> On 2/13/23 12:00, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2023/2/13 16:47, Vlastimil Babka wrote:
>>>>> On 2/12/23 12:03, Qi Zheng wrote:
>>>>>> [... patch description and panic log snipped ...]
>>>>>
>>>>> Well AFAIK the key mechanism to only allocate from "good" nodes is the
>>>>> zonelist, we shouldn't need to start putting extra checks like this. So it
>>>>> seems to me that the code building the zonelists should take the
>>>>> NODE_MIN_SIZE constraint in mind.
>>>>
>>>> Indeed. How about the following patch:
>>>
>>> +Cc also David, forgot earlier.
>>>
>>> Looks good to me, at least.
>>>
>>>> [... quoted diff and fallback-order output snipped ...]
>>>>
>>
>> In offline_pages(), we'll first build_all_zonelists() to then
>> node_states_clear_node()->node_clear_state(node, N_MEMORY);
>>
>> So at least on the offlining path, we wouldn't detect it properly yet I
>> assume, and build a zonelist that contains a now-memory-less node?
>
> Another question is what happens if a new memory is plugged into a node
> that had < NODE_MIN_SIZE of memory and after hotplug it stops being
> "memoryless".

Onlining or offlining memory re-calls build_all_zonelists() to
re-establish the zonelists (both the node's own zonelist and those of
other nodes). So a node can stop being "memoryless" automatically.

But in online_pages(), I did not see a check for < NODE_MIN_SIZE.

>
>> --
>> Thanks,
>>
>> David / dhildenb
>>
>

--
Thanks,
Qi

2023-02-14 10:33:49

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/14 17:10, Mike Rapoport wrote:
> On Mon, Feb 13, 2023 at 09:47:43AM +0100, Vlastimil Babka wrote:
>> On 2/12/23 12:03, Qi Zheng wrote:
>>> [... patch description and panic log snipped ...]
>>
>> Well AFAIK the key mechanism to only allocate from "good" nodes is the
>> zonelist, we shouldn't need to start putting extra checks like this. So it
>> seems to me that the code building the zonelists should take the
>> NODE_MIN_SIZE constraint in mind.
>
> Why just not drop the memory for nodes with size < NODE_MIN_SIZE from
> memblock at the first place?

In that case, it seems that no pages from nodes smaller than
NODE_MIN_SIZE would be released to the buddy allocator, so pages from
these nodes would never be allocated and the above panic would be
avoided.

But these nodes would still build zonelists for themselves, which seems
unnecessary?

> Then we won't need runtime checks at all.
>
>>> ---
>>> mm/page_alloc.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 588555754601..b9cce56f4e21 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4188,6 +4188,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>> (alloc_flags & ALLOC_CPUSET) &&
>>> !__cpuset_zone_allowed(zone, gfp_mask))
>>> continue;
>>> +
>>> + /* Don't allocate page from memoryless nodes. */
>>> + if (!node_state((zone_to_nid(zone)), N_MEMORY))
>>> + continue;
>>> +
>>> /*
>>> * When allocating a page cache page for writing, we
>>> * want to get it from a node that is within its dirty
>>
>

--
Thanks,
Qi

2023-02-14 11:24:23

by David Hildenbrand

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 14.02.23 11:26, Qi Zheng wrote:
>
>
> On 2023/2/14 17:43, Mike Rapoport wrote:
>> On Tue, Feb 14, 2023 at 10:17:03AM +0100, David Hildenbrand wrote:
>>> On 14.02.23 09:42, Vlastimil Babka wrote:
>>>> On 2/13/23 12:00, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2023/2/13 16:47, Vlastimil Babka wrote:
>>>>>> On 2/12/23 12:03, Qi Zheng wrote:
>>>>>>> [... patch description and panic log snipped ...]
>>>>>>
>>>>>> Well AFAIK the key mechanism to only allocate from "good" nodes is the
>>>>>> zonelist, we shouldn't need to start putting extra checks like this. So it
>>>>>> seems to me that the code building the zonelists should take the
>>>>>> NODE_MIN_SIZE constraint in mind.
>>>>>
>>>>> Indeed. How about the following patch:
>>>>
>>>> +Cc also David, forgot earlier.
>>>>
>>>> Looks good to me, at least.
>>>>
>>>>> [... quoted diff and fallback-order output snipped ...]
>>>>>
>>>
>>> In offline_pages(), we'll first build_all_zonelists() to then
>>> node_states_clear_node()->node_clear_state(node, N_MEMORY);
>>>
>>> So at least on the offlining path, we wouldn't detect it properly yet I
>>> assume, and build a zonelist that contains a now-memory-less node?
>>
>> Another question is what happens if a new memory is plugged into a node
>> that had < NODE_MIN_SIZE of memory and after hotplug it stops being
>> "memoryless".
>
> Onlining or offlining memory re-calls build_all_zonelists() to
> re-establish the zonelists (both the node's own zonelist and those of
> other nodes). So a node can stop being "memoryless" automatically.
>
> But in online_pages(), I did not see a check for < NODE_MIN_SIZE.

TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
pretty x86 specific thing.

Are we sure we want to get NODE_MIN_SIZE involved?

--
Thanks,

David / dhildenb


2023-02-14 11:27:10

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/14 19:22, David Hildenbrand wrote:
> On 14.02.23 11:26, Qi Zheng wrote:
>>
>>
>> On 2023/2/14 17:43, Mike Rapoport wrote:
>>> On Tue, Feb 14, 2023 at 10:17:03AM +0100, David Hildenbrand wrote:
>>>> On 14.02.23 09:42, Vlastimil Babka wrote:
>>>>> On 2/13/23 12:00, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2023/2/13 16:47, Vlastimil Babka wrote:
>>>>>>> On 2/12/23 12:03, Qi Zheng wrote:
>>>>>>>> [... patch description and panic log snipped ...]
>>>>>>>
>>>>>>> Well AFAIK the key mechanism to only allocate from "good" nodes
>>>>>>> is the
>>>>>>> zonelist, we shouldn't need to start putting extra checks like
>>>>>>> this. So it
>>>>>>> seems to me that the code building the zonelists should take the
>>>>>>> NODE_MIN_SIZE constraint in mind.
>>>>>>
>>>>>> Indeed. How about the following patch:
>>>>>
>>>>> +Cc also David, forgot earlier.
>>>>>
>>>>> Looks good to me, at least.
>>>>>
>>>>>> [... quoted diff and fallback-order output snipped ...]
>>>>>>
>>>>
>>>> In offline_pages(), we'll first build_all_zonelists() to then
>>>> node_states_clear_node()->node_clear_state(node, N_MEMORY);
>>>>
>>>> So at least on the offlining path, we wouldn't detect it properly yet I
>>>> assume, and build a zonelist that contains a now-memory-less node?
>>>
>>> Another question is what happens if a new memory is plugged into a node
>>> that had < NODE_MIN_SIZE of memory and after hotplug it stops being
>>> "memoryless".
>>
>> Onlining or offlining memory re-calls build_all_zonelists() to
>> re-establish the zonelists (both the node's own zonelist and those of
>> other nodes). So a node can stop being "memoryless" automatically.
>>
>> But in online_pages(), I did not see a check for < NODE_MIN_SIZE.
>
> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
> pretty x86 specific thing.
>
> Are we sure we want to get NODE_MIN_SIZE involved?

Maybe add an arch_xxx() to handle it?

>

--
Thanks,
Qi

2023-02-14 11:30:35

by David Hildenbrand

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 14.02.23 12:26, Qi Zheng wrote:
>
>
> On 2023/2/14 19:22, David Hildenbrand wrote:
>> On 14.02.23 11:26, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/14 17:43, Mike Rapoport wrote:
>>>> On Tue, Feb 14, 2023 at 10:17:03AM +0100, David Hildenbrand wrote:
>>>>> On 14.02.23 09:42, Vlastimil Babka wrote:
>>>>>> On 2/13/23 12:00, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2023/2/13 16:47, Vlastimil Babka wrote:
>>>>>>>> On 2/12/23 12:03, Qi Zheng wrote:
>>>>>>>>> [... patch description and panic log snipped ...]
>>>>>>>>
>>>>>>>> Well AFAIK the key mechanism to only allocate from "good" nodes
>>>>>>>> is the
>>>>>>>> zonelist, we shouldn't need to start putting extra checks like
>>>>>>>> this. So it
>>>>>>>> seems to me that the code building the zonelists should take the
>>>>>>>> NODE_MIN_SIZE constraint in mind.
>>>>>>>
>>>>>>> Indeed. How about the following patch:
>>>>>>
>>>>>> +Cc also David, forgot earlier.
>>>>>>
>>>>>> Looks good to me, at least.
>>>>>>
>>>>>>> [... quoted diff and fallback-order output snipped ...]
>>>>>>>
>>>>>
>>>>> In offline_pages(), we'll first build_all_zonelists() to then
>>>>> node_states_clear_node()->node_clear_state(node, N_MEMORY);
>>>>>
>>>>> So at least on the offlining path, we wouldn't detect it properly yet I
>>>>> assume, and build a zonelist that contains a now-memory-less node?
>>>>
>>>> Another question is what happens if a new memory is plugged into a node
>>>> that had < NODE_MIN_SIZE of memory and after hotplug it stops being
>>>> "memoryless".
>>>
>>> Onlining or offlining memory re-calls build_all_zonelists() to
>>> re-establish the zonelists (both the node's own zonelist and those of
>>> other nodes). So a node can stop being "memoryless" automatically.
>>>
>>> But in online_pages(), I did not see a check for < NODE_MIN_SIZE.
>>
>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>> pretty x86 specific thing.
>>
>> Are we sure we want to get NODE_MIN_SIZE involved?
>
> Maybe add an arch_xxx() to handle it?

I still haven't figured out what we want to achieve with NODE_MIN_SIZE
at all. It smells like an arch-specific hack looking at

"Don't confuse VM with a node that doesn't have the minimum amount of
memory"

Why shouldn't mm-core deal with that?

I'd appreciate an explanation of the bigger picture, what the issue is
and what the approach to solve it is (including memory onlining/offlining).

--
Thanks,

David / dhildenb


2023-02-14 11:39:05

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/14 19:29, David Hildenbrand wrote:
> On 14.02.23 12:26, Qi Zheng wrote:
>>
>>
>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>> On 14.02.23 11:26, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2023/2/14 17:43, Mike Rapoport wrote:
>>>>> On Tue, Feb 14, 2023 at 10:17:03AM +0100, David Hildenbrand wrote:
>>>>>> On 14.02.23 09:42, Vlastimil Babka wrote:
>>>>>>> On 2/13/23 12:00, Qi Zheng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2023/2/13 16:47, Vlastimil Babka wrote:
>>>>>>>>> On 2/12/23 12:03, Qi Zheng wrote:
>>>>>>>>>> [... patch description and panic log snipped ...]
>>>>>>>>>
>>>>>>>>> Well AFAIK the key mechanism to only allocate from "good" nodes
>>>>>>>>> is the
>>>>>>>>> zonelist, we shouldn't need to start putting extra checks like
>>>>>>>>> this. So it
>>>>>>>>> seems to me that the code building the zonelists should take the
>>>>>>>>> NODE_MIN_SIZE constraint in mind.
>>>>>>>>
>>>>>>>> Indeed. How about the following patch:
>>>>>>>
>>>>>>> +Cc also David, forgot earlier.
>>>>>>>
>>>>>>> Looks good to me, at least.
>>>>>>>
>>>>>>>> [... quoted diff and fallback-order output snipped ...]
>>>>>>>>
>>>>>>
>>>>>> In offline_pages(), we'll first build_all_zonelists() to then
>>>>>> node_states_clear_node()->node_clear_state(node, N_MEMORY);
>>>>>>
>>>>>> So at least on the offlining path, we wouldn't detect it properly
>>>>>> yet I
>>>>>> assume, and build a zonelist that contains a now-memory-less node?
>>>>>
>>>>> Another question is what happens if new memory is plugged into a
>>>>> node that had < NODE_MIN_SIZE of memory and after hotplug stops
>>>>> being "memoryless".
>>>>
>>>> Onlining or offlining memory re-calls build_all_zonelists() to
>>>> re-establish the zonelists (for the node itself and for the other
>>>> nodes). So a node can stop being "memoryless" automatically.
>>>>
>>>> But I did not see a check of < NODE_MIN_SIZE in online_pages().
>>>
>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>> pretty x86 specific thing.
>>>
>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>
>> Maybe add an arch_xxx() to handle it?
>
> I still haven't figured out what we want to achieve with NODE_MIN_SIZE
> at all. It smells like an arch-specific hack looking at
>
> "Don't confuse VM with a node that doesn't have the minimum amount of
> memory"

I'm also confused about this comment.

I found the patch that originally introduced this:

https://github.com/torvalds/linux/commit/9391a3f9c7f17bdd82adf9a98905450642cc8970

But the commit message didn't explain it very clearly. :(

>
> Why shouldn't mm-core deal with that?
>
> I'd appreciate an explanation of the bigger picture, what the issue is
> and what the approach to solve it is (including memory onlining/offlining).
>

--
Thanks,
Qi

2023-02-14 11:44:27

by Mike Rapoport

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

(added x86 folks)

On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
> On 14.02.23 12:26, Qi Zheng wrote:
> > On 2023/2/14 19:22, David Hildenbrand wrote:
> > >
> > > TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
> > > pretty x86 specific thing.
> > >
> > > Are we sure we want to get NODE_MIN_SIZE involved?
> >
> > Maybe add an arch_xxx() to handle it?
>
> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
> all. It smells like an arch-specific hack looking at
>
> "Don't confuse VM with a node that doesn't have the minimum amount of
> memory"
>
> Why shouldn't mm-core deal with that?

Well, a node with <4M RAM is not very useful and bears all the overhead of
an extra live node.

But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
weird HW configurations just live with this?

> I'd appreciate an explanation of the bigger picture, what the issue is and
> what the approach to solve it is (including memory onlining/offlining).
>
> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2023-02-14 11:49:12

by David Hildenbrand

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 14.02.23 12:44, Mike Rapoport wrote:
> (added x86 folks)
>
> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>> On 14.02.23 12:26, Qi Zheng wrote:
>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>
>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>>> pretty x86 specific thing.
>>>>
>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>
>>> Maybe add an arch_xxx() to handle it?
>>
>> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
>> all. It smells like an arch-specific hack looking at
>>
>> "Don't confuse VM with a node that doesn't have the minimum amount of
>> memory"
>>
>> Why shouldn't mm-core deal with that?
>
> Well, a node with <4M RAM is not very useful and bears all the overhead of
> an extra live node.

And totally not with 4.1M, haha.

I really like the "Might fix boot" in the commit description.

>
> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
> weird HW configurations just live with this?


;)

--
Thanks,

David / dhildenb


2023-02-14 12:00:11

by David Hildenbrand

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 14.02.23 12:48, David Hildenbrand wrote:
> On 14.02.23 12:44, Mike Rapoport wrote:
>> (added x86 folks)
>>
>> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>>> On 14.02.23 12:26, Qi Zheng wrote:
>>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>>
>>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>>>> pretty x86 specific thing.
>>>>>
>>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>>
>>>> Maybe add an arch_xxx() to handle it?
>>>
>>> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
>>> all. It smells like an arch-specific hack looking at
>>>
>>> "Don't confuse VM with a node that doesn't have the minimum amount of
>>> memory"
>>>
>>> Why shouldn't mm-core deal with that?
>>
>> Well, a node with <4M RAM is not very useful and bears all the overhead of
>> an extra live node.
>
> And totally not with 4.1M, haha.
>
> I really like the "Might fix boot" in the commit description.
>
>>
>> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
>> weird HW configurations just live with this?
>
>
> ;)
>

Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
gracefully"), this might be the right thing to do. That commit assumes
that all offline nodes would get the pgdat allocated in
free_area_init(). So that we end up with an allocated pgdat for all
possible nodes. The reasoning IIRC was that we don't care about wasting
memory in weird VM setups.

CCing Michal.

--
Thanks,

David / dhildenb


2023-02-14 12:10:01

by Qi Zheng

Subject: Re: [External] Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/14 19:58, David Hildenbrand wrote:
> On 14.02.23 12:48, David Hildenbrand wrote:
>> On 14.02.23 12:44, Mike Rapoport wrote:
>>> (added x86 folks)
>>>
>>> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>>>> On 14.02.23 12:26, Qi Zheng wrote:
>>>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>>>
>>>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems
>>>>>> to be a
>>>>>> pretty x86 specific thing.
>>>>>>
>>>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>>>
>>>>> Maybe add an arch_xxx() to handle it?
>>>>
>>>> I still haven't figured out what we want to achieve with
>>>> NODE_MIN_SIZE at
>>>> all. It smells like an arch-specific hack looking at
>>>>
>>>> "Don't confuse VM with a node that doesn't have the minimum amount of
>>>> memory"
>>>>
>>>> Why shouldn't mm-core deal with that?
>>>
>>> Well, a node with <4M RAM is not very useful and bears all the
>>> overhead of
>>> an extra live node.
>>
>> And totally not with 4.1M, haha.
>>
>> I really like the "Might fix boot" in the commit description.
>>
>>>
>>> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let
>>> people with
>>> weird HW configurations just live with this?
>>
>>
>> ;)
>>
>
> Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
> gracefully"), this might be the right thing to do. That commit assumes
> that all offline nodes would get the pgdat allocated in
> free_area_init(). So that we end up with an allocated pgdat for all
See also commit 1ca75fa7f19d ("arch/x86/mm/numa: Do not initialize
nodes twice"); its commit message explains the initialization process
more clearly, it may be helpful. :)

> possible nodes. The reasoning IIRC was that we don't care about wasting
> memory in weird VM setups.
>
> CCing Michal.
>

--
Thanks,
Qi

2023-02-14 12:34:27

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/14 19:44, Mike Rapoport wrote:
> (added x86 folks)
>
> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>> On 14.02.23 12:26, Qi Zheng wrote:
>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>
>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>>> pretty x86 specific thing.
>>>>
>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>
>>> Maybe add an arch_xxx() to handle it?
>>
>> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
>> all. It smells like an arch-specific hack looking at
>>
>> "Don't confuse VM with a node that doesn't have the minimum amount of
>> memory"
>>
>> Why shouldn't mm-core deal with that?
>
> Well, a node with <4M RAM is not very useful and bears all the overhead of
> an extra live node.
>
> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
> weird HW configurations just live with this?

Just to sum up: whether or not we deal with '< NODE_MIN_SIZE', IIUC,
the following two places should be modified:

1) we should skip memoryless nodes completely in find_next_best_node():

@@ -6382,8 +6378,11 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
 	int min_val = INT_MAX;
 	int best_node = NUMA_NO_NODE;
 
-	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
+	/*
+	 * Use the local node if we haven't already. But for memoryless local
+	 * node, we should skip it and fallback to other nodes.
+	 */
+	if (!node_isset(node, *used_node_mask) && node_state(node, N_MEMORY)) {
 		node_set(node, *used_node_mask);
 		return node;
 	}

This also fixes the bug mentioned in the commit message.

2) we should call node_states_clear_node() before build_all_zonelists()
in offline_pages():

@@ -1931,12 +1931,12 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	/* reinitialise watermarks and update pcp limits */
 	init_per_zone_wmark_min();
 
+	node_states_clear_node(node, &arg);
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
 		build_all_zonelists(NULL);
 	}
 
-	node_states_clear_node(node, &arg);
 	if (arg.status_change_nid >= 0) {
 		kcompactd_stop(node);
 		kswapd_stop(node);

Otherwise, the node whose N_MEMORY state is about to be cleared will
still be present in the fallback lists of the other nodes.

Right?

Thanks,
Qi

>
>> I'd appreciate an explanation of the bigger picture, what the issue is and
>> what the approach to solve it is (including memory onlining/offlining).
>>
>> --
>> Thanks,
>>
>> David / dhildenb
>>
>

--
Thanks,
Qi

2023-02-14 12:46:47

by Mike Rapoport

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

Now added x86 folks for real :)

The thread starts here:
https://lore.kernel.org/all/[email protected]/

On Tue, Feb 14, 2023 at 01:44:06PM +0200, Mike Rapoport wrote:
> (added x86 folks)
>
> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
> > On 14.02.23 12:26, Qi Zheng wrote:
> > > On 2023/2/14 19:22, David Hildenbrand wrote:
> > > >
> > > > TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
> > > > pretty x86 specific thing.
> > > >
> > > > Are we sure we want to get NODE_MIN_SIZE involved?
> > >
> > > Maybe add an arch_xxx() to handle it?
> >
> > I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
> > all. It smells like an arch-specific hack looking at
> >
> > "Don't confuse VM with a node that doesn't have the minimum amount of
> > memory"
> >
> > Why shouldn't mm-core deal with that?
>
> Well, a node with <4M RAM is not very useful and bears all the overhead of
> an extra live node.
>
> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
> weird HW configurations just live with this?
>
> > I'd appreciate an explanation of the bigger picture, what the issue is and
> > what the approach to solve it is (including memory onlining/offlining).
> >
> > --
> > Thanks,
> >
> > David / dhildenb
> >
>
> --
> Sincerely yours,
> Mike.
>

--
Sincerely yours,
Mike.

2023-02-14 13:39:25

by Michal Hocko

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
> On 14.02.23 12:48, David Hildenbrand wrote:
> > On 14.02.23 12:44, Mike Rapoport wrote:
> > > (added x86 folks)
> > >
> > > On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
> > > > On 14.02.23 12:26, Qi Zheng wrote:
> > > > > On 2023/2/14 19:22, David Hildenbrand wrote:
> > > > > >
> > > > > > TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
> > > > > > pretty x86 specific thing.
> > > > > >
> > > > > > Are we sure we want to get NODE_MIN_SIZE involved?
> > > > >
> > > > > Maybe add an arch_xxx() to handle it?
> > > >
> > > > I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
> > > > all. It smells like an arch-specific hack looking at
> > > >
> > > > "Don't confuse VM with a node that doesn't have the minimum amount of
> > > > memory"
> > > >
> > > > Why shouldn't mm-core deal with that?
> > >
> > > Well, a node with <4M RAM is not very useful and bears all the overhead of
> > > an extra live node.
> >
> > And totally not with 4.1M, haha.
> >
> > I really like the "Might fix boot" in the commit description.
> >
> > >
> > > But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
> > > weird HW configurations just live with this?
> >
> >
> > ;)
> >
>
> Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
> gracefully"), this might be the right thing to do. That commit assumes that
> all offline nodes would get the pgdat allocated in free_area_init(). So that
> we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
> was that we don't care about wasting memory in weird VM setups.

Yes, that is the case indeed. I suspect NODE_MIN_SIZE is a relic of
the past, when some PXM entries were incorrect or fishy. I would just
drop the check and see whether something breaks. Or make those involved
back then remember whether this is addressing something that is still
relevant these days. Even a 5MB node makes very little sense if you ask
me (the memmap is allocated for the whole memory section anyway, and
that is 128MB).
--
Michal Hocko
SUSE Labs

2023-02-15 09:30:44

by Mike Rapoport

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Tue, Feb 14, 2023 at 02:38:44PM +0100, Michal Hocko wrote:
> On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
> > On 14.02.23 12:48, David Hildenbrand wrote:
> > > On 14.02.23 12:44, Mike Rapoport wrote:
> > > > (added x86 folks)
> > > >
> > > > On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
> > > > > On 14.02.23 12:26, Qi Zheng wrote:
> > > > > > On 2023/2/14 19:22, David Hildenbrand wrote:
> > > > > > >
> > > > > > > TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
> > > > > > > pretty x86 specific thing.
> > > > > > >
> > > > > > > Are we sure we want to get NODE_MIN_SIZE involved?
> > > > > >
> > > > > > Maybe add an arch_xxx() to handle it?
> > > > >
> > > > > I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
> > > > > all. It smells like an arch-specific hack looking at
> > > > >
> > > > > "Don't confuse VM with a node that doesn't have the minimum amount of
> > > > > memory"
> > > > >
> > > > > Why shouldn't mm-core deal with that?
> > > >
> > > > Well, a node with <4M RAM is not very useful and bears all the overhead of
> > > > an extra live node.
> > >
> > > And totally not with 4.1M, haha.
> > >
> > > I really like the "Might fix boot" in the commit description.
> > >
> > > >
> > > > But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
> > > > weird HW configurations just live with this?
> > >
> > >
> > > ;)
> > >
> >
> > Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
> > gracefully"), this might be the right thing to do. That commit assumes that
> > all offline nodes would get the pgdat allocated in free_area_init(). So that
> > we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
> > was that we don't care about wasting memory in weird VM setups.
>
> Yes, that is the case indeed. I suspect the NODE_MIN_SIZE is a relict of
> the past when some PXM entries were incorrect or fishy. I would just
> drop the check and see whether something breaks. Or make those involved
> back then remember whether this is addressing something that is relevant
> these days. Even 5MB node makes (as the memmap is allocated for the
> whole memory section anyway and that is 128MB) a very little sense if you ask me.

How about we try this:

From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (IBM)" <[email protected]>
Date: Wed, 15 Feb 2023 11:12:18 +0200
Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size

Qi Zheng reports crashes in a production environment and provides a
simplified example as a reproducer:

For example, if we use qemu to start a two NUMA node kernel,
one of the nodes has 2M memory (less than NODE_MIN_SIZE),
and the other node has 2G, then we will encounter the
following panic:

[ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 0.150783] #PF: supervisor write access in kernel mode
[ 0.151488] #PF: error_code(0x0002) - not-present page
<...>
[ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
<...>
[ 0.169781] Call Trace:
[ 0.170159] <TASK>
[ 0.170448] deactivate_slab+0x187/0x3c0
[ 0.171031] ? bootstrap+0x1b/0x10e
[ 0.171559] ? preempt_count_sub+0x9/0xa0
[ 0.172145] ? kmem_cache_alloc+0x12c/0x440
[ 0.172735] ? bootstrap+0x1b/0x10e
[ 0.173236] bootstrap+0x6b/0x10e
[ 0.173720] kmem_cache_init+0x10a/0x188
[ 0.174240] start_kernel+0x415/0x6ac
[ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
[ 0.175417] </TASK>
[ 0.175713] Modules linked in:
[ 0.176117] CR2: 0000000000000000

The crashes happen because of an inconsistency between the nodemask,
which treats nodes with less than 4MB as memoryless, and the actual
memory fed into the core mm.

Commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
empty node in SRAT parsing"), which introduced the minimal size of a
NUMA node, does not explain why a node size cannot be less than 4MB or
what boot failures this restriction might fix.

Since then a lot has changed and the core mm no longer gets badly
confused by small node sizes.

Drop the limitation for the minimal node size.

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
 arch/x86/include/asm/numa.h | 7 -------
 arch/x86/mm/numa.c          | 7 -------
 2 files changed, 14 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..ef2844d69173 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -12,13 +12,6 @@
 
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
-/*
- * Too small node sizes may confuse the VM badly. Usually they
- * result from BIOS bugs. So dont recognize nodes as standalone
- * NUMA entities that have less than this amount of RAM listed:
- */
-#define NODE_MIN_SIZE (4*1024*1024)
-
 extern int numa_off;
 
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2aadb2019b4f..55e3d895f15c 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		if (start >= end)
 			continue;
 
-		/*
-		 * Don't confuse VM with a node that doesn't have the
-		 * minimum amount of memory:
-		 */
-		if (end && (end - start) < NODE_MIN_SIZE)
-			continue;
-
 		alloc_node_data(nid);
 	}

--
2.35.1


> --
> Michal Hocko
> SUSE Labs

--
Sincerely yours,
Mike.

2023-02-15 09:43:53

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/15 17:30, Mike Rapoport wrote:
> On Tue, Feb 14, 2023 at 02:38:44PM +0100, Michal Hocko wrote:
>> On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
>>> On 14.02.23 12:48, David Hildenbrand wrote:
>>>> On 14.02.23 12:44, Mike Rapoport wrote:
>>>>> (added x86 folks)
>>>>>
>>>>> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>>>>>> On 14.02.23 12:26, Qi Zheng wrote:
>>>>>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>>>>>>> pretty x86 specific thing.
>>>>>>>>
>>>>>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>>>>>
>>>>>>> Maybe add an arch_xxx() to handle it?
>>>>>>
>>>>>> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
>>>>>> all. It smells like an arch-specific hack looking at
>>>>>>
>>>>>> "Don't confuse VM with a node that doesn't have the minimum amount of
>>>>>> memory"
>>>>>>
>>>>>> Why shouldn't mm-core deal with that?
>>>>>
>>>>> Well, a node with <4M RAM is not very useful and bears all the overhead of
>>>>> an extra live node.
>>>>
>>>> And totally not with 4.1M, haha.
>>>>
>>>> I really like the "Might fix boot" in the commit description.
>>>>
>>>>>
>>>>> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
>>>>> weird HW configurations just live with this?
>>>>
>>>>
>>>> ;)
>>>>
>>>
>>> Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
>>> gracefully"), this might be the right thing to do. That commit assumes that
>>> all offline nodes would get the pgdat allocated in free_area_init(). So that
>>> we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
>>> was that we don't care about wasting memory in weird VM setups.
>>
>> Yes, that is the case indeed. I suspect the NODE_MIN_SIZE is a relict of
>> the past when some PXM entries were incorrect or fishy. I would just
>> drop the check and see whether something breaks. Or make those involved
>> back then remember whether this is addressing something that is relevant
>> these days. Even 5MB node makes (as the memmap is allocated for the
>> whole memory section anyway and that is 128MB) a very little sense if you ask me.
>
> How about we try this:

I'm curious how we can test this? I guess no one remembers the
historical background of NODE_MIN_SIZE. :(

>
> From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
> From: "Mike Rapoport (IBM)" <[email protected]>
> Date: Wed, 15 Feb 2023 11:12:18 +0200
> Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
>
> Qi Zheng reports crashes in a production environment and provides a
> simplified example as a reproducer:
>
> For example, if we use qemu to start a two NUMA node kernel,
> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> and the other node has 2G, then we will encounter the
> following panic:
>
> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 0.150783] #PF: supervisor write access in kernel mode
> [ 0.151488] #PF: error_code(0x0002) - not-present page
> <...>
> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> <...>
> [ 0.169781] Call Trace:
> [ 0.170159] <TASK>
> [ 0.170448] deactivate_slab+0x187/0x3c0
> [ 0.171031] ? bootstrap+0x1b/0x10e
> [ 0.171559] ? preempt_count_sub+0x9/0xa0
> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> [ 0.172735] ? bootstrap+0x1b/0x10e
> [ 0.173236] bootstrap+0x6b/0x10e
> [ 0.173720] kmem_cache_init+0x10a/0x188
> [ 0.174240] start_kernel+0x415/0x6ac
> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> [ 0.175417] </TASK>
> [ 0.175713] Modules linked in:
> [ 0.176117] CR2: 0000000000000000
>
> The crashes happen because of inconsistency between nodemask that has
> nodes with less than 4MB as memoryless and the actual memory fed into
> core mm.
>
> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> empty node in SRAT parsing") that introduced minimal size of a NUMA node
> does not explain why a node size cannot be less than 4MB and what boot
> failures this restriction might fix.
>
> Since then a lot has changed and core mm won't confuse badly about small
> node sizes.
>
> Drop the limitation for the minimal node size.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> ---
> arch/x86/include/asm/numa.h | 7 -------
> arch/x86/mm/numa.c | 7 -------
> 2 files changed, 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> index e3bae2b60a0d..ef2844d69173 100644
> --- a/arch/x86/include/asm/numa.h
> +++ b/arch/x86/include/asm/numa.h
> @@ -12,13 +12,6 @@
>
> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>
> -/*
> - * Too small node sizes may confuse the VM badly. Usually they
> - * result from BIOS bugs. So dont recognize nodes as standalone
> - * NUMA entities that have less than this amount of RAM listed:
> - */
> -#define NODE_MIN_SIZE (4*1024*1024)
> -
> extern int numa_off;
>
> /*
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..55e3d895f15c 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> if (start >= end)
> continue;
>
> - /*
> - * Don't confuse VM with a node that doesn't have the
> - * minimum amount of memory:
> - */
> - if (end && (end - start) < NODE_MIN_SIZE)
> - continue;
> -
> alloc_node_data(nid);
> }
>

--
Thanks,
Qi

2023-02-15 09:45:29

by David Hildenbrand

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 15.02.23 10:30, Mike Rapoport wrote:
> On Tue, Feb 14, 2023 at 02:38:44PM +0100, Michal Hocko wrote:
>> On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
>>> On 14.02.23 12:48, David Hildenbrand wrote:
>>>> On 14.02.23 12:44, Mike Rapoport wrote:
>>>>> (added x86 folks)
>>>>>
>>>>> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>>>>>> On 14.02.23 12:26, Qi Zheng wrote:
>>>>>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>>>>>>> pretty x86 specific thing.
>>>>>>>>
>>>>>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>>>>>
>>>>>>> Maybe add an arch_xxx() to handle it?
>>>>>>
>>>>>> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
>>>>>> all. It smells like an arch-specific hack looking at
>>>>>>
>>>>>> "Don't confuse VM with a node that doesn't have the minimum amount of
>>>>>> memory"
>>>>>>
>>>>>> Why shouldn't mm-core deal with that?
>>>>>
>>>>> Well, a node with <4M RAM is not very useful and bears all the overhead of
>>>>> an extra live node.
>>>>
>>>> And totally not with 4.1M, haha.
>>>>
>>>> I really like the "Might fix boot" in the commit description.
>>>>
>>>>>
>>>>> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
>>>>> weird HW configurations just live with this?
>>>>
>>>>
>>>> ;)
>>>>
>>>
>>> Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
>>> gracefully"), this might be the right thing to do. That commit assumes that
>>> all offline nodes would get the pgdat allocated in free_area_init(). So that
>>> we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
>>> was that we don't care about wasting memory in weird VM setups.
>>
>> Yes, that is the case indeed. I suspect the NODE_MIN_SIZE is a relict of
>> the past when some PXM entries were incorrect or fishy. I would just
>> drop the check and see whether something breaks. Or make those involved
>> back then remember whether this is addressing something that is relevant
>> these days. Even 5MB node makes (as the memmap is allocated for the
>> whole memory section anyway and that is 128MB) a very little sense if you ask me.
>
> How about we try this:
>
> From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
> From: "Mike Rapoport (IBM)" <[email protected]>
> Date: Wed, 15 Feb 2023 11:12:18 +0200
> Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
>
> Qi Zheng reports crashes in a production environment and provides a
> simplified example as a reproducer:
>
> For example, if we use qemu to start a two NUMA node kernel,
> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> and the other node has 2G, then we will encounter the
> following panic:
>
> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 0.150783] #PF: supervisor write access in kernel mode
> [ 0.151488] #PF: error_code(0x0002) - not-present page
> <...>
> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> <...>
> [ 0.169781] Call Trace:
> [ 0.170159] <TASK>
> [ 0.170448] deactivate_slab+0x187/0x3c0
> [ 0.171031] ? bootstrap+0x1b/0x10e
> [ 0.171559] ? preempt_count_sub+0x9/0xa0
> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> [ 0.172735] ? bootstrap+0x1b/0x10e
> [ 0.173236] bootstrap+0x6b/0x10e
> [ 0.173720] kmem_cache_init+0x10a/0x188
> [ 0.174240] start_kernel+0x415/0x6ac
> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> [ 0.175417] </TASK>
> [ 0.175713] Modules linked in:
> [ 0.176117] CR2: 0000000000000000
>
> The crashes happen because of inconsistency between nodemask that has
> nodes with less than 4MB as memoryless and the actual memory fed into
> core mm.
>
> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> empty node in SRAT parsing") that introduced minimal size of a NUMA node
> does not explain why a node size cannot be less than 4MB and what boot
> failures this restriction might fix.
>
> Since then a lot has changed and core mm won't confuse badly about small
> node sizes.
>
> Drop the limitation for the minimal node size.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> ---
> arch/x86/include/asm/numa.h | 7 -------
> arch/x86/mm/numa.c | 7 -------
> 2 files changed, 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> index e3bae2b60a0d..ef2844d69173 100644
> --- a/arch/x86/include/asm/numa.h
> +++ b/arch/x86/include/asm/numa.h
> @@ -12,13 +12,6 @@
>
> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>
> -/*
> - * Too small node sizes may confuse the VM badly. Usually they
> - * result from BIOS bugs. So dont recognize nodes as standalone
> - * NUMA entities that have less than this amount of RAM listed:
> - */
> -#define NODE_MIN_SIZE (4*1024*1024)
> -
> extern int numa_off;
>
> /*
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..55e3d895f15c 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> if (start >= end)
> continue;
>
> - /*
> - * Don't confuse VM with a node that doesn't have the
> - * minimum amount of memory:
> - */
> - if (end && (end - start) < NODE_MIN_SIZE)
> - continue;
> -
> alloc_node_data(nid);
> }
>

Hopefully it fixes the issue.

Acked-by: David Hildenbrand <[email protected]>


The 4 MiB looks like the magical MAX_ORDER (and/or pageblock) thingy to
me. I recall that there were issues in the past when memory exposed to
the buddy would only partially cover a pageblock. IIRC, memblock
should already take care not to expose memory to the buddy that is not
aligned to MAX_ORDER boundaries -- correct?

--
Thanks,

David / dhildenb


2023-02-15 10:05:17

by Mike Rapoport

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Wed, Feb 15, 2023 at 10:43:58AM +0100, David Hildenbrand wrote:
> On 15.02.23 10:30, Mike Rapoport wrote:
> > On Tue, Feb 14, 2023 at 02:38:44PM +0100, Michal Hocko wrote:
> > > On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
> > > > On 14.02.23 12:48, David Hildenbrand wrote:
> > > > > On 14.02.23 12:44, Mike Rapoport wrote:
> > > > > > (added x86 folks)
> > > > > >
> > > > > > On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
> > > > > > > On 14.02.23 12:26, Qi Zheng wrote:
> > > > > > > > On 2023/2/14 19:22, David Hildenbrand wrote:
> > > > > > > > >
> > > > > > > > > TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
> > > > > > > > > pretty x86 specific thing.
> > > > > > > > >
> > > > > > > > > Are we sure we want to get NODE_MIN_SIZE involved?
> > > > > > > >
> > > > > > > > Maybe add an arch_xxx() to handle it?
> > > > > > >
> > > > > > > I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
> > > > > > > all. It smells like an arch-specific hack looking at
> > > > > > >
> > > > > > > "Don't confuse VM with a node that doesn't have the minimum amount of
> > > > > > > memory"
> > > > > > >
> > > > > > > Why shouldn't mm-core deal with that?
> > > > > >
> > > > > > Well, a node with <4M RAM is not very useful and bears all the overhead of
> > > > > > an extra live node.
> > > > >
> > > > > And totally not with 4.1M, haha.
> > > > >
> > > > > I really like the "Might fix boot" in the commit description.
> > > > >
> > > > > >
> > > > > > But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
> > > > > > weird HW configurations just live with this?
> > > > >
> > > > >
> > > > > ;)
> > > > >
> > > >
> > > > Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
> > > > gracefully"), this might be the right thing to do. That commit assumes that
> > > > all offline nodes would get the pgdat allocated in free_area_init(). So that
> > > > we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
> > > > was that we don't care about wasting memory in weird VM setups.
> > >
> > > Yes, that is the case indeed. I suspect the NODE_MIN_SIZE is a relict of
> > > the past when some PXM entries were incorrect or fishy. I would just
> > > drop the check and see whether something breaks. Or make those involved
> > > back then remember whether this is addressing something that is relevant
> > > these days. Even 5MB node makes (as the memmap is allocated for the
> > > whole memory section anyway and that is 128MB) a very little sense if you ask me.
> >
> > How about we try this:
> >
> > From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
> > From: "Mike Rapoport (IBM)" <[email protected]>
> > Date: Wed, 15 Feb 2023 11:12:18 +0200
> > Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
> >
> > Qi Zheng reports crashes in a production environment and provides a
> > simplified example as a reproducer:
> >
> > For example, if we use qemu to start a two NUMA node kernel,
> > one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> > and the other node has 2G, then we will encounter the
> > following panic:
> >
> > [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 0.150783] #PF: supervisor write access in kernel mode
> > [ 0.151488] #PF: error_code(0x0002) - not-present page
> > <...>
> > [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> > <...>
> > [ 0.169781] Call Trace:
> > [ 0.170159] <TASK>
> > [ 0.170448] deactivate_slab+0x187/0x3c0
> > [ 0.171031] ? bootstrap+0x1b/0x10e
> > [ 0.171559] ? preempt_count_sub+0x9/0xa0
> > [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> > [ 0.172735] ? bootstrap+0x1b/0x10e
> > [ 0.173236] bootstrap+0x6b/0x10e
> > [ 0.173720] kmem_cache_init+0x10a/0x188
> > [ 0.174240] start_kernel+0x415/0x6ac
> > [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> > [ 0.175417] </TASK>
> > [ 0.175713] Modules linked in:
> > [ 0.176117] CR2: 0000000000000000
> >
> > The crashes happen because of inconsistency between nodemask that has
> > nodes with less than 4MB as memoryless and the actual memory fed into
> > core mm.
> >
> > The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> > empty node in SRAT parsing") that introduced minimal size of a NUMA node
> > does not explain why a node size cannot be less than 4MB and what boot
> > failures this restriction might fix.
> >
> > Since then a lot has changed and core mm won't get badly confused by small
> > node sizes.
> >
> > Drop the limitation for the minimal node size.
> >
> > Link: https://lore.kernel.org/all/[email protected]/
> > Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> > ---
> > arch/x86/include/asm/numa.h | 7 -------
> > arch/x86/mm/numa.c | 7 -------
> > 2 files changed, 14 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> > index e3bae2b60a0d..ef2844d69173 100644
> > --- a/arch/x86/include/asm/numa.h
> > +++ b/arch/x86/include/asm/numa.h
> > @@ -12,13 +12,6 @@
> > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > -/*
> > - * Too small node sizes may confuse the VM badly. Usually they
> > - * result from BIOS bugs. So dont recognize nodes as standalone
> > - * NUMA entities that have less than this amount of RAM listed:
> > - */
> > -#define NODE_MIN_SIZE (4*1024*1024)
> > -
> > extern int numa_off;
> > /*
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 2aadb2019b4f..55e3d895f15c 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > if (start >= end)
> > continue;
> > - /*
> > - * Don't confuse VM with a node that doesn't have the
> > - * minimum amount of memory:
> > - */
> > - if (end && (end - start) < NODE_MIN_SIZE)
> > - continue;
> > -
> > alloc_node_data(nid);
> > }
>
> Hopefully it fixes the issue.
>
> Acked-by: David Hildenbrand <[email protected]>
>
>
> The 4 MiB looks like the magical MAX_ORDER (and/or pageblock) thingy to me.
> I recall that there were issues in the past when memory exposed to the buddy
> would only be partially covering a pageblock. IIRC, memblock should already
> take care to not expose memory to the buddy that is not aligned to MAX_ORDER
> boundaries -- correct?

I don't remember those issues, but memblock does not align memory freed to
the buddy.

Still, this 4MB looks like really old magic that predates memblock; even
SPARSEMEM was experimental back then.

It's possible that the issues were with DISCONTIGMEM or with bootmem.

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2023-02-15 10:08:48

by Mike Rapoport

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Wed, Feb 15, 2023 at 05:41:52PM +0800, Qi Zheng wrote:
>
>
> On 2023/2/15 17:30, Mike Rapoport wrote:
> > On Tue, Feb 14, 2023 at 02:38:44PM +0100, Michal Hocko wrote:
> > > On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
> > > > On 14.02.23 12:48, David Hildenbrand wrote:
> > > > > On 14.02.23 12:44, Mike Rapoport wrote:
> > > > > > (added x86 folks)
> > > > > >
> > > > > > On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
> > > > > > > On 14.02.23 12:26, Qi Zheng wrote:
> > > > > > > > On 2023/2/14 19:22, David Hildenbrand wrote:
> > > > > > > > >
> > > > > > > > > TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
> > > > > > > > > pretty x86 specific thing.
> > > > > > > > >
> > > > > > > > > Are we sure we want to get NODE_MIN_SIZE involved?
> > > > > > > >
> > > > > > > > Maybe add an arch_xxx() to handle it?
> > > > > > >
> > > > > > > I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
> > > > > > > all. It smells like an arch-specific hack looking at
> > > > > > >
> > > > > > > "Don't confuse VM with a node that doesn't have the minimum amount of
> > > > > > > memory"
> > > > > > >
> > > > > > > Why shouldn't mm-core deal with that?
> > > > > >
> > > > > > Well, a node with <4M RAM is not very useful and bears all the overhead of
> > > > > > an extra live node.
> > > > >
> > > > > And totally not with 4.1M, haha.
> > > > >
> > > > > I really like the "Might fix boot" in the commit description.
> > > > >
> > > > > >
> > > > > > But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
> > > > > > weird HW configurations just live with this?
> > > > >
> > > > >
> > > > > ;)
> > > > >
> > > >
> > > > Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
> > > > gracefully"), this might be the right thing to do. That commit assumes that
> > > > all offline nodes would get the pgdat allocated in free_area_init(). So that
> > > > we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
> > > > was that we don't care about wasting memory in weird VM setups.
> > >
> > > Yes, that is the case indeed. I suspect the NODE_MIN_SIZE is a relict of
> > > the past when some PXM entries were incorrect or fishy. I would just
> > > drop the check and see whether something breaks. Or make those involved
> > > back then remember whether this is addressing something that is relevant
> > > these days. Even 5MB node makes (as the memmap is allocated for the
> > > whole memory section anyway and that is 128MB) a very little sense if you ask me.
> >
> > How about we try this:
>
> I'm curious how we can test this? I guess no one remembers the
> historical background of NODE_MIN_SIZE. :(

At the very least we can verify that your setup works fine with this ;-)

Of course we cannot test the exact same configuration that NODE_MIN_SIZE
was supposed to fix, but there was a lot of effort to make core mm
initialization robust to cope with weird memory layouts and I'm quite
confident this won't break anything.

> > From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
> > From: "Mike Rapoport (IBM)" <[email protected]>
> > Date: Wed, 15 Feb 2023 11:12:18 +0200
> > Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
> >
> > Qi Zheng reports crashes in a production environment and provides a
> > simplified example as a reproducer:
> >
> > For example, if we use qemu to start a two NUMA node kernel,
> > one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> > and the other node has 2G, then we will encounter the
> > following panic:
> >
> > [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 0.150783] #PF: supervisor write access in kernel mode
> > [ 0.151488] #PF: error_code(0x0002) - not-present page
> > <...>
> > [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> > <...>
> > [ 0.169781] Call Trace:
> > [ 0.170159] <TASK>
> > [ 0.170448] deactivate_slab+0x187/0x3c0
> > [ 0.171031] ? bootstrap+0x1b/0x10e
> > [ 0.171559] ? preempt_count_sub+0x9/0xa0
> > [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> > [ 0.172735] ? bootstrap+0x1b/0x10e
> > [ 0.173236] bootstrap+0x6b/0x10e
> > [ 0.173720] kmem_cache_init+0x10a/0x188
> > [ 0.174240] start_kernel+0x415/0x6ac
> > [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> > [ 0.175417] </TASK>
> > [ 0.175713] Modules linked in:
> > [ 0.176117] CR2: 0000000000000000
> >
> > The crashes happen because of inconsistency between nodemask that has
> > nodes with less than 4MB as memoryless and the actual memory fed into
> > core mm.
> >
> > The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> > empty node in SRAT parsing") that introduced minimal size of a NUMA node
> > does not explain why a node size cannot be less than 4MB and what boot
> > failures this restriction might fix.
> >
> > Since then a lot has changed and core mm won't get badly confused by small
> > node sizes.
> >
> > Drop the limitation for the minimal node size.
> >
> > Link: https://lore.kernel.org/all/[email protected]/
> > Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> > ---
> > arch/x86/include/asm/numa.h | 7 -------
> > arch/x86/mm/numa.c | 7 -------
> > 2 files changed, 14 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> > index e3bae2b60a0d..ef2844d69173 100644
> > --- a/arch/x86/include/asm/numa.h
> > +++ b/arch/x86/include/asm/numa.h
> > @@ -12,13 +12,6 @@
> > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > -/*
> > - * Too small node sizes may confuse the VM badly. Usually they
> > - * result from BIOS bugs. So dont recognize nodes as standalone
> > - * NUMA entities that have less than this amount of RAM listed:
> > - */
> > -#define NODE_MIN_SIZE (4*1024*1024)
> > -
> > extern int numa_off;
> > /*
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 2aadb2019b4f..55e3d895f15c 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > if (start >= end)
> > continue;
> > - /*
> > - * Don't confuse VM with a node that doesn't have the
> > - * minimum amount of memory:
> > - */
> > - if (end && (end - start) < NODE_MIN_SIZE)
> > - continue;
> > -
> > alloc_node_data(nid);
> > }
>
> --
> Thanks,
> Qi

--
Sincerely yours,
Mike.

2023-02-15 10:12:48

by David Hildenbrand

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On 15.02.23 11:04, Mike Rapoport wrote:
> On Wed, Feb 15, 2023 at 10:43:58AM +0100, David Hildenbrand wrote:
>> On 15.02.23 10:30, Mike Rapoport wrote:
>>> On Tue, Feb 14, 2023 at 02:38:44PM +0100, Michal Hocko wrote:
>>>> On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
>>>>> On 14.02.23 12:48, David Hildenbrand wrote:
>>>>>> On 14.02.23 12:44, Mike Rapoport wrote:
>>>>>>> (added x86 folks)
>>>>>>>
>>>>>>> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>>>>>>>> On 14.02.23 12:26, Qi Zheng wrote:
>>>>>>>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>>>>>>>>> pretty x86 specific thing.
>>>>>>>>>>
>>>>>>>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>>>>>>>
>>>>>>>>> Maybe add an arch_xxx() to handle it?
>>>>>>>>
>>>>>>>> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
>>>>>>>> all. It smells like an arch-specific hack looking at
>>>>>>>>
>>>>>>>> "Don't confuse VM with a node that doesn't have the minimum amount of
>>>>>>>> memory"
>>>>>>>>
>>>>>>>> Why shouldn't mm-core deal with that?
>>>>>>>
>>>>>>> Well, a node with <4M RAM is not very useful and bears all the overhead of
>>>>>>> an extra live node.
>>>>>>
>>>>>> And totally not with 4.1M, haha.
>>>>>>
>>>>>> I really like the "Might fix boot" in the commit description.
>>>>>>
>>>>>>>
>>>>>>> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
>>>>>>> weird HW configurations just live with this?
>>>>>>
>>>>>>
>>>>>> ;)
>>>>>>
>>>>>
>>>>> Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
>>>>> gracefully"), this might be the right thing to do. That commit assumes that
>>>>> all offline nodes would get the pgdat allocated in free_area_init(). So that
>>>>> we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
>>>>> was that we don't care about wasting memory in weird VM setups.
>>>>
>>>> Yes, that is the case indeed. I suspect the NODE_MIN_SIZE is a relict of
>>>> the past when some PXM entries were incorrect or fishy. I would just
>>>> drop the check and see whether something breaks. Or make those involved
>>>> back then remember whether this is addressing something that is relevant
>>>> these days. Even 5MB node makes (as the memmap is allocated for the
>>>> whole memory section anyway and that is 128MB) a very little sense if you ask me.
>>>
>>> How about we try this:
>>>
>>> From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
>>> From: "Mike Rapoport (IBM)" <[email protected]>
>>> Date: Wed, 15 Feb 2023 11:12:18 +0200
>>> Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
>>>
>>> Qi Zheng reports crashes in a production environment and provides a
>>> simplified example as a reproducer:
>>>
>>> For example, if we use qemu to start a two NUMA node kernel,
>>> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
>>> and the other node has 2G, then we will encounter the
>>> following panic:
>>>
>>> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>> [ 0.150783] #PF: supervisor write access in kernel mode
>>> [ 0.151488] #PF: error_code(0x0002) - not-present page
>>> <...>
>>> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
>>> <...>
>>> [ 0.169781] Call Trace:
>>> [ 0.170159] <TASK>
>>> [ 0.170448] deactivate_slab+0x187/0x3c0
>>> [ 0.171031] ? bootstrap+0x1b/0x10e
>>> [ 0.171559] ? preempt_count_sub+0x9/0xa0
>>> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
>>> [ 0.172735] ? bootstrap+0x1b/0x10e
>>> [ 0.173236] bootstrap+0x6b/0x10e
>>> [ 0.173720] kmem_cache_init+0x10a/0x188
>>> [ 0.174240] start_kernel+0x415/0x6ac
>>> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
>>> [ 0.175417] </TASK>
>>> [ 0.175713] Modules linked in:
>>> [ 0.176117] CR2: 0000000000000000
>>>
>>> The crashes happen because of inconsistency between nodemask that has
>>> nodes with less than 4MB as memoryless and the actual memory fed into
>>> core mm.
>>>
>>> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
>>> empty node in SRAT parsing") that introduced minimal size of a NUMA node
>>> does not explain why a node size cannot be less than 4MB and what boot
>>> failures this restriction might fix.
>>>
>>> Since then a lot has changed and core mm won't get badly confused by small
>>> node sizes.
>>>
>>> Drop the limitation for the minimal node size.
>>>
>>> Link: https://lore.kernel.org/all/[email protected]/
>>> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
>>> ---
>>> arch/x86/include/asm/numa.h | 7 -------
>>> arch/x86/mm/numa.c | 7 -------
>>> 2 files changed, 14 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
>>> index e3bae2b60a0d..ef2844d69173 100644
>>> --- a/arch/x86/include/asm/numa.h
>>> +++ b/arch/x86/include/asm/numa.h
>>> @@ -12,13 +12,6 @@
>>> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>>> -/*
>>> - * Too small node sizes may confuse the VM badly. Usually they
>>> - * result from BIOS bugs. So dont recognize nodes as standalone
>>> - * NUMA entities that have less than this amount of RAM listed:
>>> - */
>>> -#define NODE_MIN_SIZE (4*1024*1024)
>>> -
>>> extern int numa_off;
>>> /*
>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>>> index 2aadb2019b4f..55e3d895f15c 100644
>>> --- a/arch/x86/mm/numa.c
>>> +++ b/arch/x86/mm/numa.c
>>> @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>> if (start >= end)
>>> continue;
>>> - /*
>>> - * Don't confuse VM with a node that doesn't have the
>>> - * minimum amount of memory:
>>> - */
>>> - if (end && (end - start) < NODE_MIN_SIZE)
>>> - continue;
>>> -
>>> alloc_node_data(nid);
>>> }
>>
>> Hopefully it fixes the issue.
>>
>> Acked-by: David Hildenbrand <[email protected]>
>>
>>
>> The 4 MiB looks like the magical MAX_ORDER (and/or pageblock) thingy to me.
>> I recall that there were issues in the past when memory exposed to the buddy
>> would only be partially covering a pageblock. IIRC, memblock should already
>> take care to not expose memory to the buddy that is not aligned to MAX_ORDER
>> boundaries -- correct?
>
> I don't remember those issues, but memblock does not align memory freed to
> the buddy.
>
> Still, this 4MB looks like a really old magic that was way before memblock
> and even SPARSEMEM was experimental back then.
>
> It's possible that the issues were with DISCONTIGMEM or with bootmem.

I recall where I stumbled over that:

commit 3c5f2eb9695cd241c9898a01388b19a149d0b7d2
Author: Heiko Carstens <[email protected]>
Date: Tue Jul 14 07:46:40 2020 +0200

s390/mm: avoid trimming to MAX_ORDER

Trimming to MAX_ORDER was originally done in order to avoid to set
HOLES_IN_ZONE, which in turn would enable a quite expensive
pfn_valid() check. pfn_valid() however only checks if a struct page
exists for a given pfn.

With sparsemem vmemmap there are always struct pages, since memmaps
are allocated for whole sections. Therefore remove the HOLES_IN_ZONE
comment and the trimming.

Signed-off-by: Heiko Carstens <[email protected]>

So indeed, it might just be a legacy leftover and with SPARSEMEM it
should all be fine.

--
Thanks,

David / dhildenb


2023-02-15 10:20:34

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes



On 2023/2/15 18:08, Mike Rapoport wrote:
> On Wed, Feb 15, 2023 at 05:41:52PM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/2/15 17:30, Mike Rapoport wrote:
>>> On Tue, Feb 14, 2023 at 02:38:44PM +0100, Michal Hocko wrote:
>>>> On Tue 14-02-23 12:58:39, David Hildenbrand wrote:
>>>>> On 14.02.23 12:48, David Hildenbrand wrote:
>>>>>> On 14.02.23 12:44, Mike Rapoport wrote:
>>>>>>> (added x86 folks)
>>>>>>>
>>>>>>> On Tue, Feb 14, 2023 at 12:29:42PM +0100, David Hildenbrand wrote:
>>>>>>>> On 14.02.23 12:26, Qi Zheng wrote:
>>>>>>>>> On 2023/2/14 19:22, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> TBH, this is the first time I hear of NODE_MIN_SIZE and it seems to be a
>>>>>>>>>> pretty x86 specific thing.
>>>>>>>>>>
>>>>>>>>>> Are we sure we want to get NODE_MIN_SIZE involved?
>>>>>>>>>
>>>>>>>>> Maybe add an arch_xxx() to handle it?
>>>>>>>>
>>>>>>>> I still haven't figured out what we want to achieve with NODE_MIN_SIZE at
>>>>>>>> all. It smells like an arch-specific hack looking at
>>>>>>>>
>>>>>>>> "Don't confuse VM with a node that doesn't have the minimum amount of
>>>>>>>> memory"
>>>>>>>>
>>>>>>>> Why shouldn't mm-core deal with that?
>>>>>>>
>>>>>>> Well, a node with <4M RAM is not very useful and bears all the overhead of
>>>>>>> an extra live node.
>>>>>>
>>>>>> And totally not with 4.1M, haha.
>>>>>>
>>>>>> I really like the "Might fix boot" in the commit description.
>>>>>>
>>>>>>>
>>>>>>> But, hey, why won't we just drop that '< NODE_MIN_SIZE' and let people with
>>>>>>> weird HW configurations just live with this?
>>>>>>
>>>>>>
>>>>>> ;)
>>>>>>
>>>>>
>>>>> Actually, remembering 09f49dca570a ("mm: handle uninitialized numa nodes
>>>>> gracefully"), this might be the right thing to do. That commit assumes that
>>>>> all offline nodes would get the pgdat allocated in free_area_init(). So that
>>>>> we end up with an allocated pgdat for all possible nodes. The reasoning IIRC
>>>>> was that we don't care about wasting memory in weird VM setups.
>>>>
>>>> Yes, that is the case indeed. I suspect the NODE_MIN_SIZE is a relict of
>>>> the past when some PXM entries were incorrect or fishy. I would just
>>>> drop the check and see whether something breaks. Or make those involved
>>>> back then remember whether this is addressing something that is relevant
>>>> these days. Even 5MB node makes (as the memmap is allocated for the
>>>> whole memory section anyway and that is 128MB) a very little sense if you ask me.
>>>
>>> How about we try this:
>>
>> I'm curious how we can test this? I guess no one remembers the
>> historical background of NODE_MIN_SIZE. :(
>
> At the very least we can verify that your setup works fine with this ;-)
>
> Of course we cannot test the exact same configuration that NODE_MIN_SIZE
> was supposed to fix, but there was a lot of effort to make core mm
> initialization robust to cope with weird memory layouts and I'm quite
> confident this won't break anything.

Got it. And for the following patch:
Reviewed-by: Qi Zheng <[email protected]>

In addition, although this patch eliminates the crash problem I mentioned,
I still think it is necessary to modify find_next_best_node() and
offline_pages(), as summarized in my other reply. What do you think?

>
>>> From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
>>> From: "Mike Rapoport (IBM)" <[email protected]>
>>> Date: Wed, 15 Feb 2023 11:12:18 +0200
>>> Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
>>>
>>> Qi Zheng reports crashes in a production environment and provides a
>>> simplified example as a reproducer:
>>>
>>> For example, if we use qemu to start a two NUMA node kernel,
>>> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
>>> and the other node has 2G, then we will encounter the
>>> following panic:
>>>
>>> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>> [ 0.150783] #PF: supervisor write access in kernel mode
>>> [ 0.151488] #PF: error_code(0x0002) - not-present page
>>> <...>
>>> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
>>> <...>
>>> [ 0.169781] Call Trace:
>>> [ 0.170159] <TASK>
>>> [ 0.170448] deactivate_slab+0x187/0x3c0
>>> [ 0.171031] ? bootstrap+0x1b/0x10e
>>> [ 0.171559] ? preempt_count_sub+0x9/0xa0
>>> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
>>> [ 0.172735] ? bootstrap+0x1b/0x10e
>>> [ 0.173236] bootstrap+0x6b/0x10e
>>> [ 0.173720] kmem_cache_init+0x10a/0x188
>>> [ 0.174240] start_kernel+0x415/0x6ac
>>> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
>>> [ 0.175417] </TASK>
>>> [ 0.175713] Modules linked in:
>>> [ 0.176117] CR2: 0000000000000000
>>>
>>> The crashes happen because of inconsistency between nodemask that has
>>> nodes with less than 4MB as memoryless and the actual memory fed into
>>> core mm.
>>>
>>> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
>>> empty node in SRAT parsing") that introduced minimal size of a NUMA node
>>> does not explain why a node size cannot be less than 4MB and what boot
>>> failures this restriction might fix.
>>>
>>> Since then a lot has changed and core mm won't get badly confused by small
>>> node sizes.
>>>
>>> Drop the limitation for the minimal node size.
>>>
>>> Link: https://lore.kernel.org/all/[email protected]/
>>> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
>>> ---
>>> arch/x86/include/asm/numa.h | 7 -------
>>> arch/x86/mm/numa.c | 7 -------
>>> 2 files changed, 14 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
>>> index e3bae2b60a0d..ef2844d69173 100644
>>> --- a/arch/x86/include/asm/numa.h
>>> +++ b/arch/x86/include/asm/numa.h
>>> @@ -12,13 +12,6 @@
>>> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>>> -/*
>>> - * Too small node sizes may confuse the VM badly. Usually they
>>> - * result from BIOS bugs. So dont recognize nodes as standalone
>>> - * NUMA entities that have less than this amount of RAM listed:
>>> - */
>>> -#define NODE_MIN_SIZE (4*1024*1024)
>>> -
>>> extern int numa_off;
>>> /*
>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>>> index 2aadb2019b4f..55e3d895f15c 100644
>>> --- a/arch/x86/mm/numa.c
>>> +++ b/arch/x86/mm/numa.c
>>> @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>> if (start >= end)
>>> continue;
>>> - /*
>>> - * Don't confuse VM with a node that doesn't have the
>>> - * minimum amount of memory:
>>> - */
>>> - if (end && (end - start) < NODE_MIN_SIZE)
>>> - continue;
>>> -
>>> alloc_node_data(nid);
>>> }
>>
>> --
>> Thanks,
>> Qi
>

--
Thanks,
Qi

2023-02-15 16:55:10

by Michal Hocko

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Wed 15-02-23 11:30:19, Mike Rapoport wrote:
[...]
> From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
> From: "Mike Rapoport (IBM)" <[email protected]>
> Date: Wed, 15 Feb 2023 11:12:18 +0200
> Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
>
> Qi Zheng reports crashes in a production environment and provides a
> simplified example as a reproducer:
>
> For example, if we use qemu to start a two NUMA node kernel,
> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> and the other node has 2G, then we will encounter the
> following panic:
>
> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 0.150783] #PF: supervisor write access in kernel mode
> [ 0.151488] #PF: error_code(0x0002) - not-present page
> <...>
> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> <...>
> [ 0.169781] Call Trace:
> [ 0.170159] <TASK>
> [ 0.170448] deactivate_slab+0x187/0x3c0
> [ 0.171031] ? bootstrap+0x1b/0x10e
> [ 0.171559] ? preempt_count_sub+0x9/0xa0
> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> [ 0.172735] ? bootstrap+0x1b/0x10e
> [ 0.173236] bootstrap+0x6b/0x10e
> [ 0.173720] kmem_cache_init+0x10a/0x188
> [ 0.174240] start_kernel+0x415/0x6ac
> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> [ 0.175417] </TASK>
> [ 0.175713] Modules linked in:
> [ 0.176117] CR2: 0000000000000000
>
> The crashes happen because of inconsistency between nodemask that has
> nodes with less than 4MB as memoryless and the actual memory fed into
> core mm.
>
> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> empty node in SRAT parsing") that introduced minimal size of a NUMA node
> does not explain why a node size cannot be less than 4MB and what boot
> failures this restriction might fix.
>
> Since then a lot has changed and core mm won't get badly confused by small
> node sizes.
>
> Drop the limitation for the minimal node size.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>

Yes, I would start with this. If the original reasoning to have a limit
still exists then we would need to have a closer look but right now I
would much rather drop this unexplained subtlety. If we hit the issue we
would get a more specific description at least.

Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> arch/x86/include/asm/numa.h | 7 -------
> arch/x86/mm/numa.c | 7 -------
> 2 files changed, 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> index e3bae2b60a0d..ef2844d69173 100644
> --- a/arch/x86/include/asm/numa.h
> +++ b/arch/x86/include/asm/numa.h
> @@ -12,13 +12,6 @@
>
> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>
> -/*
> - * Too small node sizes may confuse the VM badly. Usually they
> - * result from BIOS bugs. So dont recognize nodes as standalone
> - * NUMA entities that have less than this amount of RAM listed:
> - */
> -#define NODE_MIN_SIZE (4*1024*1024)
> -
> extern int numa_off;
>
> /*
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..55e3d895f15c 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> if (start >= end)
> continue;
>
> - /*
> - * Don't confuse VM with a node that doesn't have the
> - * minimum amount of memory:
> - */
> - if (end && (end - start) < NODE_MIN_SIZE)
> - continue;
> -
> alloc_node_data(nid);
> }
>
> --
> 2.35.1
>
>
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Sincerely yours,
> Mike.

--
Michal Hocko
SUSE Labs

2023-10-16 04:11:01

by Qi Zheng

Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

Hi Mike,

On 2023/2/15 17:30, Mike Rapoport wrote:
[...]
>
> How about we try this:
>
> From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
> From: "Mike Rapoport (IBM)" <[email protected]>
> Date: Wed, 15 Feb 2023 11:12:18 +0200
> Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
>
> Qi Zheng reports crashes in a production environment and provides a
> simplified example as a reproducer:
>
> For example, if we use qemu to start a two NUMA node kernel,
> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> and the other node has 2G, then we will encounter the
> following panic:
>
> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 0.150783] #PF: supervisor write access in kernel mode
> [ 0.151488] #PF: error_code(0x0002) - not-present page
> <...>
> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> <...>
> [ 0.169781] Call Trace:
> [ 0.170159] <TASK>
> [ 0.170448] deactivate_slab+0x187/0x3c0
> [ 0.171031] ? bootstrap+0x1b/0x10e
> [ 0.171559] ? preempt_count_sub+0x9/0xa0
> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> [ 0.172735] ? bootstrap+0x1b/0x10e
> [ 0.173236] bootstrap+0x6b/0x10e
> [ 0.173720] kmem_cache_init+0x10a/0x188
> [ 0.174240] start_kernel+0x415/0x6ac
> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> [ 0.175417] </TASK>
> [ 0.175713] Modules linked in:
> [ 0.176117] CR2: 0000000000000000
>
> The crashes happen because of an inconsistency between the nodemask,
> which marks nodes with less than 4MB as memoryless, and the actual
> memory fed into core mm.
>
> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> empty node in SRAT parsing") that introduced the minimal NUMA node size
> does not explain why a node cannot be smaller than 4MB or what boot
> failures this restriction might fix.
>
> Since then a lot has changed and core mm is no longer badly confused by
> small node sizes.
>
> Drop the limitation for the minimal node size.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> ---
> arch/x86/include/asm/numa.h | 7 -------
> arch/x86/mm/numa.c | 7 -------
> 2 files changed, 14 deletions(-)

What's the current status of this patch? It doesn't seem to have been
merged into any tree. Did I miss something?

Thanks,
Qi

>
> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> index e3bae2b60a0d..ef2844d69173 100644
> --- a/arch/x86/include/asm/numa.h
> +++ b/arch/x86/include/asm/numa.h
> @@ -12,13 +12,6 @@
>
> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>
> -/*
> - * Too small node sizes may confuse the VM badly. Usually they
> - * result from BIOS bugs. So dont recognize nodes as standalone
> - * NUMA entities that have less than this amount of RAM listed:
> - */
> -#define NODE_MIN_SIZE (4*1024*1024)
> -
> extern int numa_off;
>
> /*
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..55e3d895f15c 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> if (start >= end)
> continue;
>
> - /*
> - * Don't confuse VM with a node that doesn't have the
> - * minimum amount of memory:
> - */
> - if (end && (end - start) < NODE_MIN_SIZE)
> - continue;
> -
> alloc_node_data(nid);
> }
>
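
For reference, the two-node layout from the reproducer (one 2M node, one
2G node) can be set up with a QEMU invocation along these lines. This is
a sketch only: the exact flags and the kernel image path are assumptions
that may need adjusting for your QEMU version and build.

```sh
# Hypothetical reproducer setup; bzImage is a placeholder for your kernel.
# Note: -m must equal the sum of the per-node backend sizes
# (2M + 2048M = 2050M here), or QEMU will refuse to start.
qemu-system-x86_64 \
    -m 2050M -smp 2 \
    -kernel bzImage -append "console=ttyS0" -nographic \
    -object memory-backend-ram,id=mem0,size=2M \
    -object memory-backend-ram,id=mem1,size=2G \
    -numa node,nodeid=0,memdev=mem0,cpus=0 \
    -numa node,nodeid=1,memdev=mem1,cpus=1
```

With NODE_MIN_SIZE in place, node 0 above is skipped by
numa_register_memblks() yet still onlined later, which is the
inconsistency the quoted patch removes.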

2023-10-17 06:12:41

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH] mm: page_alloc: don't allocate page from memoryless nodes

On Mon, Oct 16, 2023 at 12:09:42PM +0800, Qi Zheng wrote:
> Hi Mike,
>
> On 2023/2/15 17:30, Mike Rapoport wrote:
> [...]
> >
> > How about we try this:
> >
> > From b670120bcacd3fe34a40d7179c70ca2ab69279e0 Mon Sep 17 00:00:00 2001
> > From: "Mike Rapoport (IBM)" <[email protected]>
> > Date: Wed, 15 Feb 2023 11:12:18 +0200
> > Subject: [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size
> >
> > Qi Zheng reports crashes in a production environment and provides a
> > simplified example as a reproducer:
> >
> > For example, if we use qemu to start a two NUMA node kernel,
> > one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> > and the other node has 2G, then we will encounter the
> > following panic:
> >
> > [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 0.150783] #PF: supervisor write access in kernel mode
> > [ 0.151488] #PF: error_code(0x0002) - not-present page
> > <...>
> > [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> > <...>
> > [ 0.169781] Call Trace:
> > [ 0.170159] <TASK>
> > [ 0.170448] deactivate_slab+0x187/0x3c0
> > [ 0.171031] ? bootstrap+0x1b/0x10e
> > [ 0.171559] ? preempt_count_sub+0x9/0xa0
> > [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> > [ 0.172735] ? bootstrap+0x1b/0x10e
> > [ 0.173236] bootstrap+0x6b/0x10e
> > [ 0.173720] kmem_cache_init+0x10a/0x188
> > [ 0.174240] start_kernel+0x415/0x6ac
> > [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> > [ 0.175417] </TASK>
> > [ 0.175713] Modules linked in:
> > [ 0.176117] CR2: 0000000000000000
> >
> > The crashes happen because of an inconsistency between the nodemask,
> > which marks nodes with less than 4MB as memoryless, and the actual
> > memory fed into core mm.
> >
> > The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> > empty node in SRAT parsing") that introduced the minimal NUMA node size
> > does not explain why a node cannot be smaller than 4MB or what boot
> > failures this restriction might fix.
> >
> > Since then a lot has changed and core mm is no longer badly confused by
> > small node sizes.
> >
> > Drop the limitation for the minimal node size.
> >
> > Link: https://lore.kernel.org/all/[email protected]/
> > Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> > ---
> > arch/x86/include/asm/numa.h | 7 -------
> > arch/x86/mm/numa.c | 7 -------
> > 2 files changed, 14 deletions(-)
>
> What's the current status of this patch? It doesn't seem to have been
> merged into any tree. Did I miss something?

Looks like it fell between the cracks. I'll resend.

> Thanks,
> Qi
>

--
Sincerely yours,
Mike.