2018-08-01 20:05:31

by Jeremy Linton

Subject: [RFC 0/2] harden alloc_pages against bogus nid

The thread "avoid alloc memory on offline node"

https://lkml.org/lkml/2018/6/7/251

asked at one point why kzalloc_node was crashing rather than
returning memory from a valid node. The thread ended up fixing
the immediate causes of the crash but left open the case of bad
proximity values being in DSDT tables without corresponding
SRAT/SLIT entries, as is happening on another machine.

It's also easy to fix that, but we should also harden the allocator
sufficiently that it doesn't crash when passed an invalid node ID.
There are a couple of possible ways to do this, and I've attached two
separate patches which individually fix that problem.

The first detects the offline node before taking the new_slab code
path, once it becomes apparent that the allocation isn't going to
succeed. The second actually hardens node_zonelist() and
prepare_alloc_pages() against NODE_DATA(nid) returning NULL, and hence
against a NULL zonelist. This latter case happens if the node has never
been initialized or is possibly out of range. There are other places
(NODE_DATA & online_node) which should be checking whether node IDs
are >= MAX_NUMNODES.
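For illustration, a range-checked NODE_DATA() along the lines patch 2/2
suggests might look like this sketch (flat node_data[] case; the name
and the unsigned cast are illustrative, not part of these patches):

/* Sketch only: reject out-of-range IDs (including NUMA_NO_NODE,
 * which is -1 and becomes large when cast to unsigned) instead of
 * indexing node_data[] out of bounds. Callers must handle NULL. */
#define NODE_DATA_CHECKED(nid) \
	((unsigned int)(nid) < MAX_NUMNODES ? node_data[(nid)] : NULL)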

Jeremy Linton (2):
slub: Avoid trying to allocate memory on offline nodes
mm: harden alloc_pages code paths against bogus nodes

include/linux/gfp.h | 2 ++
mm/page_alloc.c | 2 ++
mm/slub.c | 2 ++
3 files changed, 6 insertions(+)

--
2.14.3



2018-08-01 20:05:32

by Jeremy Linton

Subject: [RFC 2/2] mm: harden alloc_pages code paths against bogus nodes

It's possible to crash __alloc_pages_nodemask by passing it
bogus node IDs. This is caused by NODE_DATA() returning NULL
(hopefully) when the requested node is offline. We can
harden against the basic case of a mostly-valid node that
isn't online by checking for NULL and failing prepare_alloc_pages.

But this then suggests we should also harden NODE_DATA() like this:

#define NODE_DATA(nid) ( (nid) < MAX_NUMNODES ? node_data[(nid)] : NULL)

Eventually this starts to add a bunch of generally unneeded checks
in some code paths that are called quite frequently.

Signed-off-by: Jeremy Linton <[email protected]>
---
include/linux/gfp.h | 2 ++
mm/page_alloc.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a6afcec53795..17d70271c42e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -436,6 +436,8 @@ static inline int gfp_zonelist(gfp_t flags)
*/
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
+ if (unlikely(!NODE_DATA(nid))) //VM_WARN_ON?
+ return NULL;
return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4be74e..3a3d9ac2662a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4306,6 +4306,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
{
ac->high_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
+ if (!ac->zonelist)
+ return false;
ac->nodemask = nodemask;
ac->migratetype = gfpflags_to_migratetype(gfp_mask);

--
2.14.3


2018-08-01 20:06:21

by Jeremy Linton

Subject: [RFC 1/2] slub: Avoid trying to allocate memory on offline nodes

If a user calls the *alloc_node() functions with an invalid node,
it's possible to crash in __alloc_pages_nodemask because NODE_DATA()
returns a bad pgdat, which propagates into the node zonelist in
prepare_alloc_pages. Avoid that by not trying to allocate
new slabs against offline nodes.

(example backtrace)

__alloc_pages_nodemask+0x128/0xf48
allocate_slab+0x94/0x528
new_slab+0x68/0xc8
___slab_alloc+0x44c/0x520
__slab_alloc+0x50/0x68
kmem_cache_alloc_node_trace+0xe0/0x230

Signed-off-by: Jeremy Linton <[email protected]>
---
mm/slub.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 51258eff4178..e03719bac1e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2519,6 +2519,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(!node_match(page, searchnode))) {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
+ if (!node_online(searchnode))
+ node = NUMA_NO_NODE;
goto new_slab;
}
}
--
2.14.3


2018-08-01 21:51:41

by Andrew Morton

Subject: Re: [RFC 0/2] harden alloc_pages against bogus nid

On Wed, 1 Aug 2018 15:04:16 -0500 Jeremy Linton <[email protected]> wrote:

> The thread "avoid alloc memory on offline node"
>
> https://lkml.org/lkml/2018/6/7/251
>
> asked at one point why kzalloc_node was crashing rather than
> returning memory from a valid node. The thread ended up fixing
> the immediate causes of the crash but left open the case of bad
> proximity values being in DSDT tables without corresponding
> SRAT/SLIT entries, as is happening on another machine.
>
> It's also easy to fix that, but we should also harden the allocator
> sufficiently that it doesn't crash when passed an invalid node ID.
> There are a couple of possible ways to do this, and I've attached two
> separate patches which individually fix that problem.
>
> The first detects the offline node before taking the new_slab code
> path, once it becomes apparent that the allocation isn't going to
> succeed. The second actually hardens node_zonelist() and
> prepare_alloc_pages() against NODE_DATA(nid) returning NULL, and hence
> against a NULL zonelist. This latter case happens if the node has never
> been initialized or is possibly out of range. There are other places
> (NODE_DATA & online_node) which should be checking whether node IDs
> are >= MAX_NUMNODES.
>

What is it that leads to a caller requesting memory from an invalid
node? A race against offlining? If so then that's a lack of
appropriate locking, isn't it?

I don't see a problem with emitting a warning and then selecting a
different node so we can keep running. But we do want that warning, so
we can understand the root cause and fix it?


2018-08-01 22:58:16

by Jeremy Linton

Subject: Re: [RFC 0/2] harden alloc_pages against bogus nid

Hi,

On 08/01/2018 04:50 PM, Andrew Morton wrote:
> On Wed, 1 Aug 2018 15:04:16 -0500 Jeremy Linton <[email protected]> wrote:
>
>> The thread "avoid alloc memory on offline node"
>>
>> https://lkml.org/lkml/2018/6/7/251
>>
>> asked at one point why kzalloc_node was crashing rather than
>> returning memory from a valid node. The thread ended up fixing
>> the immediate causes of the crash but left open the case of bad
>> proximity values being in DSDT tables without corresponding
>> SRAT/SLIT entries, as is happening on another machine.
>>
>> It's also easy to fix that, but we should also harden the allocator
>> sufficiently that it doesn't crash when passed an invalid node ID.
>> There are a couple of possible ways to do this, and I've attached two
>> separate patches which individually fix that problem.
>>
>> The first detects the offline node before taking the new_slab code
>> path, once it becomes apparent that the allocation isn't going to
>> succeed. The second actually hardens node_zonelist() and
>> prepare_alloc_pages() against NODE_DATA(nid) returning NULL, and hence
>> against a NULL zonelist. This latter case happens if the node has never
>> been initialized or is possibly out of range. There are other places
>> (NODE_DATA & online_node) which should be checking whether node IDs
>> are >= MAX_NUMNODES.
>>
>
> What is it that leads to a caller requesting memory from an invalid
> node? A race against offlining? If so then that's a lack of
> appropriate locking, isn't it?

There were a couple of unrelated cases, both having to do with the PXM
associated with a PCI port. In the first case, AFAIK, the domain wasn't
really invalid; it would have been fine had the entire SRAT been parsed
and nodes created even when there weren't associated CPUs. The second
case (a different machine) is simply a PXM value that is completely
invalid (no associated SLIT/SRAT/etc entries) due to firmware making a
mistake when a socket isn't populated.

There have been a few other suggested or merged patches for the
individual problems above; this set is just an attempt at avoiding a
full crash if/when another similar problem happens.


>
> I don't see a problem with emitting a warning and then selecting a
> different node so we can keep running. But we do want that warning, so
> we can understand the root cause and fix it?

Yes, we do want to know when an invalid ID is passed; I will add the
VM_WARN in the first one.
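For reference, with the warning added the first patch might look
something like this sketch (the exact warning flavor is still open):

	if (unlikely(!node_match(page, searchnode))) {
		stat(s, ALLOC_NODE_MISMATCH);
		deactivate_slab(s, page, c->freelist, c);
		if (unlikely(!node_online(searchnode))) {
			/* flag the bogus caller so it can be fixed */
			VM_WARN_ON(1);
			node = NUMA_NO_NODE;
		}
		goto new_slab;
	}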

I wasn't sure about the second one, as failing prepare_alloc_pages()
generates a couple of error messages, but the system then continues
operating.

I guess my question though is which method (or both/something else?) is
the preferred way to harden this up?

Thanks for looking at this.


2018-08-02 00:15:44

by Andrew Morton

Subject: Re: [RFC 0/2] harden alloc_pages against bogus nid

On Wed, 1 Aug 2018 17:56:46 -0500 Jeremy Linton <[email protected]> wrote:

> Hi,
>
> On 08/01/2018 04:50 PM, Andrew Morton wrote:
> > On Wed, 1 Aug 2018 15:04:16 -0500 Jeremy Linton <[email protected]> wrote:
> >
> >> The thread "avoid alloc memory on offline node"
> >>
> >> https://lkml.org/lkml/2018/6/7/251
> >>
> >> asked at one point why kzalloc_node was crashing rather than
> >> returning memory from a valid node. The thread ended up fixing
> >> the immediate causes of the crash but left open the case of bad
> >> proximity values being in DSDT tables without corresponding
> >> SRAT/SLIT entries, as is happening on another machine.
> >>
> >> It's also easy to fix that, but we should also harden the allocator
> >> sufficiently that it doesn't crash when passed an invalid node ID.
> >> There are a couple of possible ways to do this, and I've attached two
> >> separate patches which individually fix that problem.
> >>
> >> The first detects the offline node before taking the new_slab code
> >> path, once it becomes apparent that the allocation isn't going to
> >> succeed. The second actually hardens node_zonelist() and
> >> prepare_alloc_pages() against NODE_DATA(nid) returning NULL, and hence
> >> against a NULL zonelist. This latter case happens if the node has never
> >> been initialized or is possibly out of range. There are other places
> >> (NODE_DATA & online_node) which should be checking whether node IDs
> >> are >= MAX_NUMNODES.
> >>
> >
> > What is it that leads to a caller requesting memory from an invalid
> > node? A race against offlining? If so then that's a lack of
> > appropriate locking, isn't it?
>
> There were a couple of unrelated cases, both having to do with the PXM
> associated with a PCI port. In the first case, AFAIK, the domain wasn't
> really invalid; it would have been fine had the entire SRAT been parsed
> and nodes created even when there weren't associated CPUs. The second
> case (a different machine) is simply a PXM value that is completely
> invalid (no associated SLIT/SRAT/etc entries) due to firmware making a
> mistake when a socket isn't populated.
>
> There have been a few other suggested or merged patches for the
> individual problems above; this set is just an attempt at avoiding a
> full crash if/when another similar problem happens.

Please add the above info to the changelog.

>
> >
> > I don't see a problem with emitting a warning and then selecting a
> > different node so we can keep running. But we do want that warning, so
> > we can understand the root cause and fix it?
>
> Yes, we do want to know when an invalid ID is passed; I will add the
> VM_WARN in the first one.
>
> I wasn't sure about the second one, as failing prepare_alloc_pages()
> generates a couple of error messages, but the system then continues
> operating.
>
> I guess my question though is which method (or both/something else?) is
> the preferred way to harden this up?

The first patch looked neater. Can we get a WARN_ON in there as well?

2018-08-02 07:32:54

by Michal Hocko

Subject: Re: [RFC 2/2] mm: harden alloc_pages code paths against bogus nodes

On Wed 01-08-18 15:04:18, Jeremy Linton wrote:
> It's possible to crash __alloc_pages_nodemask by passing it
> bogus node IDs. This is caused by NODE_DATA() returning NULL
> (hopefully) when the requested node is offline. We can
> harden against the basic case of a mostly-valid node that
> isn't online by checking for NULL and failing prepare_alloc_pages.
>
> But this then suggests we should also harden NODE_DATA() like this:
>
> #define NODE_DATA(nid) ( (nid) < MAX_NUMNODES ? node_data[(nid)] : NULL)
>
> Eventually this starts to add a bunch of generally unneeded checks
> in some code paths that are called quite frequently.

But the page allocator is really a hot path and people will not be happy
to have yet another branch there. No code should really use invalid NUMA
node IDs in the first place.

If I remember those bugs correctly then it was the arch code which was
doing something wrong. I would prefer that code to be fixed instead.

> Signed-off-by: Jeremy Linton <[email protected]>
> ---
> include/linux/gfp.h | 2 ++
> mm/page_alloc.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index a6afcec53795..17d70271c42e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -436,6 +436,8 @@ static inline int gfp_zonelist(gfp_t flags)
> */
> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> {
> + if (unlikely(!NODE_DATA(nid))) //VM_WARN_ON?
> + return NULL;
> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a790ef4be74e..3a3d9ac2662a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4306,6 +4306,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> {
> ac->high_zoneidx = gfp_zone(gfp_mask);
> ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> + if (!ac->zonelist)
> + return false;
> ac->nodemask = nodemask;
> ac->migratetype = gfpflags_to_migratetype(gfp_mask);
>
> --
> 2.14.3
>

--
Michal Hocko
SUSE Labs

2018-08-02 09:18:18

by Michal Hocko

Subject: Re: [RFC 1/2] slub: Avoid trying to allocate memory on offline nodes

On Wed 01-08-18 15:04:17, Jeremy Linton wrote:
[...]
> @@ -2519,6 +2519,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> if (unlikely(!node_match(page, searchnode))) {
> stat(s, ALLOC_NODE_MISMATCH);
> deactivate_slab(s, page, c->freelist, c);
> + if (!node_online(searchnode))
> + node = NUMA_NO_NODE;
> goto new_slab;

This is inherently racy. A NUMA node can go offline at any point after
you check it here. Making it race free would involve some sort of
locking, and I am not really convinced this is a good idea.
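Schematically, the window being described (assuming memory hotplug is
in play):

/*
 *   ___slab_alloc()                      memory hotplug
 *   ---------------                      --------------
 *   node_online(searchnode) -> true
 *                                        node taken offline here
 *   goto new_slab;   <- allocates against a now-offline node
 */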

> }
> }
> --
> 2.14.3
>

--
Michal Hocko
SUSE Labs

by Christopher Lameter

Subject: Re: [RFC 1/2] slub: Avoid trying to allocate memory on offline nodes

On Wed, 1 Aug 2018, Jeremy Linton wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 51258eff4178..e03719bac1e2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2519,6 +2519,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> if (unlikely(!node_match(page, searchnode))) {
> stat(s, ALLOC_NODE_MISMATCH);
> deactivate_slab(s, page, c->freelist, c);
> + if (!node_online(searchnode))
> + node = NUMA_NO_NODE;
> goto new_slab;
> }
> }
>

Would it not be better to implement this check in the page allocator?
There is also the issue of how to fall back to the nearest node.

NUMA_NO_NODE should fall back to the current memory allocation policy, but
it seems by inserting it here you would end up just with the default node
for the processor.
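Roughly the distinction, as a simplified sketch (the helper name is
illustrative, and the real paths also consult the task's
mempolicy/cpusets first):

/* Sketch: how a NUMA_NO_NODE request ends up on a node. */
static inline int pick_node(int node)
{
	if (node == NUMA_NO_NODE)
		return numa_mem_id();	/* nearest node with memory */
	return node;
}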


2018-08-03 03:15:05

by Jeremy Linton

Subject: Re: [RFC 1/2] slub: Avoid trying to allocate memory on offline nodes

Hi,

On 08/02/2018 09:23 AM, Christopher Lameter wrote:
> On Wed, 1 Aug 2018, Jeremy Linton wrote:
>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 51258eff4178..e03719bac1e2 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2519,6 +2519,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> if (unlikely(!node_match(page, searchnode))) {
>> stat(s, ALLOC_NODE_MISMATCH);
>> deactivate_slab(s, page, c->freelist, c);
>> + if (!node_online(searchnode))
>> + node = NUMA_NO_NODE;
>> goto new_slab;
>> }
>> }
>>
>
>> Would it not be better to implement this check in the page allocator?
>> There is also the issue of how to fall back to the nearest node.

Possibly? Falling back to the nearest node, though, should be handled if
memory-less node support is enabled, which in the problematic case it isn't.

>
> NUMA_NO_NODE should fall back to the current memory allocation policy, but
> it seems by inserting it here you would end up just with the default node
> for the processor.

I picked this spot (compared to 2/2) because a number of paths funnel
through here, and in this case it shouldn't be a very hot path.

2018-08-03 03:16:46

by Jeremy Linton

Subject: Re: [RFC 0/2] harden alloc_pages against bogus nid

Hi,

On 08/01/2018 07:14 PM, Andrew Morton wrote:
> On Wed, 1 Aug 2018 17:56:46 -0500 Jeremy Linton <[email protected]> wrote:
>
>> Hi,
>>
>> On 08/01/2018 04:50 PM, Andrew Morton wrote:
>>> On Wed, 1 Aug 2018 15:04:16 -0500 Jeremy Linton <[email protected]> wrote:
>>>
>>>> The thread "avoid alloc memory on offline node"
>>>>
>>>> https://lkml.org/lkml/2018/6/7/251
>>>>
>>>> asked at one point why kzalloc_node was crashing rather than
>>>> returning memory from a valid node. The thread ended up fixing
>>>> the immediate causes of the crash but left open the case of bad
>>>> proximity values being in DSDT tables without corresponding
>>>> SRAT/SLIT entries, as is happening on another machine.
>>>>
>>>> It's also easy to fix that, but we should also harden the allocator
>>>> sufficiently that it doesn't crash when passed an invalid node ID.
>>>> There are a couple of possible ways to do this, and I've attached two
>>>> separate patches which individually fix that problem.
>>>>
>>>> The first detects the offline node before taking the new_slab code
>>>> path, once it becomes apparent that the allocation isn't going to
>>>> succeed. The second actually hardens node_zonelist() and
>>>> prepare_alloc_pages() against NODE_DATA(nid) returning NULL, and hence
>>>> against a NULL zonelist. This latter case happens if the node has never
>>>> been initialized or is possibly out of range. There are other places
>>>> (NODE_DATA & online_node) which should be checking whether node IDs
>>>> are >= MAX_NUMNODES.
>>>>
>>>
>>> What is it that leads to a caller requesting memory from an invalid
>>> node? A race against offlining? If so then that's a lack of
>>> appropriate locking, isn't it?
>>
>> There were a couple of unrelated cases, both having to do with the PXM
>> associated with a PCI port. In the first case, AFAIK, the domain wasn't
>> really invalid; it would have been fine had the entire SRAT been parsed
>> and nodes created even when there weren't associated CPUs. The second
>> case (a different machine) is simply a PXM value that is completely
>> invalid (no associated SLIT/SRAT/etc entries) due to firmware making a
>> mistake when a socket isn't populated.
>>
>> There have been a few other suggested or merged patches for the
>> individual problems above; this set is just an attempt at avoiding a
>> full crash if/when another similar problem happens.
>
> Please add the above info to the changelog.

Sure.

>
>>
>>>
>>> I don't see a problem with emitting a warning and then selecting a
>>> different node so we can keep running. But we do want that warning, so
>>> we can understand the root cause and fix it?
>>
>> Yes, we do want to know when an invalid ID is passed; I will add the
>> VM_WARN in the first one.
>>
>> I wasn't sure about the second one, as failing prepare_alloc_pages()
>> generates a couple of error messages, but the system then continues
>> operating.
>>
>> I guess my question though is which method (or both/something else?) is
>> the preferred way to harden this up?
>
> The first patch looked neater. Can we get a WARN_ON in there as well?
>

Yes,

Thanks,

2018-08-03 03:18:55

by Jeremy Linton

Subject: Re: [RFC 2/2] mm: harden alloc_pages code paths against bogus nodes

Hi,

On 08/02/2018 02:31 AM, Michal Hocko wrote:
> On Wed 01-08-18 15:04:18, Jeremy Linton wrote:
>> It's possible to crash __alloc_pages_nodemask by passing it
>> bogus node IDs. This is caused by NODE_DATA() returning NULL
>> (hopefully) when the requested node is offline. We can
>> harden against the basic case of a mostly-valid node that
>> isn't online by checking for NULL and failing prepare_alloc_pages.
>>
>> But this then suggests we should also harden NODE_DATA() like this:
>>
>> #define NODE_DATA(nid) ( (nid) < MAX_NUMNODES ? node_data[(nid)] : NULL)
>>
>> Eventually this starts to add a bunch of generally unneeded checks
>> in some code paths that are called quite frequently.
>
> But the page allocator is really a hot path and people will not be happy
> to have yet another branch there. No code should really use invalid NUMA
> node IDs in the first place.
>
> If I remember those bugs correctly then it was the arch code which was
> doing something wrong. I would prefer that code to be fixed instead.

Yes, I think the consensus is that 2/2 should be dropped.

The arch code is being fixed (in both cases); this patch set is just an
attempt to harden this code path against future failures like that, so
that we get some warnings/ugly messages rather than early boot failures.

Thanks,



>> Signed-off-by: Jeremy Linton <[email protected]>
>> ---
>> include/linux/gfp.h | 2 ++
>> mm/page_alloc.c | 2 ++
>> 2 files changed, 4 insertions(+)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index a6afcec53795..17d70271c42e 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -436,6 +436,8 @@ static inline int gfp_zonelist(gfp_t flags)
>> */
>> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>> {
>> + if (unlikely(!NODE_DATA(nid))) //VM_WARN_ON?
>> + return NULL;
>> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>> }
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index a790ef4be74e..3a3d9ac2662a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4306,6 +4306,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>> {
>> ac->high_zoneidx = gfp_zone(gfp_mask);
>> ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
>> + if (!ac->zonelist)
>> + return false;
>> ac->nodemask = nodemask;
>> ac->migratetype = gfpflags_to_migratetype(gfp_mask);
>>
>> --
>> 2.14.3
>>
>


2018-08-03 03:22:59

by Jeremy Linton

Subject: Re: [RFC 1/2] slub: Avoid trying to allocate memory on offline nodes

Hi,

On 08/02/2018 04:15 AM, Michal Hocko wrote:
> On Wed 01-08-18 15:04:17, Jeremy Linton wrote:
> [...]
>> @@ -2519,6 +2519,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> if (unlikely(!node_match(page, searchnode))) {
>> stat(s, ALLOC_NODE_MISMATCH);
>> deactivate_slab(s, page, c->freelist, c);
>> + if (!node_online(searchnode))
>> + node = NUMA_NO_NODE;
>> goto new_slab;
>
> This is inherently racy. A NUMA node can go offline at any point after
> you check it here. Making it race free would involve some sort of
> locking, and I am not really convinced this is a good idea.

I spent some time looking/thinking about this, and I'm pretty sure it's
not creating any new problems. But OTOH, I think the node_online() check
is probably a bit misleading, as what we really want to ensure is that
node < MAX_NUMNODES and that there is going to be a valid entry in
NODE_DATA() so we don't dereference NULL.
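Something along these lines, perhaps (just a sketch of that stricter
check; where it belongs is the open question):

	if (node != NUMA_NO_NODE &&
	    (node >= MAX_NUMNODES || !NODE_DATA(node)))
		node = NUMA_NO_NODE;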


>
>> }
>> }
>> --
>> 2.14.3
>>
>


2018-08-03 06:22:41

by Michal Hocko

Subject: Re: [RFC 1/2] slub: Avoid trying to allocate memory on offline nodes

On Thu 02-08-18 22:21:53, Jeremy Linton wrote:
> Hi,
>
> On 08/02/2018 04:15 AM, Michal Hocko wrote:
> > On Wed 01-08-18 15:04:17, Jeremy Linton wrote:
> > [...]
> > > @@ -2519,6 +2519,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > if (unlikely(!node_match(page, searchnode))) {
> > > stat(s, ALLOC_NODE_MISMATCH);
> > > deactivate_slab(s, page, c->freelist, c);
> > > + if (!node_online(searchnode))
> > > + node = NUMA_NO_NODE;
> > > goto new_slab;
> >
> > This is inherently racy. A NUMA node can go offline at any point after
> > you check it here. Making it race free would involve some sort of
> > locking, and I am not really convinced this is a good idea.
>
> I spent some time looking/thinking about this, and I'm pretty sure it's
> not creating any new problems. But OTOH, I think the node_online() check
> is probably a bit misleading, as what we really want to ensure is that
> node < MAX_NUMNODES and that there is going to be a valid entry in
> NODE_DATA() so we don't dereference NULL.

Exactly. And we do rely on the allocator's users not passing bogus
parameters. This is not a function to be used with untrusted or
unsanitized inputs.

--
Michal Hocko
SUSE Labs

2018-08-03 06:25:49

by Michal Hocko

Subject: Re: [RFC 2/2] mm: harden alloc_pages code paths against bogus nodes

On Thu 02-08-18 22:17:49, Jeremy Linton wrote:
> Hi,
>
> On 08/02/2018 02:31 AM, Michal Hocko wrote:
> > On Wed 01-08-18 15:04:18, Jeremy Linton wrote:
> > > It's possible to crash __alloc_pages_nodemask by passing it
> > > bogus node IDs. This is caused by NODE_DATA() returning NULL
> > > (hopefully) when the requested node is offline. We can
> > > harden against the basic case of a mostly-valid node that
> > > isn't online by checking for NULL and failing prepare_alloc_pages.
> > >
> > > But this then suggests we should also harden NODE_DATA() like this:
> > >
> > > #define NODE_DATA(nid) ( (nid) < MAX_NUMNODES ? node_data[(nid)] : NULL)
> > >
> > > Eventually this starts to add a bunch of generally unneeded checks
> > > in some code paths that are called quite frequently.
> >
> > But the page allocator is really a hot path and people will not be happy
> > to have yet another branch there. No code should really use invalid NUMA
> > node IDs in the first place.
> >
> > If I remember those bugs correctly then it was the arch code which was
> > doing something wrong. I would prefer that code to be fixed instead.
>
> Yes, I think the consensus is that 2/2 should be dropped.
>
> The arch code is being fixed (in both cases); this patch set is just an
> attempt to harden this code path against future failures like that, so
> that we get some warnings/ugly messages rather than early boot failures.

Hmm, this is a completely different story. We do have VM_{BUG,WARN}_ON,
which are no-ops for most configurations. They are primarily meant to be
enabled for developers or special debug kernels. If you have an example
where such an early splat in the log would save a lot of head scratching,
then that would sound like a reasonable justification to add
VM_WARN_ON(!NODE_DATA(nid))
into the page allocator, methinks. But considering that we should get a
NULL ptr splat anyway, I am not really so sure. But maybe we are in a
context where a warning would get into the log while a blow-up would
just make the whole machine silent...
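In prepare_alloc_pages() that might look roughly like this (a sketch
combining the debug-only warning with the NULL check from 2/2):

	/* Sketch: warn on debug kernels, then fail gracefully. */
	VM_WARN_ON(!NODE_DATA(preferred_nid));
	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
	if (!ac->zonelist)
		return false;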
--
Michal Hocko
SUSE Labs