2016-03-25 06:57:15

by Xishi Qiu

Subject: [PATCH] mm: fix invalid node in alloc_migrate_target()

It is incorrect to use next_node() to find a target node; it can
return MAX_NUMNODES or an invalid (offline) node, which will lead
to a crash in buddy system allocation.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/page_isolation.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 92c4c36..31555b6 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
* now as a simple work-around, we use the next node for destination.
*/
if (PageHuge(page)) {
- nodemask_t src = nodemask_of_node(page_to_nid(page));
- nodemask_t dst;
- nodes_complement(dst, src);
+ int node = next_online_node(page_to_nid(page));
+ if (node == MAX_NUMNODES)
+ node = first_online_node;
return alloc_huge_page_node(page_hstate(compound_head(page)),
- next_node(page_to_nid(page), dst));
+ node);
}

if (PageHighMem(page))
--
1.8.3.1



2016-03-25 19:22:39

by Andrew Morton

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <[email protected]> wrote:

> It is incorrect to use next_node() to find a target node; it can
> return MAX_NUMNODES or an invalid (offline) node, which will lead
> to a crash in buddy system allocation.
>
> ...
>
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
> * now as a simple work-around, we use the next node for destination.
> */
> if (PageHuge(page)) {
> - nodemask_t src = nodemask_of_node(page_to_nid(page));
> - nodemask_t dst;
> - nodes_complement(dst, src);
> + int node = next_online_node(page_to_nid(page));
> + if (node == MAX_NUMNODES)
> + node = first_online_node;
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> - next_node(page_to_nid(page), dst));
> + node);
> }
>
> if (PageHighMem(page))

Indeed. Can you tell us more about the circumstances under which the
kernel will crash? I need to decide which kernel version(s) need the
patch, but the changelog doesn't contain the info needed to make this
decision (it should).



next_node() isn't a very useful interface, really. Just about every
caller does this:


node = next_node(node, XXX);
if (node == MAX_NUMNODES)
        node = first_node(XXX);

so how about we write a function which does that, and stop open-coding
the same thing everywhere?

And I think your fix could then use such a function:

int node = that_new_function(page_to_nid(page), node_online_map);



Also, mm/mempolicy.c:offset_il_node() worries me:

do {
        nid = next_node(nid, pol->v.nodes);
        c++;
} while (c <= target);

Can't `nid' hit MAX_NUMNODES?


And can someone please explain mem_cgroup_select_victim_node() to me?
How can we hit the "node = numa_node_id()" path? Only if
memcg->scan_nodes is empty? Is that even valid? The comment seems to
have little to do with the code.

mpol_rebind_nodemask() is similar.



Something like this?


From: Andrew Morton <[email protected]>
Subject: include/linux/nodemask.h: create next_node_in() helper

Lots of code does

node = next_node(node, XXX);
if (node == MAX_NUMNODES)
        node = first_node(XXX);

so create next_node_in() to do this and use it in various places.

Cc: Xishi Qiu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Laura Abbott" <[email protected]>
Cc: Hui Zhu <[email protected]>
Cc: Wang Xiaoqiang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/nodemask.h | 18 +++++++++++++++++-
kernel/cpuset.c | 8 +-------
mm/hugetlb.c | 4 +---
mm/memcontrol.c | 4 +---
mm/mempolicy.c | 8 ++------
mm/page_isolation.c | 9 +++------
mm/slab.c | 13 +++----------
7 files changed, 28 insertions(+), 36 deletions(-)

diff -puN include/linux/nodemask.h~include-linux-nodemaskh-create-next_node_in-helper include/linux/nodemask.h
--- a/include/linux/nodemask.h~include-linux-nodemaskh-create-next_node_in-helper
+++ a/include/linux/nodemask.h
@@ -43,8 +43,10 @@
*
* int first_node(mask) Number lowest set bit, or MAX_NUMNODES
* int next_node(node, mask) Next node past 'node', or MAX_NUMNODES
+ * int next_node_in(node, mask) Next node past 'node', or wrap to first,
+ * or MAX_NUMNODES
* int first_unset_node(mask) First node not set in mask, or
- * MAX_NUMNODES.
+ * MAX_NUMNODES
*
* nodemask_t nodemask_of_node(node) Return nodemask with bit 'node' set
* NODE_MASK_ALL Initializer - all bits set
@@ -259,6 +261,20 @@ static inline int __next_node(int n, con
return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
}

+/*
+ * Find the next present node in src, starting after node n, wrapping around to
+ * the first node in src if needed. Returns MAX_NUMNODES if src is empty.
+ */
+#define next_node_in(n, src) __next_node_in((n), &(src))
+static inline int __next_node_in(int node, const nodemask_t *srcp)
+{
+ int ret = __next_node(node, srcp);
+
+ if (ret == MAX_NUMNODES)
+ ret = __first_node(srcp);
+ return ret;
+}
+
static inline void init_nodemask_of_node(nodemask_t *mask, int node)
{
nodes_clear(*mask);
diff -puN kernel/cpuset.c~include-linux-nodemaskh-create-next_node_in-helper kernel/cpuset.c
--- a/kernel/cpuset.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/kernel/cpuset.c
@@ -2591,13 +2591,7 @@ int __cpuset_node_allowed(int node, gfp_

static int cpuset_spread_node(int *rotor)
{
- int node;
-
- node = next_node(*rotor, current->mems_allowed);
- if (node == MAX_NUMNODES)
- node = first_node(current->mems_allowed);
- *rotor = node;
- return node;
+ return *rotor = next_node_in(*rotor, current->mems_allowed);
}

int cpuset_mem_spread_node(void)
diff -puN mm/hugetlb.c~include-linux-nodemaskh-create-next_node_in-helper mm/hugetlb.c
--- a/mm/hugetlb.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/hugetlb.c
@@ -937,9 +937,7 @@ err:
*/
static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
{
- nid = next_node(nid, *nodes_allowed);
- if (nid == MAX_NUMNODES)
- nid = first_node(*nodes_allowed);
+ nid = next_node_in(nid, *nodes_allowed);
VM_BUG_ON(nid >= MAX_NUMNODES);

return nid;
diff -puN mm/memcontrol.c~include-linux-nodemaskh-create-next_node_in-helper mm/memcontrol.c
--- a/mm/memcontrol.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/memcontrol.c
@@ -1388,9 +1388,7 @@ int mem_cgroup_select_victim_node(struct
mem_cgroup_may_update_nodemask(memcg);
node = memcg->last_scanned_node;

- node = next_node(node, memcg->scan_nodes);
- if (node == MAX_NUMNODES)
- node = first_node(memcg->scan_nodes);
+ node = next_node_in(node, memcg->scan_nodes);
/*
* We call this when we hit limit, not when pages are added to LRU.
* No LRU may hold pages because all pages are UNEVICTABLE or
diff -puN mm/mempolicy.c~include-linux-nodemaskh-create-next_node_in-helper mm/mempolicy.c
--- a/mm/mempolicy.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/mempolicy.c
@@ -347,9 +347,7 @@ static void mpol_rebind_nodemask(struct
BUG();

if (!node_isset(current->il_next, tmp)) {
- current->il_next = next_node(current->il_next, tmp);
- if (current->il_next >= MAX_NUMNODES)
- current->il_next = first_node(tmp);
+ current->il_next = next_node_in(current->il_next, tmp);
if (current->il_next >= MAX_NUMNODES)
current->il_next = numa_node_id();
}
@@ -1709,9 +1707,7 @@ static unsigned interleave_nodes(struct
struct task_struct *me = current;

nid = me->il_next;
- next = next_node(nid, policy->v.nodes);
- if (next >= MAX_NUMNODES)
- next = first_node(policy->v.nodes);
+ next = next_node_in(nid, policy->v.nodes);
if (next < MAX_NUMNODES)
me->il_next = next;
return nid;
diff -puN mm/page_isolation.c~include-linux-nodemaskh-create-next_node_in-helper mm/page_isolation.c
--- a/mm/page_isolation.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/page_isolation.c
@@ -288,13 +288,10 @@ struct page *alloc_migrate_target(struct
* accordance with memory policy of the user process if possible. For
* now as a simple work-around, we use the next node for destination.
*/
- if (PageHuge(page)) {
- int node = next_online_node(page_to_nid(page));
- if (node == MAX_NUMNODES)
- node = first_online_node;
+ if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
- node);
- }
+ next_node_in(page_to_nid(page),
+ node_online_map));

if (PageHighMem(page))
gfp_mask |= __GFP_HIGHMEM;
diff -puN mm/slab.c~include-linux-nodemaskh-create-next_node_in-helper mm/slab.c
--- a/mm/slab.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/slab.c
@@ -519,22 +519,15 @@ static DEFINE_PER_CPU(unsigned long, sla

static void init_reap_node(int cpu)
{
- int node;
-
- node = next_node(cpu_to_mem(cpu), node_online_map);
- if (node == MAX_NUMNODES)
- node = first_node(node_online_map);
-
- per_cpu(slab_reap_node, cpu) = node;
+ per_cpu(slab_reap_node, cpu) = next_node_in(cpu_to_mem(cpu),
+ node_online_map);
}

static void next_reap_node(void)
{
int node = __this_cpu_read(slab_reap_node);

- node = next_node(node, node_online_map);
- if (unlikely(node >= MAX_NUMNODES))
- node = first_node(node_online_map);
+ node = next_node_in(node, node_online_map);
__this_cpu_write(slab_reap_node, node);
}

_

2016-03-26 05:32:55

by Xishi Qiu

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 2016/3/26 3:22, Andrew Morton wrote:

> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <[email protected]> wrote:
>
>> It is incorrect to use next_node() to find a target node; it can
>> return MAX_NUMNODES or an invalid (offline) node, which will lead
>> to a crash in buddy system allocation.
>>
>> ...
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>> * now as a simple work-around, we use the next node for destination.
>> */
>> if (PageHuge(page)) {
>> - nodemask_t src = nodemask_of_node(page_to_nid(page));
>> - nodemask_t dst;
>> - nodes_complement(dst, src);
>> + int node = next_online_node(page_to_nid(page));
>> + if (node == MAX_NUMNODES)
>> + node = first_online_node;
>> return alloc_huge_page_node(page_hstate(compound_head(page)),
>> - next_node(page_to_nid(page), dst));
>> + node);
>> }
>>
>> if (PageHighMem(page))
>
> Indeed. Can you tell us more about the circumstances under which the
> kernel will crash? I need to decide which kernel version(s) need the
> patch, but the changelog doesn't contain the info needed to make this
> decision (it should).
>

Hi Andrew,

I read the v4.4 code and found that the following path may trigger the bug.

alloc_migrate_target()
  alloc_huge_page_node()  // the node may be offline or MAX_NUMNODES
    __alloc_buddy_huge_page_no_mpol()
      __alloc_buddy_huge_page()
        __hugetlb_alloc_buddy_huge_page()
          alloc_pages_node()
            __alloc_pages_node()
              VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
              VM_WARN_ON(!node_online(nid));
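
For reference, the v4.4 definition of __alloc_pages_node() looks roughly
like this (include/linux/gfp.h, abridged):

static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
        VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
        VM_WARN_ON(!node_online(nid));

        return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

So nid == MAX_NUMNODES trips the VM_BUG_ON (and with CONFIG_DEBUG_VM
disabled it would index NODE_DATA() out of bounds), while an offline nid
only triggers the warning before the allocation proceeds.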

Thanks,
Xishi Qiu

2016-03-29 09:52:43

by Vlastimil Babka

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 03/26/2016 06:31 AM, Xishi Qiu wrote:
> On 2016/3/26 3:22, Andrew Morton wrote:
>
>> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <[email protected]> wrote:
>>
>>> It is incorrect to use next_node() to find a target node; it can
>>> return MAX_NUMNODES or an invalid (offline) node, which will lead
>>> to a crash in buddy system allocation.
>>>
>>> ...
>>>
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>> * now as a simple work-around, we use the next node for destination.
>>> */
>>> if (PageHuge(page)) {
>>> - nodemask_t src = nodemask_of_node(page_to_nid(page));
>>> - nodemask_t dst;
>>> - nodes_complement(dst, src);
>>> + int node = next_online_node(page_to_nid(page));
>>> + if (node == MAX_NUMNODES)
>>> + node = first_online_node;
>>> return alloc_huge_page_node(page_hstate(compound_head(page)),
>>> - next_node(page_to_nid(page), dst));
>>> + node);
>>> }
>>>
>>> if (PageHighMem(page))
>>
>> Indeed. Can you tell us more about the circumstances under which the
>> kernel will crash? I need to decide which kernel version(s) need the
>> patch, but the changelog doesn't contain the info needed to make this
>> decision (it should).
>>
>
> Hi Andrew,
>
> I read the v4.4 code and found that the following path may trigger the bug.
>
> alloc_migrate_target()
> alloc_huge_page_node() // the node may be offline or MAX_NUMNODES
> __alloc_buddy_huge_page_no_mpol()
> __alloc_buddy_huge_page()
> __hugetlb_alloc_buddy_huge_page()

The code in this function seems to come from 099730d67417d ("mm,
hugetlb: use memory policy when available") by Dave Hansen (adding to
CC), which was indeed merged in 4.4-rc1.

However, alloc_pages_node() is only called in the block guarded by:

if (!IS_ENABLED(CONFIG_NUMA) || !vma) {

The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from immediate
followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when NUMA=n")

So I doubt the code path here can actually happen. But it's fragile and
confusing nevertheless.

> alloc_pages_node()
> __alloc_pages_node()
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> VM_WARN_ON(!node_online(nid));
>
> Thanks,
> Xishi Qiu
>

2016-03-29 10:06:18

by Vlastimil Babka

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 03/29/2016 11:52 AM, Vlastimil Babka wrote:
> On 03/26/2016 06:31 AM, Xishi Qiu wrote:
>> On 2016/3/26 3:22, Andrew Morton wrote:
>>
>>> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <[email protected]> wrote:
>>>
>>>> It is incorrect to use next_node() to find a target node; it can
>>>> return MAX_NUMNODES or an invalid (offline) node, which will lead
>>>> to a crash in buddy system allocation.
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_isolation.c
>>>> +++ b/mm/page_isolation.c
>>>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>>> * now as a simple work-around, we use the next node for destination.
>>>> */
>>>> if (PageHuge(page)) {
>>>> - nodemask_t src = nodemask_of_node(page_to_nid(page));
>>>> - nodemask_t dst;
>>>> - nodes_complement(dst, src);
>>>> + int node = next_online_node(page_to_nid(page));
>>>> + if (node == MAX_NUMNODES)
>>>> + node = first_online_node;
>>>> return alloc_huge_page_node(page_hstate(compound_head(page)),
>>>> - next_node(page_to_nid(page), dst));
>>>> + node);
>>>> }
>>>>
>>>> if (PageHighMem(page))
>>>
>>> Indeed. Can you tell us more about the circumstances under which the
>>> kernel will crash? I need to decide which kernel version(s) need the
>>> patch, but the changelog doesn't contain the info needed to make this
>>> decision (it should).
>>>
>>
>> Hi Andrew,
>>
>> I read the v4.4 code and found that the following path may trigger the bug.
>>
>> alloc_migrate_target()
>> alloc_huge_page_node() // the node may be offline or MAX_NUMNODES
>> __alloc_buddy_huge_page_no_mpol()
>> __alloc_buddy_huge_page()
>> __hugetlb_alloc_buddy_huge_page()
>
> The code in this function seems to come from 099730d67417d ("mm,
> hugetlb: use memory policy when available") by Dave Hansen (adding to
> CC), which was indeed merged in 4.4-rc1.
>
> However, alloc_pages_node() is only called in the block guarded by:
>
> if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
>
> The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from immediate
> followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when NUMA=n")
>
> So I doubt the code path here can actually happen. But it's fragile and
> confusing nevertheless.

Ah, so there's actually a dangerous path:
alloc_huge_page_node()
  dequeue_huge_page_node()
    list_for_each_entry(page, &h->hugepage_freelists[nid], lru)

hugepage_freelists[] has MAX_NUMNODES entries, so when nid is
MAX_NUMNODES, we access past the end of the array.
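
A sketch of the code involved (abridged from v4.4, mm/hugetlb.c and
include/linux/hugetlb.h; details elided):

struct hstate {
        ...
        struct list_head hugepage_freelists[MAX_NUMNODES];
        ...
};

static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
{
        struct page *page;

        /* nid == MAX_NUMNODES indexes one past the end of the array */
        list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
                if (!is_migrate_isolate_page(page))
                        break;
        ...
}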

However, look closer at how nid is obtained in alloc_migrate_target():

nodemask_t src = nodemask_of_node(page_to_nid(page));
nodemask_t dst;
nodes_complement(dst, src);

nid = next_node(page_to_nid(page), dst)

for nid to be MAX_NUMNODES, the original page has to be on node
MAX_NUMNODES-1, otherwise the complement part means we hit the very next
bit which is set.

It's actually a rather obfuscated way of doing:

nid = page_to_nid(page) + 1;
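
(To illustrate, assume MAX_NUMNODES is 4. For a page on node 1, src is
{1}, dst is {0,2,3}, and next_node(1, dst) returns 2, i.e.
page_to_nid(page) + 1. For a page on node 3, dst is {0,1,2}, no bit
above 3 is set, and next_node(3, dst) returns MAX_NUMNODES.)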

In that case the problem dates back to commit c8721bbbdd36 ("mm:
memory-hotplug: enable memory hotplug to handle hugepage") from 3.12, and
will likely affect only people who tune MAX_NUMNODES down to match their
machine.

>> alloc_pages_node()
>> __alloc_pages_node()
>> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> VM_WARN_ON(!node_online(nid));
>>
>> Thanks,
>> Xishi Qiu
>>
>

2016-03-29 10:38:36

by Xishi Qiu

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 2016/3/29 17:52, Vlastimil Babka wrote:

> On 03/26/2016 06:31 AM, Xishi Qiu wrote:
>> On 2016/3/26 3:22, Andrew Morton wrote:
>>
>>> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <[email protected]> wrote:
>>>
>>>> It is incorrect to use next_node() to find a target node; it can
>>>> return MAX_NUMNODES or an invalid (offline) node, which will lead
>>>> to a crash in buddy system allocation.
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_isolation.c
>>>> +++ b/mm/page_isolation.c
>>>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>>> * now as a simple work-around, we use the next node for destination.
>>>> */
>>>> if (PageHuge(page)) {
>>>> - nodemask_t src = nodemask_of_node(page_to_nid(page));
>>>> - nodemask_t dst;
>>>> - nodes_complement(dst, src);
>>>> + int node = next_online_node(page_to_nid(page));
>>>> + if (node == MAX_NUMNODES)
>>>> + node = first_online_node;
>>>> return alloc_huge_page_node(page_hstate(compound_head(page)),
>>>> - next_node(page_to_nid(page), dst));
>>>> + node);
>>>> }
>>>>
>>>> if (PageHighMem(page))
>>>
>>> Indeed. Can you tell us more about the circumstances under which the
>>> kernel will crash? I need to decide which kernel version(s) need the
>>> patch, but the changelog doesn't contain the info needed to make this
>>> decision (it should).
>>>
>>
>> Hi Andrew,
>>
>> I read the v4.4 code and found that the following path may trigger the bug.
>>
>> alloc_migrate_target()
>> alloc_huge_page_node() // the node may be offline or MAX_NUMNODES
>> __alloc_buddy_huge_page_no_mpol()
>> __alloc_buddy_huge_page()
>> __hugetlb_alloc_buddy_huge_page()
>
> The code in this function seems to come from 099730d67417d ("mm, hugetlb: use memory policy when available") by Dave Hansen (adding to CC), which was indeed merged in 4.4-rc1.
>
> However, alloc_pages_node() is only called in the block guarded by:
>
> if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
>
> The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from immediate followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when NUMA=n")
>
> So I doubt the code path here can actually happen. But it's fragile and confusing nevertheless.
>

Hi Vlastimil

__alloc_buddy_huge_page(h, NULL, addr, nid); // so the vma is NULL
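
For context, roughly how the NULL vma flows down in v4.4 (mm/hugetlb.c,
abridged):

struct page *alloc_huge_page_node(struct hstate *h, int nid)
{
        struct page *page = NULL;

        spin_lock(&hugetlb_lock);
        if (h->free_huge_pages - h->resv_huge_pages > 0)
                page = dequeue_huge_page_node(h, nid);
        spin_unlock(&hugetlb_lock);

        if (!page)
                page = __alloc_buddy_huge_page_no_mpol(h, nid);

        return page;
}

static struct page *__alloc_buddy_huge_page_no_mpol(struct hstate *h, int nid)
{
        unsigned long addr = -1;

        /* vma is NULL, so __hugetlb_alloc_buddy_huge_page() takes the
         * !vma branch and calls alloc_pages_node() with the bad nid */
        return __alloc_buddy_huge_page(h, NULL, addr, nid);
}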

Thanks,
Xishi Qiu

>> alloc_pages_node()
>> __alloc_pages_node()
>> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> VM_WARN_ON(!node_online(nid));
>>
>> Thanks,
>> Xishi Qiu
>>
>
>
> .
>



2016-03-29 12:21:11

by Vlastimil Babka

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 03/29/2016 12:37 PM, Xishi Qiu wrote:
> On 2016/3/29 17:52, Vlastimil Babka wrote:
>> The code in this function seems to come from 099730d67417d ("mm, hugetlb: use memory policy when available") by Dave Hansen (adding to CC), which was indeed merged in 4.4-rc1.
>>
>> However, alloc_pages_node() is only called in the block guarded by:
>>
>> if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
>>
>> The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from immediate followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when NUMA=n")
>>
>> So I doubt the code path here can actually happen. But it's fragile and confusing nevertheless.
>>
>
> Hi Vlastimil
>
> __alloc_buddy_huge_page(h, NULL, addr, nid); // so the vma is NULL

Hm that's true, I got lost in the logic, thanks.
But the problem with dequeue_huge_page_node() is also IMHO true, and
older, so we should fix 3.12+.

2016-03-29 12:25:10

by Vlastimil Babka

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 03/25/2016 07:56 AM, Xishi Qiu wrote:
> It is incorrect to use next_node() to find a target node; it can
> return MAX_NUMNODES or an invalid (offline) node, which will lead
> to a crash in buddy system allocation.

One possible place of crash is:
alloc_huge_page_node()
  dequeue_huge_page_node()
    [accesses h->hugepage_freelists[nid] with size MAX_NUMNODES]

> Signed-off-by: Xishi Qiu <[email protected]>

Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to
handle hugepage")
Cc: stable
Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_isolation.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 92c4c36..31555b6 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
> * now as a simple work-around, we use the next node for destination.
> */
> if (PageHuge(page)) {
> - nodemask_t src = nodemask_of_node(page_to_nid(page));
> - nodemask_t dst;
> - nodes_complement(dst, src);
> + int node = next_online_node(page_to_nid(page));
> + if (node == MAX_NUMNODES)
> + node = first_online_node;
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> - next_node(page_to_nid(page), dst));
> + node);
> }
>
> if (PageHighMem(page))
>

2016-03-29 13:06:25

by Vlastimil Babka

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 03/25/2016 08:22 PM, Andrew Morton wrote:
> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <[email protected]> wrote:
>
>> It is incorrect to use next_node() to find a target node; it can
>> return MAX_NUMNODES or an invalid (offline) node, which will lead
>> to a crash in buddy system allocation.
>>
>> ...
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>> * now as a simple work-around, we use the next node for destination.
>> */
>> if (PageHuge(page)) {
>> - nodemask_t src = nodemask_of_node(page_to_nid(page));
>> - nodemask_t dst;
>> - nodes_complement(dst, src);
>> + int node = next_online_node(page_to_nid(page));
>> + if (node == MAX_NUMNODES)
>> + node = first_online_node;
>> return alloc_huge_page_node(page_hstate(compound_head(page)),
>> - next_node(page_to_nid(page), dst));
>> + node);
>> }
>>
>> if (PageHighMem(page))
>
> Indeed. Can you tell us more about the circumstances under which the
> kernel will crash? I need to decide which kernel version(s) need the
> patch, but the changelog doesn't contain the info needed to make this
> decision (it should).
>
>
>
> next_node() isn't a very useful interface, really. Just about every
> caller does this:
>
>
> node = next_node(node, XXX);
> if (node == MAX_NUMNODES)
> node = first_node(XXX);
>
> so how about we write a function which does that, and stop open-coding
> the same thing everywhere?

Good idea.

> And I think your fix could then use such a function:
>
> int node = that_new_function(page_to_nid(page), node_online_map);
>
>
>
> Also, mm/mempolicy.c:offset_il_node() worries me:
>
> do {
> nid = next_node(nid, pol->v.nodes);
> c++;
> } while (c <= target);
>
> Can't `nid' hit MAX_NUMNODES?

AFAICS it can. interleave_nid() uses this and the nid is then used e.g.
in node_zonelist() where it's used for NODE_DATA(nid). That's quite
scary. It also predates git. Why don't we see crashes or KASAN finding this?

>
> And can someone please explain mem_cgroup_select_victim_node() to me?
> How can we hit the "node = numa_node_id()" path? Only if
> memcg->scan_nodes is empty? Is that even valid? The comment seems to
> have little to do with the code.

I understand the comment to mean that it's valid for scan_nodes to be
empty, and it lists reasons why that can happen (in somewhat broken
language). Note that I didn't verify these reasons:
- we call this when hitting the memcg limit, not when adding pages to an
LRU, as adding to an LRU would mean scan_nodes contains that LRU's node
- adding pages to the unevictable LRU doesn't add their node to
scan_nodes (probably because scanning the unevictable LRU would be
useless)
- for other reasons (which?) the memcg might have pages not on any LRU,
while being so small that it has no other pages that would be on an LRU

> mpol_rebind_nodemask() is similar.
>
>
>
> Something like this?
>
>
> From: Andrew Morton <[email protected]>
> Subject: include/linux/nodemask.h: create next_node_in() helper
>
> Lots of code does
>
> node = next_node(node, XXX);
> if (node == MAX_NUMNODES)
> node = first_node(XXX);
>
> so create next_node_in() to do this and use it in various places.
>
> Cc: Xishi Qiu <[email protected]>
> Cc: Vlastimil Babka <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

Patch doesn't address offset_il_node() which is good, because if it's
indeed buggy, it's serious and needs a non-cleanup patch.

2016-03-29 15:52:52

by Michal Hocko

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On Fri 25-03-16 12:22:37, Andrew Morton wrote:
[...]
> And can someone please explain mem_cgroup_select_victim_node() to me?
> How can we hit the "node = numa_node_id()" path? Only if
> memcg->scan_nodes is empty?

Yes, this seems to be the primary motivation.
mem_cgroup_may_update_nodemask might have seen all the pages on the
unevictable LRU the last time it checked.

> Is that even valid?

I suspect it is really rare, but it seems possible.

> The comment seems to have little to do with the code.

I guess the comment tries to say that the code path is triggered when we
charge the page, which happens _before_ it is added to the LRU list, so
last_scanned_node might contain stale data. Would something like the
following be clearer?
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17a847c96618..cff095318950 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1390,10 +1390,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)

node = next_node_in(node, memcg->scan_nodes);
/*
- * We call this when we hit limit, not when pages are added to LRU.
- * No LRU may hold pages because all pages are UNEVICTABLE or
- * memcg is too small and all pages are not on LRU. In that case,
- * we use curret node.
+ * mem_cgroup_may_update_nodemask might have seen no reclaimable pages
+ * last time it really checked all the LRUs due to rate limiting.
+ * Fall back to the current node in that case for simplicity.
*/
if (unlikely(node == MAX_NUMNODES))
node = numa_node_id();
--
Michal Hocko
SUSE Labs

2016-03-30 01:16:11

by Naoya Horiguchi

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On Tue, Mar 29, 2016 at 02:25:03PM +0200, Vlastimil Babka wrote:
> On 03/25/2016 07:56 AM, Xishi Qiu wrote:
> >It is incorrect to use next_node() to find a target node; it can
> >return MAX_NUMNODES or an invalid (offline) node, which will lead
> >to a crash in buddy system allocation.
>
> One possible place of crash is:
> alloc_huge_page_node()
> dequeue_huge_page_node()
> [accesses h->hugepage_freelists[nid] with size MAX_NUMNODES]
>
> >Signed-off-by: Xishi Qiu <[email protected]>
>
> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle
> hugepage")
> Cc: stable
> Acked-by: Vlastimil Babka <[email protected]>

Thanks everyone for finding/fixing the bug!

Acked-by: Naoya Horiguchi <[email protected]>

> >---
> > mm/page_isolation.c | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> >diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> >index 92c4c36..31555b6 100644
> >--- a/mm/page_isolation.c
> >+++ b/mm/page_isolation.c
> >@@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
> > * now as a simple work-around, we use the next node for destination.
> > */
> > if (PageHuge(page)) {
> >- nodemask_t src = nodemask_of_node(page_to_nid(page));
> >- nodemask_t dst;
> >- nodes_complement(dst, src);
> >+ int node = next_online_node(page_to_nid(page));
> >+ if (node == MAX_NUMNODES)
> >+ node = first_online_node;
> > return alloc_huge_page_node(page_hstate(compound_head(page)),
> >- next_node(page_to_nid(page), dst));
> >+ node);
> > }
> >
> > if (PageHighMem(page))
> >
>

2016-03-31 13:13:52

by Vlastimil Babka

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 03/29/2016 03:06 PM, Vlastimil Babka wrote:
> On 03/25/2016 08:22 PM, Andrew Morton wrote:
>> Also, mm/mempolicy.c:offset_il_node() worries me:
>>
>> do {
>> nid = next_node(nid, pol->v.nodes);
>> c++;
>> } while (c <= target);
>>
>> Can't `nid' hit MAX_NUMNODES?
>
> AFAICS it can. interleave_nid() uses this and the nid is then used e.g.
> in node_zonelist() where it's used for NODE_DATA(nid). That's quite
> scary. It also predates git. Why don't we see crashes or KASAN finding this?

Ah, I see. In offset_il_node(), nid is initialized to -1, and the number
of do-while iterations calling next_node() is up to the number of bits
set in the pol->v.nodes bitmap, so it can't reach past the last set bit
and return MAX_NUMNODES.
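
(For example, with pol->v.nodes = {0,2}: nnodes is 2, so target is 0 or
1. Starting from nid = -1, the loop makes at most two next_node() calls,
yielding node 0 and then node 2, never stepping past the last set bit.)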

2016-03-31 21:01:33

by Andrew Morton

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On Thu, 31 Mar 2016 15:13:41 +0200 Vlastimil Babka <[email protected]> wrote:

> On 03/29/2016 03:06 PM, Vlastimil Babka wrote:
> > On 03/25/2016 08:22 PM, Andrew Morton wrote:
> >> Also, mm/mempolicy.c:offset_il_node() worries me:
> >>
> >> do {
> >> nid = next_node(nid, pol->v.nodes);
> >> c++;
> >> } while (c <= target);
> >>
> >> Can't `nid' hit MAX_NUMNODES?
> >
> > AFAICS it can. interleave_nid() uses this and the nid is then used e.g.
> > in node_zonelist() where it's used for NODE_DATA(nid). That's quite
> > scary. It also predates git. Why don't we see crashes or KASAN finding this?
>
> Ah, I see. In offset_il_node(), nid is initialized to -1, and the number
> of do-while iterations calling next_node() is up to the number of bits
> set in the pol->v.nodes bitmap, so it can't reach past the last set bit
> and return MAX_NUMNODES.

Gack. offset_il_node() should be dragged out, strangled, shot then burnt.

static unsigned offset_il_node(struct mempolicy *pol,
                               struct vm_area_struct *vma, unsigned long off)
{
        unsigned nnodes = nodes_weight(pol->v.nodes);
        unsigned target;
        int c;
        int nid = NUMA_NO_NODE;

        if (!nnodes)
                return numa_node_id();
        target = (unsigned int)off % nnodes;
        c = 0;
        do {
                nid = next_node(nid, pol->v.nodes);
                c++;
        } while (c <= target);
        return nid;
}

For starters it is relying upon next_node(-1, ...) behaving like
first_node(). Fair enough I guess, but that isn't very clear.

static inline int __next_node(int n, const nodemask_t *srcp)
{
        return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
}

will start from node 0 when it does the n+1.

Also it is relying upon NUMA_NO_NODE having a value of -1. That's just
grubby - this code shouldn't "know" that NUMA_NO_NODE==-1. It would have
been better to use plain old "-1" here.


Does this look clearer and correct?

/*
 * Do static interleaving for a VMA with known offset @n. Returns the n'th
 * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
 * number of present nodes.
 */
static unsigned offset_il_node(struct mempolicy *pol,
                               struct vm_area_struct *vma, unsigned long n)
{
        unsigned nnodes = nodes_weight(pol->v.nodes);
        unsigned target;
        int i;
        int nid;

        if (!nnodes)
                return numa_node_id();
        target = (unsigned int)n % nnodes;
        nid = first_node(pol->v.nodes);
        for (i = 0; i < target; i++)
                nid = next_node(nid, pol->v.nodes);
        return nid;
}
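
(For example, with pol->v.nodes = {0,2,5} and n = 4: nnodes is 3, target
is 4 % 3 = 1, nid starts at first_node() = 0, and one next_node() step
yields node 2.)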


From: Andrew Morton <[email protected]>
Subject: mm/mempolicy.c:offset_il_node() document and clarify

This code was pretty obscure and was relying upon obscure side-effects of
next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to -1.

Clean that all up and document the function's intent.

Cc: Vlastimil Babka <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Laura Abbott <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/mempolicy.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff -puN mm/mempolicy.c~mm-mempolicyc-offset_il_node-document-and-clarify mm/mempolicy.c
--- a/mm/mempolicy.c~mm-mempolicyc-offset_il_node-document-and-clarify
+++ a/mm/mempolicy.c
@@ -1763,23 +1763,25 @@ unsigned int mempolicy_slab_node(void)
}
}

-/* Do static interleaving for a VMA with known offset. */
+/*
+ * Do static interleaving for a VMA with known offset @n. Returns the n'th
+ * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
+ * number of present nodes.
+ */
static unsigned offset_il_node(struct mempolicy *pol,
- struct vm_area_struct *vma, unsigned long off)
+ struct vm_area_struct *vma, unsigned long n)
{
unsigned nnodes = nodes_weight(pol->v.nodes);
unsigned target;
- int c;
- int nid = NUMA_NO_NODE;
+ int i;
+ int nid;

if (!nnodes)
return numa_node_id();
- target = (unsigned int)off % nnodes;
- c = 0;
- do {
+ target = (unsigned int)n % nnodes;
+ nid = first_node(pol->v.nodes);
+ for (i = 0; i < target; i++)
nid = next_node(nid, pol->v.nodes);
- c++;
- } while (c <= target);
return nid;
}

_

2016-04-01 08:43:01

by Vlastimil Babka

Subject: Re: [PATCH] mm: fix invalid node in alloc_migrate_target()

On 03/31/2016 11:01 PM, Andrew Morton wrote:
> On Thu, 31 Mar 2016 15:13:41 +0200 Vlastimil Babka <[email protected]> wrote:
>
>> On 03/29/2016 03:06 PM, Vlastimil Babka wrote:
>> > On 03/25/2016 08:22 PM, Andrew Morton wrote:
>> >> Also, mm/mempolicy.c:offset_il_node() worries me:
>> >>
>> >> do {
>> >> nid = next_node(nid, pol->v.nodes);
>> >> c++;
>> >> } while (c <= target);
>> >>
>> >> Can't `nid' hit MAX_NUMNODES?
>> >
>> > AFAICS it can. interleave_nid() uses this and the nid is then used e.g.
>> > in node_zonelist() where it's used for NODE_DATA(nid). That's quite
>> > scary. It also predates git. Why don't we see crashes or KASAN finding this?
>>
>> Ah, I see. In offset_il_node(), nid is initialized to -1, and the number
>> of do-while iterations calling next_node() is up to the number of bits
>> set in the pol->v.nodes bitmap, so it can't reach past the last set bit
>> and return MAX_NUMNODES.
>
> Gack. offset_il_node() should be dragged out, strangled, shot then burnt.

Ah, but you went with the much less amusing alternative of just fixing it.

> static unsigned offset_il_node(struct mempolicy *pol,
> struct vm_area_struct *vma, unsigned long off)
> {
> unsigned nnodes = nodes_weight(pol->v.nodes);
> unsigned target;
> int c;
> int nid = NUMA_NO_NODE;
>
> if (!nnodes)
> return numa_node_id();
> target = (unsigned int)off % nnodes;
> c = 0;
> do {
> nid = next_node(nid, pol->v.nodes);
> c++;
> } while (c <= target);
> return nid;
> }
>
> For starters it is relying upon next_node(-1, ...) behaving like
> first_node(). Fair enough I guess, but that isn't very clear.
>
> static inline int __next_node(int n, const nodemask_t *srcp)
> {
> return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
> }
>
> will start from node 0 when it does the n+1.
>
> Also it is relying upon NUMA_NO_NODE having a value of -1. That's just
> grubby - this code shouldn't "know" that NUMA_NO_NODE==-1. It would have
> been better to use plain old "-1" here.

Yeah, looks like a blind change of all "-1" to "NUMA_NO_NODE" happened at some point.

>
> Does this look clearer and correct?

Definitely.

> /*
> * Do static interleaving for a VMA with known offset @n. Returns the n'th
> * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
> * number of present nodes.
> */
> static unsigned offset_il_node(struct mempolicy *pol,
> struct vm_area_struct *vma, unsigned long n)
> {
> unsigned nnodes = nodes_weight(pol->v.nodes);
> unsigned target;
> int i;
> int nid;
>
> if (!nnodes)
> return numa_node_id();
> target = (unsigned int)n % nnodes;
> nid = first_node(pol->v.nodes);
> for (i = 0; i < target; i++)
> nid = next_node(nid, pol->v.nodes);
> return nid;
> }
>
>
> From: Andrew Morton <[email protected]>
> Subject: mm/mempolicy.c:offset_il_node() document and clarify
>
> This code was pretty obscure and was relying upon obscure side-effects of
> next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to -1.
>
> Clean that all up and document the function's intent.
>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Xishi Qiu <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Laura Abbott <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>