2021-06-22 02:15:51

by Mike Kravetz

Subject: [PATCH 0/2] Fix prep_compound_gigantic_page ref count adjustment

These patches address the possible race between prep_compound_gigantic_page
and __page_cache_add_speculative as described by Jann Horn in [1].

The first patch simply removes the unnecessary/obsolete helper routine
prep_compound_huge_page to make the actual fix a little simpler.

The second patch is the actual fix and has a detailed explanation in the
commit message.

This potential issue has existed for almost 10 years and I am unaware of
anyone actually hitting the race. I did not cc stable, but would be
happy to squash the patches and send to stable if anyone thinks that is
a good idea.

I could not think of a reliable way to recreate the issue for testing.
Rather, I 'simulated errors' to exercise all the error paths.

Mike Kravetz (2):
hugetlb: remove prep_compound_huge_page cleanup
hugetlb: address ref count racing in prep_compound_gigantic_page

mm/hugetlb.c | 99 ++++++++++++++++++++++++++++++++++++-------------
mm/page_alloc.c | 1 -
2 files changed, 73 insertions(+), 27 deletions(-)

--
2.31.1


2021-06-22 02:16:06

by Mike Kravetz

Subject: [PATCH 1/2] hugetlb: remove prep_compound_huge_page cleanup

The routine prep_compound_huge_page is a simple wrapper to call either
prep_compound_gigantic_page or prep_compound_page. However, it is only
called from gather_bootmem_prealloc which only processes gigantic pages.
Eliminate the routine and call prep_compound_gigantic_page directly.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 29 ++++++++++-------------------
1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 760b5fb836b8..50596b7d6da9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1320,8 +1320,6 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
}

-static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
-static void prep_compound_gigantic_page(struct page *page, unsigned int order);
#else /* !CONFIG_CONTIG_ALLOC */
static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
@@ -2759,16 +2757,10 @@ int __alloc_bootmem_huge_page(struct hstate *h)
return 1;
}

-static void __init prep_compound_huge_page(struct page *page,
- unsigned int order)
-{
- if (unlikely(order > (MAX_ORDER - 1)))
- prep_compound_gigantic_page(page, order);
- else
- prep_compound_page(page, order);
-}
-
-/* Put bootmem huge pages into the standard lists after mem_map is up */
+/*
+ * Put bootmem huge pages into the standard lists after mem_map is up.
+ * Note: This only applies to gigantic (order > MAX_ORDER) pages.
+ */
static void __init gather_bootmem_prealloc(void)
{
struct huge_bootmem_page *m;
@@ -2777,20 +2769,19 @@ static void __init gather_bootmem_prealloc(void)
struct page *page = virt_to_page(m);
struct hstate *h = m->hstate;

+ VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(page_count(page) != 1);
- prep_compound_huge_page(page, huge_page_order(h));
+ prep_compound_gigantic_page(page, huge_page_order(h));
WARN_ON(PageReserved(page));
prep_new_huge_page(h, page, page_to_nid(page));
put_page(page); /* free it into the hugepage allocator */

/*
- * If we had gigantic hugepages allocated at boot time, we need
- * to restore the 'stolen' pages to totalram_pages in order to
- * fix confusing memory reports from free(1) and another
- * side-effects, like CommitLimit going negative.
+ * We need to restore the 'stolen' pages to totalram_pages
+ * in order to fix confusing memory reports from free(1) and
+ * other side-effects, like CommitLimit going negative.
*/
- if (hstate_is_gigantic(h))
- adjust_managed_page_count(page, pages_per_huge_page(h));
+ adjust_managed_page_count(page, pages_per_huge_page(h));
cond_resched();
}
}
--
2.31.1

2021-06-22 02:17:16

by Mike Kravetz

Subject: Re: [PATCH 0/2] Fix prep_compound_gigantic_page ref count adjustment

On 6/21/21 7:14 PM, Mike Kravetz wrote:
> These patches address the possible race between prep_compound_gigantic_page
> and __page_cache_add_speculative as described by Jann Horn in [1].

Oops,

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/

--
Mike Kravetz

2021-06-22 02:17:19

by Mike Kravetz

Subject: [PATCH 2/2] hugetlb: address ref count racing in prep_compound_gigantic_page

In [1], Jann Horn points out a possible race between
prep_compound_gigantic_page and __page_cache_add_speculative. The
root cause of the possible race is prep_compound_gigantic_page
unconditionally setting the ref count of pages to zero. It does this
because prep_compound_gigantic_page is handed a 'group' of pages from an
allocator and needs to convert that group of pages to a compound page.
The ref count of each page in this 'group' is one as set by the
allocator. However, the ref count of compound page tail pages must be
zero.

The potential race comes about when ref counted pages are returned from
the allocator. When this happens, other mm code could also take a
reference on the page. __page_cache_add_speculative is one such
example. Therefore, prep_compound_gigantic_page can not just set the
ref count of pages to zero as it does today. Doing so would lose the
reference taken by any other code. This would lead to BUGs in code
checking ref counts and could possibly even lead to memory corruption.
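
Roughly, the problematic interleaving looks like this (an illustrative
sketch only; get_page_unless_zero() stands in for the speculative page
cache reference):

  prep_compound_gigantic_page()          speculative page cache lookup
  -----------------------------          -----------------------------
                                         page = xas_load(&xas);
                                         /* stale pointer to what is now a
                                            'to be' tail page, ref count 1 */
                                         get_page_unless_zero(page);
                                         /* succeeds, ref count 1 -> 2 */
  set_page_count(p, 0);
  /* the speculative reference is lost */
                                         put_page(page);
                                         /* ref count 0 -> -1: BUG, possible
                                            use after free */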

There are two possible ways to address this issue.
1) Make all allocators of gigantic groups of pages be able to return a
properly constructed compound page.
2) Make prep_compound_gigantic_page be more careful when constructing a
compound page.
This patch takes approach 2.

In prep_compound_gigantic_page, use cmpxchg to only set ref count to
zero if it is one. If the cmpxchg fails, call synchronize_rcu() in the
hope that the extra ref count will be dropped during an RCU grace
period. This is not a performance critical code path and the wait
should be acceptable. If the ref count is still inflated after the
grace period, then undo any modifications made and return an error.
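
For reference, page_ref_freeze(p, 1) is roughly the following cmpxchg
(see include/linux/page_ref.h); it only succeeds if no other code holds
a reference:

	/* approximate behavior of page_ref_freeze() */
	static inline int page_ref_freeze(struct page *page, int count)
	{
		return likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
	}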

Currently prep_compound_gigantic_page is type void and does not return
errors. Modify the two callers to check for and handle error returns.
On error, the caller must free the 'group' of pages as they can not be
used to form a gigantic page. After freeing pages, the runtime caller
(alloc_fresh_huge_page) will retry the allocation once. Boot time
allocations can not be retried.

The routine prep_compound_page also unconditionally sets the ref count
of compound page tail pages to zero. However, in this case the buddy
allocator is constructing a compound page from freshly allocated pages.
The ref count on those freshly allocated pages is already zero, so the
set_page_count(p, 0) is unnecessary and could lead to confusion.
Just remove it.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
Fixes: 58a84aa92723 ("thp: set compound tail page _count to zero")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 72 +++++++++++++++++++++++++++++++++++++++++++------
mm/page_alloc.c | 1 -
2 files changed, 64 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 50596b7d6da9..924553aa8f78 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1623,9 +1623,9 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
spin_unlock_irq(&hugetlb_lock);
}

-static void prep_compound_gigantic_page(struct page *page, unsigned int order)
+static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
{
- int i;
+ int i, j;
int nr_pages = 1 << order;
struct page *p = page + 1;

@@ -1647,11 +1647,48 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
* after get_user_pages().
*/
__ClearPageReserved(p);
+ /*
+ * Subtle and very unlikely
+ *
+ * Gigantic 'page allocators' such as memblock or cma will
+ * return a set of pages with each page ref counted. We need
+ * to turn this set of pages into a compound page with tail
+ * page ref counts set to zero. Code such as speculative page
+ * cache adding could take a ref on a 'to be' tail page.
+ * We need to respect any increased ref count, and only set
+ * the ref count to zero if count is currently 1. If count
+ * is not 1, we call synchronize_rcu in the hope that a rcu
+ * grace period will cause ref count to drop and then retry.
+ * If count is still inflated on retry we return an error and
+ * must discard the pages.
+ */
+ if (!page_ref_freeze(p, 1)) {
+ pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
+ synchronize_rcu();
+ if (!page_ref_freeze(p, 1))
+ goto out_error;
+ }
set_page_count(p, 0);
set_compound_head(p, page);
}
atomic_set(compound_mapcount_ptr(page), -1);
atomic_set(compound_pincount_ptr(page), 0);
+ return true;
+
+out_error:
+ /* undo tail page modifications made above */
+ p = page + 1;
+ for (j = 1; j < i; j++, p = mem_map_next(p, page, j)) {
+ clear_compound_head(p);
+ set_page_refcounted(p);
+ }
+ /* need to clear PG_reserved on remaining tail pages */
+ for (; j < nr_pages; j++, p = mem_map_next(p, page, j))
+ __ClearPageReserved(p);
+ set_compound_order(page, 0);
+ page[1].compound_nr = 0;
+ __ClearPageHead(page);
+ return false;
}

/*
@@ -1771,7 +1808,9 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
nodemask_t *node_alloc_noretry)
{
struct page *page;
+ bool retry = false;

+retry:
if (hstate_is_gigantic(h))
page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
else
@@ -1780,8 +1819,21 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
if (!page)
return NULL;

- if (hstate_is_gigantic(h))
- prep_compound_gigantic_page(page, huge_page_order(h));
+ if (hstate_is_gigantic(h)) {
+ if (!prep_compound_gigantic_page(page, huge_page_order(h))) {
+ /*
+ * Rare failure to convert pages to compound page.
+ * Free pages and try again - ONCE!
+ */
+ free_gigantic_page(page, huge_page_order(h));
+ if (!retry) {
+ retry = true;
+ goto retry;
+ }
+ pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+ return NULL;
+ }
+ }
prep_new_huge_page(h, page, page_to_nid(page));

return page;
@@ -2771,10 +2823,14 @@ static void __init gather_bootmem_prealloc(void)

VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(page_count(page) != 1);
- prep_compound_gigantic_page(page, huge_page_order(h));
- WARN_ON(PageReserved(page));
- prep_new_huge_page(h, page, page_to_nid(page));
- put_page(page); /* free it into the hugepage allocator */
+ if (prep_compound_gigantic_page(page, huge_page_order(h))) {
+ WARN_ON(PageReserved(page));
+ prep_new_huge_page(h, page, page_to_nid(page));
+ put_page(page); /* add to the hugepage allocator */
+ } else {
+ free_gigantic_page(page, huge_page_order(h));
+ pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+ }

/*
* We need to restore the 'stolen' pages to totalram_pages
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8836e54721ae..a672d9c85118 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -749,7 +749,6 @@ void prep_compound_page(struct page *page, unsigned int order)
__SetPageHead(page);
for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
- set_page_count(p, 0);
p->mapping = TAIL_MAPPING;
set_compound_head(p, page);
}
--
2.31.1

2021-06-22 09:11:57

by Muchun Song

Subject: Re: [External] [PATCH 1/2] hugetlb: remove prep_compound_huge_page cleanup

On Tue, Jun 22, 2021 at 10:15 AM Mike Kravetz <[email protected]> wrote:
>
> The routine prep_compound_huge_page is a simple wrapper to call either
> prep_compound_gigantic_page or prep_compound_page. However, it is only
> called from gather_bootmem_prealloc which only processes gigantic pages.
> Eliminate the routine and call prep_compound_gigantic_page directly.
>
> Signed-off-by: Mike Kravetz <[email protected]>

Nice clean-up. Thanks.

Reviewed-by: Muchun Song <[email protected]>

> ---
> mm/hugetlb.c | 29 ++++++++++-------------------
> 1 file changed, 10 insertions(+), 19 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 760b5fb836b8..50596b7d6da9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1320,8 +1320,6 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
> }
>
> -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
> -static void prep_compound_gigantic_page(struct page *page, unsigned int order);
> #else /* !CONFIG_CONTIG_ALLOC */
> static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> int nid, nodemask_t *nodemask)
> @@ -2759,16 +2757,10 @@ int __alloc_bootmem_huge_page(struct hstate *h)
> return 1;
> }
>
> -static void __init prep_compound_huge_page(struct page *page,
> - unsigned int order)
> -{
> - if (unlikely(order > (MAX_ORDER - 1)))
> - prep_compound_gigantic_page(page, order);
> - else
> - prep_compound_page(page, order);
> -}
> -
> -/* Put bootmem huge pages into the standard lists after mem_map is up */
> +/*
> + * Put bootmem huge pages into the standard lists after mem_map is up.
> + * Note: This only applies to gigantic (order > MAX_ORDER) pages.
> + */
> static void __init gather_bootmem_prealloc(void)
> {
> struct huge_bootmem_page *m;
> @@ -2777,20 +2769,19 @@ static void __init gather_bootmem_prealloc(void)
> struct page *page = virt_to_page(m);
> struct hstate *h = m->hstate;
>
> + VM_BUG_ON(!hstate_is_gigantic(h));
> WARN_ON(page_count(page) != 1);
> - prep_compound_huge_page(page, huge_page_order(h));
> + prep_compound_gigantic_page(page, huge_page_order(h));
> WARN_ON(PageReserved(page));
> prep_new_huge_page(h, page, page_to_nid(page));
> put_page(page); /* free it into the hugepage allocator */
>
> /*
> - * If we had gigantic hugepages allocated at boot time, we need
> - * to restore the 'stolen' pages to totalram_pages in order to
> - * fix confusing memory reports from free(1) and another
> - * side-effects, like CommitLimit going negative.
> + * We need to restore the 'stolen' pages to totalram_pages
> + * in order to fix confusing memory reports from free(1) and
> + * other side-effects, like CommitLimit going negative.
> */
> - if (hstate_is_gigantic(h))
> - adjust_managed_page_count(page, pages_per_huge_page(h));
> + adjust_managed_page_count(page, pages_per_huge_page(h));
> cond_resched();
> }
> }
> --
> 2.31.1
>

2021-06-23 08:02:32

by Muchun Song

Subject: Re: [External] [PATCH 2/2] hugetlb: address ref count racing in prep_compound_gigantic_page

On Tue, Jun 22, 2021 at 10:15 AM Mike Kravetz <[email protected]> wrote:
>
> In [1], Jann Horn points out a possible race between
> prep_compound_gigantic_page and __page_cache_add_speculative. The
> root cause of the possible race is prep_compound_gigantic_page
> unconditionally setting the ref count of pages to zero. It does this
> because prep_compound_gigantic_page is handed a 'group' of pages from an
> allocator and needs to convert that group of pages to a compound page.
> The ref count of each page in this 'group' is one as set by the
> allocator. However, the ref count of compound page tail pages must be
> zero.
>
> The potential race comes about when ref counted pages are returned from
> the allocator. When this happens, other mm code could also take a
> reference on the page. __page_cache_add_speculative is one such
> example. Therefore, prep_compound_gigantic_page can not just set the
> ref count of pages to zero as it does today. Doing so would lose the
> reference taken by any other code. This would lead to BUGs in code
> checking ref counts and could possibly even lead to memory corruption.

Hi Mike,

Well, it took me some time to see the race. It also made me think more
about this. See the code snippet below in gather_surplus_pages():

zeroed = put_page_testzero(page);
VM_BUG_ON_PAGE(!zeroed, page);
enqueue_huge_page(h, page);

The VM_BUG_ON_PAGE() can be triggered because of a similar
race, right? IIUC, we should also fix this.

Thanks.

>
> There are two possible ways to address this issue.
> 1) Make all allocators of gigantic groups of pages be able to return a
> properly constructed compound page.
> 2) Make prep_compound_gigantic_page be more careful when constructing a
> compound page.
> This patch takes approach 2.
>
> In prep_compound_gigantic_page, use cmpxchg to only set ref count to
> zero if it is one. If the cmpxchg fails, call synchronize_rcu() in the
> hope that the extra ref count will be dropped during an RCU grace
> period. This is not a performance critical code path and the wait
> should be acceptable. If the ref count is still inflated after the
> grace period, then undo any modifications made and return an error.
>
> Currently prep_compound_gigantic_page is type void and does not return
> errors. Modify the two callers to check for and handle error returns.
> On error, the caller must free the 'group' of pages as they can not be
> used to form a gigantic page. After freeing pages, the runtime caller
> (alloc_fresh_huge_page) will retry the allocation once. Boot time
> allocations can not be retried.
>
> The routine prep_compound_page also unconditionally sets the ref count
> of compound page tail pages to zero. However, in this case the buddy
> allocator is constructing a compound page from freshly allocated pages.
> The ref count on those freshly allocated pages is already zero, so the
> set_page_count(p, 0) is unnecessary and could lead to confusion.
> Just remove it.
>
> [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
> Fixes: 58a84aa92723 ("thp: set compound tail page _count to zero")
> Reported-by: Jann Horn <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 72 +++++++++++++++++++++++++++++++++++++++++++------
> mm/page_alloc.c | 1 -
> 2 files changed, 64 insertions(+), 9 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 50596b7d6da9..924553aa8f78 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1623,9 +1623,9 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> spin_unlock_irq(&hugetlb_lock);
> }
>
> -static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> +static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> {
> - int i;
> + int i, j;
> int nr_pages = 1 << order;
> struct page *p = page + 1;
>
> @@ -1647,11 +1647,48 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> * after get_user_pages().
> */
> __ClearPageReserved(p);
> + /*
> + * Subtle and very unlikely
> + *
> + * Gigantic 'page allocators' such as memblock or cma will
> + * return a set of pages with each page ref counted. We need
> + * to turn this set of pages into a compound page with tail
> + * page ref counts set to zero. Code such as speculative page
> + * cache adding could take a ref on a 'to be' tail page.
> + * We need to respect any increased ref count, and only set
> + * the ref count to zero if count is currently 1. If count
> + * is not 1, we call synchronize_rcu in the hope that a rcu
> + * grace period will cause ref count to drop and then retry.
> + * If count is still inflated on retry we return an error and
> + * must discard the pages.
> + */
> + if (!page_ref_freeze(p, 1)) {
> + pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
> + synchronize_rcu();
> + if (!page_ref_freeze(p, 1))
> + goto out_error;
> + }
> set_page_count(p, 0);
> set_compound_head(p, page);
> }
> atomic_set(compound_mapcount_ptr(page), -1);
> atomic_set(compound_pincount_ptr(page), 0);
> + return true;
> +
> +out_error:
> + /* undo tail page modifications made above */
> + p = page + 1;
> + for (j = 1; j < i; j++, p = mem_map_next(p, page, j)) {
> + clear_compound_head(p);
> + set_page_refcounted(p);
> + }
> + /* need to clear PG_reserved on remaining tail pages */
> + for (; j < nr_pages; j++, p = mem_map_next(p, page, j))
> + __ClearPageReserved(p);
> + set_compound_order(page, 0);
> + page[1].compound_nr = 0;
> + __ClearPageHead(page);
> + return false;
> }
>
> /*
> @@ -1771,7 +1808,9 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
> nodemask_t *node_alloc_noretry)
> {
> struct page *page;
> + bool retry = false;
>
> +retry:
> if (hstate_is_gigantic(h))
> page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
> else
> @@ -1780,8 +1819,21 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
> if (!page)
> return NULL;
>
> - if (hstate_is_gigantic(h))
> - prep_compound_gigantic_page(page, huge_page_order(h));
> + if (hstate_is_gigantic(h)) {
> + if (!prep_compound_gigantic_page(page, huge_page_order(h))) {
> + /*
> + * Rare failure to convert pages to compound page.
> + * Free pages and try again - ONCE!
> + */
> + free_gigantic_page(page, huge_page_order(h));
> + if (!retry) {
> + retry = true;
> + goto retry;
> + }
> + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
> + return NULL;
> + }
> + }
> prep_new_huge_page(h, page, page_to_nid(page));
>
> return page;
> @@ -2771,10 +2823,14 @@ static void __init gather_bootmem_prealloc(void)
>
> VM_BUG_ON(!hstate_is_gigantic(h));
> WARN_ON(page_count(page) != 1);
> - prep_compound_gigantic_page(page, huge_page_order(h));
> - WARN_ON(PageReserved(page));
> - prep_new_huge_page(h, page, page_to_nid(page));
> - put_page(page); /* free it into the hugepage allocator */
> + if (prep_compound_gigantic_page(page, huge_page_order(h))) {
> + WARN_ON(PageReserved(page));
> + prep_new_huge_page(h, page, page_to_nid(page));
> + put_page(page); /* add to the hugepage allocator */
> + } else {
> + free_gigantic_page(page, huge_page_order(h));
> + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
> + }
>
> /*
> * We need to restore the 'stolen' pages to totalram_pages
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8836e54721ae..a672d9c85118 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -749,7 +749,6 @@ void prep_compound_page(struct page *page, unsigned int order)
> __SetPageHead(page);
> for (i = 1; i < nr_pages; i++) {
> struct page *p = page + i;
> - set_page_count(p, 0);
> p->mapping = TAIL_MAPPING;
> set_compound_head(p, page);
> }
> --
> 2.31.1
>

2021-06-24 00:28:58

by Mike Kravetz

Subject: Re: [External] [PATCH 2/2] hugetlb: address ref count racing in prep_compound_gigantic_page

Cc: Naoya

On 6/23/21 1:00 AM, Muchun Song wrote:
> On Tue, Jun 22, 2021 at 10:15 AM Mike Kravetz <[email protected]> wrote:
>>
>> In [1], Jann Horn points out a possible race between
>> prep_compound_gigantic_page and __page_cache_add_speculative. The
>> root cause of the possible race is prep_compound_gigantic_page
>> unconditionally setting the ref count of pages to zero. It does this
>> because prep_compound_gigantic_page is handed a 'group' of pages from an
>> allocator and needs to convert that group of pages to a compound page.
>> The ref count of each page in this 'group' is one as set by the
>> allocator. However, the ref count of compound page tail pages must be
>> zero.
>>
>> The potential race comes about when ref counted pages are returned from
>> the allocator. When this happens, other mm code could also take a
>> reference on the page. __page_cache_add_speculative is one such
>> example. Therefore, prep_compound_gigantic_page can not just set the
>> ref count of pages to zero as it does today. Doing so would lose the
>> reference taken by any other code. This would lead to BUGs in code
>> checking ref counts and could possibly even lead to memory corruption.
>
> Hi Mike,
>
> Well, it took me some time to see the race. It also made me think more
> about this. See the code snippet below in gather_surplus_pages():
>
> zeroed = put_page_testzero(page);
> VM_BUG_ON_PAGE(!zeroed, page);
> enqueue_huge_page(h, page);
>
> The VM_BUG_ON_PAGE() can be triggered because of a similar
> race, right? IIUC, we should also fix this.

Thanks for taking a look at this Muchun.

I believe you are correct. Page allocators (even buddy) will hand back
a ref counted head page. Any other code 'could' take a reference on the
head page before the pages are made into a hugetlb page. Once the pages
become a hugetlb page (PageHuge() true), then only hugetlb-specific
code should be modifying the ref count. So, it seems the 'race window'
is from the time the pages are returned from a low level allocator until
the time the pages become a hugetlb page. Does that sound correct?

If we want to check for and handle such a race, we would need to do so
in prep_new_huge_page. After setting the destructor we would need to
check for an increased ref count (> 1). Not sure if we would need a
memory barrier or some other type of synchronization for this? This of
course means that prep_new_huge_page could return an error, and we would
need to deal with that in all callers.
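
A rough sketch of that check (illustrative only; it ignores the memory
barrier question and the error handling needed in the callers):

	/* in prep_new_huge_page(), after the hugetlb destructor is set */
	if (unlikely(page_count(page) > 1)) {
		/*
		 * Someone took a speculative reference between the low
		 * level allocator and here; report failure so the caller
		 * can free the page instead of losing that reference.
		 */
		return false;
	}
	return true;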

I went back and looked at those lines in gather_surplus_pages

zeroed = put_page_testzero(page);
VM_BUG_ON_PAGE(!zeroed, page);
enqueue_huge_page(h, page);

They were first added as part of alloc_buddy_huge_page with commit
2668db9111bb - hugetlb: correct page count for surplus huge pages.
It appears the reason for the VM_BUG_ON is that prior hugetlb code
forgot to account for the ref count provided by the buddy allocator.
The VM_BUG_ON may have been added mostly as a sanity check for hugetlb
ref count management.

I wonder if we have ever hit that VM_BUG_ON in the 13 years it has been
in the code? I know you recently spotted the potential race with memory
error handling and Naoya fixed up the memory error code.

I'm OK with modifying prep_new_huge_page, but it is going to be a bit
messy (like this patch). I wonder if there are other less intrusive
ways to address this potential issue?
--
Mike Kravetz

2021-06-24 03:39:35

by Muchun Song

Subject: Re: [External] [PATCH 2/2] hugetlb: address ref count racing in prep_compound_gigantic_page

On Thu, Jun 24, 2021 at 8:26 AM Mike Kravetz <[email protected]> wrote:
>
> Cc: Naoya
>
> On 6/23/21 1:00 AM, Muchun Song wrote:
> > On Tue, Jun 22, 2021 at 10:15 AM Mike Kravetz <[email protected]> wrote:
> >>
> >> In [1], Jann Horn points out a possible race between
> >> prep_compound_gigantic_page and __page_cache_add_speculative. The
> >> root cause of the possible race is prep_compound_gigantic_page
> >> unconditionally setting the ref count of pages to zero. It does this
> >> because prep_compound_gigantic_page is handed a 'group' of pages from an
> >> allocator and needs to convert that group of pages to a compound page.
> >> The ref count of each page in this 'group' is one as set by the
> >> allocator. However, the ref count of compound page tail pages must be
> >> zero.
> >>
> >> The potential race comes about when ref counted pages are returned from
> >> the allocator. When this happens, other mm code could also take a
> >> reference on the page. __page_cache_add_speculative is one such
> >> example. Therefore, prep_compound_gigantic_page can not just set the
> >> ref count of pages to zero as it does today. Doing so would lose the
> >> reference taken by any other code. This would lead to BUGs in code
> >> checking ref counts and could possibly even lead to memory corruption.
> >
> > Hi Mike,
> >
> > Well, it took me some time to see the race. It also made me think more
> > about this. See the code snippet below in gather_surplus_pages():
> >
> > zeroed = put_page_testzero(page);
> > VM_BUG_ON_PAGE(!zeroed, page);
> > enqueue_huge_page(h, page);
> >
> > The VM_BUG_ON_PAGE() can be triggered because of a similar
> > race, right? IIUC, we should also fix this.
>
> Thanks for taking a look at this Muchun.
>
> I believe you are correct. Page allocators (even buddy) will hand back
> a ref counted head page. Any other code 'could' take a reference on the
> head page before the pages are made into a hugetlb page. Once the pages
> become a hugetlb page (PageHuge() true), then only hugetlb-specific
> code should be modifying the ref count. So, it seems the 'race window'
> is from the time the pages are returned from a low level allocator until
> the time the pages become a hugetlb page. Does that sound correct?

I have a question about this: why should the ref count of the hugetlb page
be managed by hugetlb-specific code? What if we broke this rule? If so,
we could fix this issue easily.

CPU0:                                           CPU1:
                                                page = xas_load()
// the page is freed to the buddy
page = alloc_surplus_huge_page()
if (put_page_testzero(page))
        enqueue_huge_page(h, page)
                                                page_cache_get_speculative(page)
                                                if (unlikely(page != xas_reload(&xas)))
                                                        // this can help us free the hugetlb page to pool
                                                        put_page(page)

If someone gets a reference on the page before we put it in
gather_surplus_pages, then that user will add the hugetlb page
to the pool when it calls put_page().

>
> If we want to check for and handle such a race, we would need to do so
> in prep_new_huge_page. After setting the destructor we would need to
> check for an increased ref count (> 1). Not sure if we would need a
> memory barrier or some other type of synchronization for this? This of
> course means that prep_new_huge_page could return an error, and we would
> need to deal with that in all callers.

As described above, IIUC, we do not need to change the behavior
of prep_new_huge_page. We just need to change gather_surplus_pages
like below. These are just my thoughts; maybe I am wrong.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c3b2a8a494d6..8a1a75534543 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2160,17 +2160,11 @@ static int gather_surplus_pages(struct hstate *h, long delta)

/* Free the needed pages to the hugetlb pool */
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
- int zeroed;
-
if ((--needed) < 0)
break;
- /*
- * This page is now managed by the hugetlb allocator and has
- * no users -- drop the buddy allocator's reference.
- */
- zeroed = put_page_testzero(page);
- VM_BUG_ON_PAGE(!zeroed, page);
- enqueue_huge_page(h, page);
+
+ if (put_page_testzero(page))
+ enqueue_huge_page(h, page);
}
free:
spin_unlock_irq(&hugetlb_lock);


>
> I went back and looked at those lines in gather_surplus_pages
>
> zeroed = put_page_testzero(page);
> VM_BUG_ON_PAGE(!zeroed, page);
> enqueue_huge_page(h, page);
>
> They were first added as part of alloc_buddy_huge_page with commit
> 2668db9111bb - hugetlb: correct page count for surplus huge pages.
> It appears the reason for the VM_BUG_ON is that prior hugetlb code
> forgot to account for the ref count provided by the buddy allocator.
> The VM_BUG_ON may have been added mostly as a sanity check for hugetlb
> ref count management.
>
> I wonder if we have ever hit that VM_BUG_ON in the 13 years it has been
> in the code? I know you recently spotted the potential race with memory
> error handling and Naoya fixed up the memory error code.
>
> I'm OK with modifying prep_new_huge_page, but it is going to be a bit
> messy (like this patch). I wonder if there are other less intrusive
> ways to address this potential issue?
> --
> Mike Kravetz