2021-05-19 18:44:25

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation

From: Naoya Horiguchi <[email protected]>

When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

CPU0: CPU1:

gather_surplus_pages()
page = alloc_surplus_huge_page()
memory_failure_hugetlb()
get_hwpoison_page(page)
__get_hwpoison_page(page)
get_page_unless_zero(page)
zero = put_page_testzero(page)
VM_BUG_ON_PAGE(!zero, page)
enqueue_huge_page(h, page)
put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is wrong because there's
a time window where compound pages have non-zero refcount during
initialization. So make __get_hwpoison_page() check page status a bit
more for hugetlb pages.

Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <[email protected]>
Reported-by: Muchun Song <[email protected]>
Cc: [email protected] # 5.12+
---
include/linux/hugetlb.h | 6 ++++++
mm/hugetlb.c | 15 +++++++++++++++
mm/memory-failure.c | 8 +++++++-
3 files changed, 28 insertions(+), 1 deletion(-)

diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h
index b92f25ccef58..790ae618548d 100644
--- v5.13-rc2/include/linux/hugetlb.h
+++ v5.13-rc2_patched/include/linux/hugetlb.h
@@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
@@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
return false;
}

+static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+ return 0;
+}
+
static inline void putback_active_hugepage(struct page *page)
{
}
diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
index 95918f410c0f..f138bae3e302 100644
--- v5.13-rc2/mm/hugetlb.c
+++ v5.13-rc2_patched/mm/hugetlb.c
@@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
return ret;
}

+int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+ int ret = 0;
+
+ *hugetlb = false;
+ spin_lock_irq(&hugetlb_lock);
+ if (PageHeadHuge(page)) {
+ *hugetlb = true;
+ if (HPageFreed(page) || HPageMigratable(page))
+ ret = get_page_unless_zero(page);
+ }
+ spin_unlock_irq(&hugetlb_lock);
+ return ret;
+}
+
void putback_active_hugepage(struct page *page)
{
spin_lock_irq(&hugetlb_lock);
diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 85ad98c00fd9..353c6177e489 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -959,8 +959,14 @@ static int page_action(struct page_state *ps, struct page *p,
static int __get_hwpoison_page(struct page *page)
{
struct page *head = compound_head(page);
+ int ret = 0;
+ bool hugetlb = false;
+
+ ret = get_hwpoison_huge_page(head, &hugetlb);
+ if (hugetlb)
+ return ret;

- if (!PageHuge(head) && PageTransHuge(head)) {
+ if (PageTransHuge(head)) {
/*
* Non anonymous thp exists only in allocation/free time. We
* can't handle such a case correctly, so let's give it up.
--
2.25.1



2021-05-19 22:33:37

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation

On 5/18/21 4:12 PM, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <[email protected]>
>
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
>
> CPU0: CPU1:
>
> gather_surplus_pages()
> page = alloc_surplus_huge_page()
> memory_failure_hugetlb()
> get_hwpoison_page(page)
> __get_hwpoison_page(page)
> get_page_unless_zero(page)
> zero = put_page_testzero(page)
> VM_BUG_ON_PAGE(!zero, page)
> enqueue_huge_page(h, page)
> put_page(page)
>
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is wrong because there's
> a time window where compound pages have non-zero refcount during
> initialization. So make __get_hwpoison_page() check page status a bit
> more for hugetlb pages.
>
> Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> Signed-off-by: Naoya Horiguchi <[email protected]>
> Reported-by: Muchun Song <[email protected]>
> Cc: [email protected] # 5.12+
> ---
> include/linux/hugetlb.h | 6 ++++++
> mm/hugetlb.c | 15 +++++++++++++++
> mm/memory-failure.c | 8 +++++++-
> 3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h
> index b92f25ccef58..790ae618548d 100644
> --- v5.13-rc2/include/linux/hugetlb.h
> +++ v5.13-rc2_patched/include/linux/hugetlb.h
> @@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> long freed);
> bool isolate_huge_page(struct page *page, struct list_head *list);
> +int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
> void putback_active_hugepage(struct page *page);
> void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
> void free_huge_page(struct page *page);
> @@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
> return false;
> }
>
> +static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> +{
> + return 0;
> +}
> +
> static inline void putback_active_hugepage(struct page *page)
> {
> }
> diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
> index 95918f410c0f..f138bae3e302 100644
> --- v5.13-rc2/mm/hugetlb.c
> +++ v5.13-rc2_patched/mm/hugetlb.c
> @@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
> return ret;
> }
>
> +int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> +{
> + int ret = 0;
> +
> + *hugetlb = false;
> + spin_lock_irq(&hugetlb_lock);
> + if (PageHeadHuge(page)) {
> + *hugetlb = true;
> + if (HPageFreed(page) || HPageMigratable(page))
> + ret = get_page_unless_zero(page);
> + }
> + spin_unlock_irq(&hugetlb_lock);
> + return ret;
> +}
> +
> void putback_active_hugepage(struct page *page)
> {
> spin_lock_irq(&hugetlb_lock);
> diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
> index 85ad98c00fd9..353c6177e489 100644
> --- v5.13-rc2/mm/memory-failure.c
> +++ v5.13-rc2_patched/mm/memory-failure.c
> @@ -959,8 +959,14 @@ static int page_action(struct page_state *ps, struct page *p,
> static int __get_hwpoison_page(struct page *page)
> {
> struct page *head = compound_head(page);
> + int ret = 0;
> + bool hugetlb = false;
> +
> + ret = get_hwpoison_huge_page(head, &hugetlb);
> + if (hugetlb)
> + return ret;
>

Hello Naoya,

Thanks for your continued efforts. However, I believe the race still
exists. Unless I am mistaken, it is possible that page is in the hugetlb
allocation patch and racing with __get_hwpoison_page() as follows:

CPU0: CPU1:

gather_surplus_pages()
page = alloc_surplus_huge_page()
page = alloc_fresh_huge_page()
page = alloc_buddy_huge_page()
memory_failure_hugetlb()
get_hwpoison_page(page)
__get_hwpoison_page(page)
get_hwpoison_huge_page()
/* Note that PageHuge()
is false, so hugetlb
not set */
PageTransHuge(head) false
prep_new_huge_page(page)
/* Now PageHuge() becomes true */
get_page_unless_zero(page)

I am not sure if it is possible to handle this race in the memory error
code. I can not think of a way to avoid potentially incrementing the
ref count on a hugetlb page as it is being created. There is nothing
synchronizing this in the hugetlb code.

When Muchun first proposed a fix to the race, the idea was to catch the
race in the hugetlb code. Michal suggested that the memory error code
be more careful in modifying ref counts. I would wait a bit to see if
someone has a good idea how this can be done. We 'may' need to revisit
the approach suggested by Muchun.
--
Mike Kravetz

Subject: Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation

On Wed, May 19, 2021 at 03:32:17PM -0700, Mike Kravetz wrote:
> On 5/18/21 4:12 PM, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <[email protected]>
> >
> > When hugetlb page fault (under overcommitting situation) and
> > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> >
> > CPU0: CPU1:
> >
> > gather_surplus_pages()
> > page = alloc_surplus_huge_page()
> > memory_failure_hugetlb()
> > get_hwpoison_page(page)
> > __get_hwpoison_page(page)
> > get_page_unless_zero(page)
> > zero = put_page_testzero(page)
> > VM_BUG_ON_PAGE(!zero, page)
> > enqueue_huge_page(h, page)
> > put_page(page)
> >
> > __get_hwpoison_page() only checks the page refcount before taking an
> > additional one for memory error handling, which is wrong because there's
> > a time window where compound pages have non-zero refcount during
> > initialization. So make __get_hwpoison_page() check page status a bit
> > more for hugetlb pages.
> >
> > Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > Reported-by: Muchun Song <[email protected]>
> > Cc: [email protected] # 5.12+
> > ---
> > include/linux/hugetlb.h | 6 ++++++
> > mm/hugetlb.c | 15 +++++++++++++++
> > mm/memory-failure.c | 8 +++++++-
> > 3 files changed, 28 insertions(+), 1 deletion(-)
> >
> > diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h
> > index b92f25ccef58..790ae618548d 100644
> > --- v5.13-rc2/include/linux/hugetlb.h
> > +++ v5.13-rc2_patched/include/linux/hugetlb.h
> > @@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> > long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> > long freed);
> > bool isolate_huge_page(struct page *page, struct list_head *list);
> > +int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
> > void putback_active_hugepage(struct page *page);
> > void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
> > void free_huge_page(struct page *page);
> > @@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
> > return false;
> > }
> >
> > +static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> > +{
> > + return 0;
> > +}
> > +
> > static inline void putback_active_hugepage(struct page *page)
> > {
> > }
> > diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
> > index 95918f410c0f..f138bae3e302 100644
> > --- v5.13-rc2/mm/hugetlb.c
> > +++ v5.13-rc2_patched/mm/hugetlb.c
> > @@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
> > return ret;
> > }
> >
> > +int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> > +{
> > + int ret = 0;
> > +
> > + *hugetlb = false;
> > + spin_lock_irq(&hugetlb_lock);
> > + if (PageHeadHuge(page)) {
> > + *hugetlb = true;
> > + if (HPageFreed(page) || HPageMigratable(page))
> > + ret = get_page_unless_zero(page);
> > + }
> > + spin_unlock_irq(&hugetlb_lock);
> > + return ret;
> > +}
> > +
> > void putback_active_hugepage(struct page *page)
> > {
> > spin_lock_irq(&hugetlb_lock);
> > diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
> > index 85ad98c00fd9..353c6177e489 100644
> > --- v5.13-rc2/mm/memory-failure.c
> > +++ v5.13-rc2_patched/mm/memory-failure.c
> > @@ -959,8 +959,14 @@ static int page_action(struct page_state *ps, struct page *p,
> > static int __get_hwpoison_page(struct page *page)
> > {
> > struct page *head = compound_head(page);
> > + int ret = 0;
> > + bool hugetlb = false;
> > +
> > + ret = get_hwpoison_huge_page(head, &hugetlb);
> > + if (hugetlb)
> > + return ret;
> >
>
> Hello Naoya,
>
> Thanks for your continued efforts. However, I believe the race still
> exists. Unless I am mistaken, it is possible that page is in the hugetlb
> allocation patch and racing with __get_hwpoison_page() as follows:

Hi Mike, thank you for the investigation.

>
> CPU0: CPU1:
>
> gather_surplus_pages()
> page = alloc_surplus_huge_page()
> page = alloc_fresh_huge_page()
> page = alloc_buddy_huge_page()
> memory_failure_hugetlb()
> get_hwpoison_page(page)
> __get_hwpoison_page(page)
> get_hwpoison_huge_page()
> /* Note that PageHuge()
> is false, so hugetlb
> not set */
> PageTransHuge(head) false

It seems that PageTransHuge returns true in this race condition because it
simply checks PG_head flag. But anyway, get_page_unless_zero() is called for
a hugetlb in this race, which is problematic.

> prep_new_huge_page(page)
> /* Now PageHuge() becomes true */
> get_page_unless_zero(page)
>
> I am not sure if it is possible to handle this race in the memory error
> code.

I think that __get_hwpoison_page() might not properly call get_page_unless_zero().
Looking at other callers of this function, most(*) of them are calling it in
the context where a given page is pinned and in-use, and get_page_unless_zero()
is used to detect the race with page freeing. In the current version,
__get_hwpoison_page() calls get_page_unless_zero() without caring for such an
assumption, which might be the root cause of the race with hugetlb page allocation.

# (*) It seems to me that do_migrate_range() might have the same issue
# around get_page_unless_zero().

So I think of inserting the check to comply with the assumption of
get_hwpoison_huge_page() like below:

ret = get_hwpoison_huge_page(head, &hugetlb);
if (hugetlb)
return ret;

if (!PageLRU(head) && !__PageMovable(head))
return 0;

if (PageTransHuge(head)) {
...
}

if (get_page_unless_zero(head)) {
...
}

return 0;

The newly added checks should work to prevent the above race, then get_any_page()
should retry and grab the page properly as a stable hugetlb page.

> I can not think of a way to avoid potentially incrementing the
> ref count on a hugetlb page as it is being created. There is nothing
> synchronizing this in the hugetlb code.
>
> When Muchun first proposed a fix to the race, the idea was to catch the
> race in the hugetlb code. Michal suggested that the memory error code
> be more careful in modifying ref counts. I would wait a bit to see if
> someone has a good idea how this can be done. We 'may' need to revisit
> the approach suggested by Muchun.

If the above approach is still broken, let's revisit Muchun's approach.

Thanks,
Naoya Horiguchi

2021-05-25 08:25:21

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation

On Thu, May 20, 2021 at 07:17:17AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> So I think of inserting the check to comply with the assumption of
> get_hwpoison_huge_page() like below:
>
> ret = get_hwpoison_huge_page(head, &hugetlb);
> if (hugetlb)
> return ret;
>
> if (!PageLRU(head) && !__PageMovable(head))
> return 0;
>
> if (PageTransHuge(head)) {
> ...
> }
>
> if (get_page_unless_zero(head)) {
> ...
> }
>
> return 0;

Hi Naoya,

would you mind posting a complete draft of what it would look like?
I am having a hard time picturing it.


Thanks

--
Oscar Salvador
SUSE L3

Subject: Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation

On Tue, May 25, 2021 at 09:36:05AM +0200, Oscar Salvador wrote:
> On Thu, May 20, 2021 at 07:17:17AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > So I think of inserting the check to comply with the assumption of
> > get_hwpoison_huge_page() like below:
> >
> > ret = get_hwpoison_huge_page(head, &hugetlb);
> > if (hugetlb)
> > return ret;
> >
> > if (!PageLRU(head) && !__PageMovable(head))
> > return 0;
> >
> > if (PageTransHuge(head)) {
> > ...
> > }
> >
> > if (get_page_unless_zero(head)) {
> > ...
> > }
> >
> > return 0;
>
> Hi Naoya,
>
> would you mind posting a complete draft of what it would look like?
> I am having a hard time picturing it.

OK, here's the current draft.

Thanks,
Naoya Horiguchi

---
From: Naoya Horiguchi <[email protected]>
Date: Tue, 18 May 2021 23:49:18 +0900
Subject: [PATCH] mm,hwpoison: fix race with hugetlb page allocation

When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

CPU0: CPU1:

gather_surplus_pages()
page = alloc_surplus_huge_page()
memory_failure_hugetlb()
get_hwpoison_page(page)
__get_hwpoison_page(page)
get_page_unless_zero(page)
zero = put_page_testzero(page)
VM_BUG_ON_PAGE(!zero, page)
enqueue_huge_page(h, page)
put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is wrong because there's
a time window where compound pages have non-zero refcount during
initialization. So make __get_hwpoison_page() check page status a bit
more for hugetlb pages.

Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <[email protected]>
Reported-by: Muchun Song <[email protected]>
Cc: [email protected] # 5.12+
---
include/linux/hugetlb.h | 6 ++++++
mm/hugetlb.c | 15 +++++++++++++++
mm/memory-failure.c | 11 ++++++++++-
3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b92f25ccef58..790ae618548d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
@@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
return false;
}

+static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+ return 0;
+}
+
static inline void putback_active_hugepage(struct page *page)
{
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95918f410c0f..f138bae3e302 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
return ret;
}

+int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+ int ret = 0;
+
+ *hugetlb = false;
+ spin_lock_irq(&hugetlb_lock);
+ if (PageHeadHuge(page)) {
+ *hugetlb = true;
+ if (HPageFreed(page) || HPageMigratable(page))
+ ret = get_page_unless_zero(page);
+ }
+ spin_unlock_irq(&hugetlb_lock);
+ return ret;
+}
+
void putback_active_hugepage(struct page *page)
{
spin_lock_irq(&hugetlb_lock);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 85ad98c00fd9..4c264c4090d7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -959,8 +959,17 @@ static int page_action(struct page_state *ps, struct page *p,
static int __get_hwpoison_page(struct page *page)
{
struct page *head = compound_head(page);
+ int ret = 0;
+ bool hugetlb = false;
+
+ ret = get_hwpoison_huge_page(head, &hugetlb);
+ if (hugetlb)
+ return ret;
+
+ if (!PageLRU(head) && !__PageMovable(head))
+ return 0;

- if (!PageHuge(head) && PageTransHuge(head)) {
+ if (PageTransHuge(head)) {
/*
* Non anonymous thp exists only in allocation/free time. We
* can't handle such a case correctly, so let's give it up.
--
2.25.1

2021-05-25 09:12:43

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation

On Tue, May 25, 2021 at 08:07:07AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> OK, here's the current draft.
>
> Thanks,
> Naoya Horiguchi
>
> ---
> From: Naoya Horiguchi <[email protected]>
> Date: Tue, 18 May 2021 23:49:18 +0900
> Subject: [PATCH] mm,hwpoison: fix race with hugetlb page allocation
>
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
>
> CPU0: CPU1:
>
> gather_surplus_pages()
> page = alloc_surplus_huge_page()
> memory_failure_hugetlb()
> get_hwpoison_page(page)
> __get_hwpoison_page(page)
> get_page_unless_zero(page)
> zero = put_page_testzero(page)
> VM_BUG_ON_PAGE(!zero, page)
> enqueue_huge_page(h, page)
> put_page(page)
>
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is wrong because there's
> a time window where compound pages have non-zero refcount during
> initialization. So make __get_hwpoison_page() check page status a bit
> more for hugetlb pages.

I think that this changelog would benefit from some information about the new
!PageLRU && !__PageMovable check.

> static int __get_hwpoison_page(struct page *page)
> {
> struct page *head = compound_head(page);
> + int ret = 0;
> + bool hugetlb = false;
> +
> + ret = get_hwpoison_huge_page(head, &hugetlb);
> + if (hugetlb)
> + return ret;
> +
> + if (!PageLRU(head) && !__PageMovable(head))
> + return 0;

This definitely needs a comment hinting the reader why we need to check for this.
AFAICS, this is to close the race where a page is about to be a hugetlb page soon,
so we do not go for get_page_unless_zero(), right?

From soft_offline_page's POV I __guess__ that's fine because we only deal with
pages we know about.
But what about memory_failure()? I think memory_failure() is less picky about that,
so it is okay to not take a refcount on that case?

--
Oscar Salvador
SUSE L3

Subject: Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation

On Tue, May 25, 2021 at 11:09:18AM +0200, Oscar Salvador wrote:
> On Tue, May 25, 2021 at 08:07:07AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > OK, here's the current draft.
> >
> > Thanks,
> > Naoya Horiguchi
> >
> > ---
> > From: Naoya Horiguchi <[email protected]>
> > Date: Tue, 18 May 2021 23:49:18 +0900
> > Subject: [PATCH] mm,hwpoison: fix race with hugetlb page allocation
> >
> > When hugetlb page fault (under overcommitting situation) and
> > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> >
> > CPU0: CPU1:
> >
> > gather_surplus_pages()
> > page = alloc_surplus_huge_page()
> > memory_failure_hugetlb()
> > get_hwpoison_page(page)
> > __get_hwpoison_page(page)
> > get_page_unless_zero(page)
> > zero = put_page_testzero(page)
> > VM_BUG_ON_PAGE(!zero, page)
> > enqueue_huge_page(h, page)
> > put_page(page)
> >
> > __get_hwpoison_page() only checks the page refcount before taking an
> > additional one for memory error handling, which is wrong because there's
> > a time window where compound pages have non-zero refcount during
> > initialization. So make __get_hwpoison_page() check page status a bit
> > more for hugetlb pages.
>
> I think that this changelog would benefit from some information about the new
> !PageLRU && !__PageMovable check.

OK, I'll update about this check.

> > static int __get_hwpoison_page(struct page *page)
> > {
> > struct page *head = compound_head(page);
> > + int ret = 0;
> > + bool hugetlb = false;
> > +
> > + ret = get_hwpoison_huge_page(head, &hugetlb);
> > + if (hugetlb)
> > + return ret;
> > +
> > + if (!PageLRU(head) && !__PageMovable(head))
> > + return 0;
>
> This definitely needs a comment hinting the reader why we need to check for this.
> AFAICS, this is to close the race where a page is about to be a hugetlb page soon,
> so we do not go for get_page_unless_zero(), right?

Right, I can't find any other reliable check to show that a page is not
under initialization of hugetlb pages. I'll state why this check is needed.

>
> From soft_offline_page's POV I __guess__ that's fine because we only deal with
> pages we know about.
> But what about memory_failure()? I think memory_failure() is less picky about that,
> so it is okay to not take a refcount on that case?

Actually the coverage of page types for which memory error can be handled
are the same between memory_failure() and soft_offline_page(). One
memory_failure-specific role is to print out logs about error events even if
they can't be successfully handled, which could be affected by this change,
so we might need to adjust how memory_failure() uses the return value of
get_hwpoison_page().

Thanks,
Naoya Horiguchi