2018-07-17 05:33:38

by Naoya Horiguchi

Subject: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

I've updated the patchset based on the feedback:

- updated comments (from Andrew),
- moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
which is necessary to check the return code of set_hwpoison_free_buddy_page(),
- fixed a build error, reported by the lkp bot, that appears when only 1/2 is applied.

> mm/memory-failure.c: In function 'soft_offline_huge_page':
> >> mm/memory-failure.c:1610:8: error: implicit declaration of function
> 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
> [-Werror=implicit-function-declaration]
> if (set_hwpoison_free_buddy_page(page))
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> is_free_buddy_page
> cc1: some warnings being treated as errors

set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
will fix this.

v1: https://lkml.org/lkml/2018/7/12/968

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (2):
mm: fix race on soft-offlining free huge pages
mm: soft-offline: close the race against page allocation

include/linux/page-flags.h | 5 +++++
include/linux/swapops.h | 10 ---------
mm/hugetlb.c | 11 +++++-----
mm/memory-failure.c | 53 ++++++++++++++++++++++++++++++++++++++--------
mm/migrate.c | 11 ----------
mm/page_alloc.c | 30 ++++++++++++++++++++++++++
6 files changed, 84 insertions(+), 36 deletions(-)


2018-07-17 05:33:39

by Naoya Horiguchi

Subject: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

There's a race condition between soft offline and hugetlb_fault which
causes unexpected process killing and/or hugetlb allocation failure.

The process killing is caused by the following flow:

CPU 0                  CPU 1                  CPU 2

soft offline
  get_any_page
  // find the hugetlb is free
                       mmap a hugetlb file
                       page fault
                       ...
                         hugetlb_fault
                           hugetlb_no_page
                             alloc_huge_page
                             // succeed
soft_offline_free_page
// set hwpoison flag
                                              mmap the hugetlb file
                                              page fault
                                              ...
                                                hugetlb_fault
                                                  hugetlb_no_page
                                                    find_lock_page
                                                    return VM_FAULT_HWPOISON
                                                  mm_fault_error
                                                    do_sigbus
                                                    // kill the process


The hugetlb allocation failure comes from the following flow:

CPU 0                  CPU 1

                       mmap a hugetlb file
                       // reserve all free page but don't fault-in
soft offline
  get_any_page
  // find the hugetlb is free
soft_offline_free_page
// set hwpoison flag
  dissolve_free_huge_page
  // fail because all free hugepages are reserved
                       page fault
                       ...
                         hugetlb_fault
                           hugetlb_no_page
                             alloc_huge_page
                             ...
                               dequeue_huge_page_node_exact
                               // ignore hwpoisoned hugepage
                               // and finally fail due to no-mem

The root cause of this is that the current soft-offline code is written
on the assumption that the PageHWPoison flag should be set first to
avoid accessing the corrupted data. This makes sense for memory_failure()
or hard offline, but not for soft offline, because soft offline is
about corrected (not uncorrected) errors and is safe from data loss.
This patch changes the soft offline semantics so that the PageHWPoison
flag is set only after containment of the error page completes successfully.

Reported-by: Xishi Qiu <[email protected]>
Suggested-by: Xishi Qiu <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
---
changelog v1->v2:
- don't use set_hwpoison_free_buddy_page() (not defined yet)
- updated comment in soft_offline_huge_page()
---
mm/hugetlb.c | 11 +++++------
mm/memory-failure.c | 24 ++++++++++++++++++------
mm/migrate.c | 2 --
3 files changed, 23 insertions(+), 14 deletions(-)

diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
index 430be42..937c142 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
@@ -1479,22 +1479,20 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
/*
* Dissolve a given free hugepage into free buddy pages. This function does
* nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
- * number of free hugepages would be reduced below the number of reserved
- * hugepages.
+ * dissolution fails because a given page is not a free hugepage, or because
+ * free hugepages are fully reserved.
*/
int dissolve_free_huge_page(struct page *page)
{
- int rc = 0;
+ int rc = -EBUSY;

spin_lock(&hugetlb_lock);
if (PageHuge(page) && !page_count(page)) {
struct page *head = compound_head(page);
struct hstate *h = page_hstate(head);
int nid = page_to_nid(head);
- if (h->free_huge_pages - h->resv_huge_pages == 0) {
- rc = -EBUSY;
+ if (h->free_huge_pages - h->resv_huge_pages == 0)
goto out;
- }
/*
* Move PageHWPoison flag from head page to the raw error page,
* which makes any subpages rather than the error page reusable.
@@ -1508,6 +1506,7 @@ int dissolve_free_huge_page(struct page *page)
h->free_huge_pages_node[nid]--;
h->max_huge_pages--;
update_and_free_page(h, head);
+ rc = 0;
}
out:
spin_unlock(&hugetlb_lock);
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
index 9d142b9..9b77f85 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
@@ -1598,8 +1598,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
if (ret > 0)
ret = -EIO;
} else {
- if (PageHuge(page))
- dissolve_free_huge_page(page);
+ /*
+ * We set PG_hwpoison only when the migration source hugepage
+ * was successfully dissolved, because otherwise hwpoisoned
+ * hugepage remains on free hugepage list. The allocator ignores
+ * such a hwpoisoned page so it's never allocated, but it could
+ * kill a process because of no-memory rather than hwpoison.
+ * Soft-offline never impacts the userspace, so this is
+ * undesired.
+ */
+ ret = dissolve_free_huge_page(page);
+ if (!ret) {
+ if (!TestSetPageHWPoison(page))
+ num_poisoned_pages_inc();
+ }
}
return ret;
}
@@ -1715,13 +1727,13 @@ static int soft_offline_in_use_page(struct page *page, int flags)

static void soft_offline_free_page(struct page *page)
{
+ int rc = 0;
struct page *head = compound_head(page);

- if (!TestSetPageHWPoison(head)) {
+ if (PageHuge(head))
+ rc = dissolve_free_huge_page(page);
+ if (!rc && !TestSetPageHWPoison(page))
num_poisoned_pages_inc();
- if (PageHuge(head))
- dissolve_free_huge_page(page);
- }
}

/**
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
index 198af42..3ae213b 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
@@ -1318,8 +1318,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
out:
if (rc != -EAGAIN)
putback_active_hugepage(hpage);
- if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage))
- num_poisoned_pages_inc();

/*
* If migration was not successful and there's a freeing callback, use
--
2.7.0


2018-07-17 05:33:43

by Naoya Horiguchi

Subject: [PATCH v2 2/2] mm: soft-offline: close the race against page allocation

A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
allocate a page that was just freed in the course of soft-offline.
This is undesirable because soft-offline (which is about corrected errors)
is less aggressive than hard-offline (which is about uncorrected errors),
and we can make soft-offline fail and keep using the page for a good
reason like "the system is busy."

The two main changes of this patch are:

- setting the migrate type of the target page to MIGRATE_ISOLATE. As done
in free_unref_page_commit(), this makes the kernel bypass the pcplist when
freeing the page, so we can assume that the page is on the free list just
after put_page() returns,

- setting PG_hwpoison on a free page under zone->lock, which protects the
freelists. This allows us to avoid setting PG_hwpoison on a page that is
about to be allocated.

Reported-by: Xishi Qiu <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
---
changelog v1->v2:
- updated comment on set_hwpoison_free_buddy_page(),
- moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to
mm/memory-failure.c, which is necessary to check the return code of
set_hwpoison_free_buddy_page().
---
include/linux/page-flags.h | 5 +++++
include/linux/swapops.h | 10 ----------
mm/memory-failure.c | 35 +++++++++++++++++++++++++++++------
mm/migrate.c | 9 ---------
mm/page_alloc.c | 30 ++++++++++++++++++++++++++++++
5 files changed, 64 insertions(+), 25 deletions(-)

diff --git v4.18-rc4-mmotm-2018-07-10-16-50/include/linux/page-flags.h v4.18-rc4-mmotm-2018-07-10-16-50_patched/include/linux/page-flags.h
index 901943e..74bee8c 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/include/linux/page-flags.h
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/include/linux/page-flags.h
@@ -369,8 +369,13 @@ PAGEFLAG_FALSE(Uncached)
PAGEFLAG(HWPoison, hwpoison, PF_ANY)
TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
#define __PG_HWPOISON (1UL << PG_hwpoison)
+extern bool set_hwpoison_free_buddy_page(struct page *page);
#else
PAGEFLAG_FALSE(HWPoison)
+static inline bool set_hwpoison_free_buddy_page(struct page *page)
+{
+ return false;
+}
#define __PG_HWPOISON 0
#endif

diff --git v4.18-rc4-mmotm-2018-07-10-16-50/include/linux/swapops.h v4.18-rc4-mmotm-2018-07-10-16-50_patched/include/linux/swapops.h
index 9c0eb4d..fe8e08b 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/include/linux/swapops.h
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/include/linux/swapops.h
@@ -335,11 +335,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
return swp_type(entry) == SWP_HWPOISON;
}

-static inline bool test_set_page_hwpoison(struct page *page)
-{
- return TestSetPageHWPoison(page);
-}
-
static inline void num_poisoned_pages_inc(void)
{
atomic_long_inc(&num_poisoned_pages);
@@ -362,11 +357,6 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
return 0;
}

-static inline bool test_set_page_hwpoison(struct page *page)
-{
- return false;
-}
-
static inline void num_poisoned_pages_inc(void)
{
}
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
index 9b77f85..936d0e7 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
@@ -57,6 +57,7 @@
#include <linux/mm_inline.h>
#include <linux/kfifo.h>
#include <linux/ratelimit.h>
+#include <linux/page-isolation.h>
#include "internal.h"
#include "ras/ras_event.h"

@@ -1609,8 +1610,10 @@ static int soft_offline_huge_page(struct page *page, int flags)
*/
ret = dissolve_free_huge_page(page);
if (!ret) {
- if (!TestSetPageHWPoison(page))
+ if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
+ else
+ ret = -EBUSY;
}
}
return ret;
@@ -1688,6 +1691,11 @@ static int __soft_offline_page(struct page *page, int flags)
pfn, ret, page->flags, &page->flags);
if (ret > 0)
ret = -EIO;
+ } else {
+ if (set_hwpoison_free_buddy_page(page))
+ num_poisoned_pages_inc();
+ else
+ ret = -EBUSY;
}
} else {
pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n",
@@ -1699,6 +1707,7 @@ static int __soft_offline_page(struct page *page, int flags)
static int soft_offline_in_use_page(struct page *page, int flags)
{
int ret;
+ int mt;
struct page *hpage = compound_head(page);

if (!PageHuge(page) && PageTransHuge(hpage)) {
@@ -1717,23 +1726,37 @@ static int soft_offline_in_use_page(struct page *page, int flags)
put_hwpoison_page(hpage);
}

+ /*
+ * Setting MIGRATE_ISOLATE here ensures that the page will be linked
+ * to free list immediately (not via pcplist) when released after
+ * successful page migration. Otherwise we can't guarantee that the
+ * page is really free after put_page() returns, so
+ * set_hwpoison_free_buddy_page() is highly likely to fail.
+ */
+ mt = get_pageblock_migratetype(page);
+ set_pageblock_migratetype(page, MIGRATE_ISOLATE);
if (PageHuge(page))
ret = soft_offline_huge_page(page, flags);
else
ret = __soft_offline_page(page, flags);
-
+ set_pageblock_migratetype(page, mt);
return ret;
}

-static void soft_offline_free_page(struct page *page)
+static int soft_offline_free_page(struct page *page)
{
int rc = 0;
struct page *head = compound_head(page);

if (PageHuge(head))
rc = dissolve_free_huge_page(page);
- if (!rc && !TestSetPageHWPoison(page))
- num_poisoned_pages_inc();
+ if (!rc) {
+ if (set_hwpoison_free_buddy_page(page))
+ num_poisoned_pages_inc();
+ else
+ rc = -EBUSY;
+ }
+ return rc;
}

/**
@@ -1777,7 +1800,7 @@ int soft_offline_page(struct page *page, int flags)
if (ret > 0)
ret = soft_offline_in_use_page(page, flags);
else if (ret == 0)
- soft_offline_free_page(page);
+ ret = soft_offline_free_page(page);

return ret;
}
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
index 3ae213b..4fd0fe0 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
@@ -1193,15 +1193,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
*/
if (rc == MIGRATEPAGE_SUCCESS) {
put_page(page);
- if (reason == MR_MEMORY_FAILURE) {
- /*
- * Set PG_HWPoison on just freed page
- * intentionally. Although it's rather weird,
- * it's how HWPoison flag works at the moment.
- */
- if (!test_set_page_hwpoison(page))
- num_poisoned_pages_inc();
- }
} else {
if (rc != -EAGAIN) {
if (likely(!__PageMovable(page))) {
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/page_alloc.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/page_alloc.c
index 607deff..4058b7e 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/page_alloc.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/page_alloc.c
@@ -8027,3 +8027,33 @@ bool is_free_buddy_page(struct page *page)

return order < MAX_ORDER;
}
+
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
+ * test is performed under the zone lock to prevent a race against page
+ * allocation.
+ */
+bool set_hwpoison_free_buddy_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long flags;
+ unsigned int order;
+ bool hwpoisoned = false;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < MAX_ORDER; order++) {
+ struct page *page_head = page - (pfn & ((1 << order) - 1));
+
+ if (PageBuddy(page_head) && page_order(page_head) >= order) {
+ if (!TestSetPageHWPoison(page))
+ hwpoisoned = true;
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ return hwpoisoned;
+}
+#endif
--
2.7.0


2018-07-17 14:29:18

by Michal Hocko

Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> There's a race condition between soft offline and hugetlb_fault which
> causes unexpected process killing and/or hugetlb allocation failure.
>
> The process killing is caused by the following flow:
>
> CPU 0 CPU 1 CPU 2
>
> soft offline
> get_any_page
> // find the hugetlb is free
> mmap a hugetlb file
> page fault
> ...
> hugetlb_fault
> hugetlb_no_page
> alloc_huge_page
> // succeed
> soft_offline_free_page
> // set hwpoison flag
> mmap the hugetlb file
> page fault
> ...
> hugetlb_fault
> hugetlb_no_page
> find_lock_page
> return VM_FAULT_HWPOISON
> mm_fault_error
> do_sigbus
> // kill the process
>
>
> The hugetlb allocation failure comes from the following flow:
>
> CPU 0 CPU 1
>
> mmap a hugetlb file
> // reserve all free page but don't fault-in
> soft offline
> get_any_page
> // find the hugetlb is free
> soft_offline_free_page
> // set hwpoison flag
> dissolve_free_huge_page
> // fail because all free hugepages are reserved
> page fault
> ...
> hugetlb_fault
> hugetlb_no_page
> alloc_huge_page
> ...
> dequeue_huge_page_node_exact
> // ignore hwpoisoned hugepage
> // and finally fail due to no-mem
>
> The root cause of this is that current soft-offline code is written
> based on an assumption that PageHWPoison flag should beset at first to
> avoid accessing the corrupted data. This makes sense for memory_failure()
> or hard offline, but does not for soft offline because soft offline is
> about corrected (not uncorrected) error and is safe from data lost.
> This patch changes soft offline semantics where it sets PageHWPoison flag
> only after containment of the error page completes successfully.

Could you please expand on the workflow here? The code is really
hard to grasp. I must be missing something because this shouldn't
really be complicated. Either the page is in the free pool and you just
remove it from the allocator (with hugetlb asking for a new hugetlb page
to guarantee reserves), or it is used and you just migrate the content to
a new page (again with the hugetlb reserves consideration). Why should
PageHWPoison flag ordering have any relevance?

Do I get it right that the only difference between the hard and soft
offlining is that hugetlb reserves might break for the former while not
for the latter and that the failed migration kills all owners for the
former while not for the latter?

> Reported-by: Xishi Qiu <[email protected]>
> Suggested-by: Xishi Qiu <[email protected]>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> changelog v1->v2:
> - don't use set_hwpoison_free_buddy_page() (not defined yet)
> - updated comment in soft_offline_huge_page()
> ---
> mm/hugetlb.c | 11 +++++------
> mm/memory-failure.c | 24 ++++++++++++++++++------
> mm/migrate.c | 2 --
> 3 files changed, 23 insertions(+), 14 deletions(-)
>
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> index 430be42..937c142 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> @@ -1479,22 +1479,20 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> /*
> * Dissolve a given free hugepage into free buddy pages. This function does
> * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
> - * number of free hugepages would be reduced below the number of reserved
> - * hugepages.
> + * dissolution fails because a give page is not a free hugepage, or because
> + * free hugepages are fully reserved.
> */
> int dissolve_free_huge_page(struct page *page)
> {
> - int rc = 0;
> + int rc = -EBUSY;
>
> spin_lock(&hugetlb_lock);
> if (PageHuge(page) && !page_count(page)) {
> struct page *head = compound_head(page);
> struct hstate *h = page_hstate(head);
> int nid = page_to_nid(head);
> - if (h->free_huge_pages - h->resv_huge_pages == 0) {
> - rc = -EBUSY;
> + if (h->free_huge_pages - h->resv_huge_pages == 0)
> goto out;
> - }
> /*
> * Move PageHWPoison flag from head page to the raw error page,
> * which makes any subpages rather than the error page reusable.
> @@ -1508,6 +1506,7 @@ int dissolve_free_huge_page(struct page *page)
> h->free_huge_pages_node[nid]--;
> h->max_huge_pages--;
> update_and_free_page(h, head);
> + rc = 0;
> }
> out:
> spin_unlock(&hugetlb_lock);
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> index 9d142b9..9b77f85 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> @@ -1598,8 +1598,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> if (ret > 0)
> ret = -EIO;
> } else {
> - if (PageHuge(page))
> - dissolve_free_huge_page(page);
> + /*
> + * We set PG_hwpoison only when the migration source hugepage
> + * was successfully dissolved, because otherwise hwpoisoned
> + * hugepage remains on free hugepage list. The allocator ignores
> + * such a hwpoisoned page so it's never allocated, but it could
> + * kill a process because of no-memory rather than hwpoison.
> + * Soft-offline never impacts the userspace, so this is
> + * undesired.
> + */
> + ret = dissolve_free_huge_page(page);
> + if (!ret) {
> + if (!TestSetPageHWPoison(page))
> + num_poisoned_pages_inc();
> + }
> }
> return ret;
> }
> @@ -1715,13 +1727,13 @@ static int soft_offline_in_use_page(struct page *page, int flags)
>
> static void soft_offline_free_page(struct page *page)
> {
> + int rc = 0;
> struct page *head = compound_head(page);
>
> - if (!TestSetPageHWPoison(head)) {
> + if (PageHuge(head))
> + rc = dissolve_free_huge_page(page);
> + if (!rc && !TestSetPageHWPoison(page))
> num_poisoned_pages_inc();
> - if (PageHuge(head))
> - dissolve_free_huge_page(page);
> - }
> }
>
> /**
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> index 198af42..3ae213b 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> @@ -1318,8 +1318,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> out:
> if (rc != -EAGAIN)
> putback_active_hugepage(hpage);
> - if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage))
> - num_poisoned_pages_inc();
>
> /*
> * If migration was not successful and there's a freeing callback, use
> --
> 2.7.0

--
Michal Hocko
SUSE Labs

2018-07-17 20:12:29

by Mike Kravetz

Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On 07/17/2018 07:27 AM, Michal Hocko wrote:
> On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
>> There's a race condition between soft offline and hugetlb_fault which
>> causes unexpected process killing and/or hugetlb allocation failure.
>>
>> The process killing is caused by the following flow:
>>
>> CPU 0 CPU 1 CPU 2
>>
>> soft offline
>> get_any_page
>> // find the hugetlb is free
>> mmap a hugetlb file
>> page fault
>> ...
>> hugetlb_fault
>> hugetlb_no_page
>> alloc_huge_page
>> // succeed
>> soft_offline_free_page
>> // set hwpoison flag
>> mmap the hugetlb file
>> page fault
>> ...
>> hugetlb_fault
>> hugetlb_no_page
>> find_lock_page
>> return VM_FAULT_HWPOISON
>> mm_fault_error
>> do_sigbus
>> // kill the process
>>
>>
>> The hugetlb allocation failure comes from the following flow:
>>
>> CPU 0 CPU 1
>>
>> mmap a hugetlb file
>> // reserve all free page but don't fault-in
>> soft offline
>> get_any_page
>> // find the hugetlb is free
>> soft_offline_free_page
>> // set hwpoison flag
>> dissolve_free_huge_page
>> // fail because all free hugepages are reserved
>> page fault
>> ...
>> hugetlb_fault
>> hugetlb_no_page
>> alloc_huge_page
>> ...
>> dequeue_huge_page_node_exact
>> // ignore hwpoisoned hugepage
>> // and finally fail due to no-mem
>>
>> The root cause of this is that current soft-offline code is written
>> based on an assumption that PageHWPoison flag should beset at first to
>> avoid accessing the corrupted data. This makes sense for memory_failure()
>> or hard offline, but does not for soft offline because soft offline is
>> about corrected (not uncorrected) error and is safe from data lost.
>> This patch changes soft offline semantics where it sets PageHWPoison flag
>> only after containment of the error page completes successfully.
>
> Could you please expand on the worklow here please? The code is really
> hard to grasp. I must be missing something because the thing shouldn't
> be really complicated. Either the page is in the free pool and you just
> remove it from the allocator (with hugetlb asking for a new hugeltb page
> to guaratee reserves) or it is used and you just migrate the content to
> a new page (again with the hugetlb reserves consideration). Why should
> PageHWPoison flag ordering make any relevance?

My understanding may not be correct, but just looking at the current code
for soft_offline_free_page helps me understand:

static void soft_offline_free_page(struct page *page)
{
struct page *head = compound_head(page);

if (!TestSetPageHWPoison(head)) {
num_poisoned_pages_inc();
if (PageHuge(head))
dissolve_free_huge_page(page);
}
}

The HWPoison flag is set before even checking to determine if the huge
page can be dissolved. So, someone could attempt to pull the page
off the free list (if free) or fault/map it (if already associated with
a file) which leads to the failures described above. The patches ensure
that we only set HWPoison after successfully dissolving the page. At least
that is how I understand it.

It seems that soft_offline_free_page can be called for in-use pages.
Certainly, that is the case in the first workflow above. With the
suggested changes, I think this is OK for huge pages. However, it seems
that setting HWPoison on an in-use non-huge page could cause issues?

While looking at the code, I noticed this comment in __get_any_page()
/*
* When the target page is a free hugepage, just remove it
* from free hugepage list.
*/
Did that apply to some code that was removed? It does not seem to make
any sense in that routine.
--
Mike Kravetz

2018-07-18 01:00:56

by Naoya Horiguchi

Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > There's a race condition between soft offline and hugetlb_fault which
> > causes unexpected process killing and/or hugetlb allocation failure.
> >
> > The process killing is caused by the following flow:
> >
> > CPU 0 CPU 1 CPU 2
> >
> > soft offline
> > get_any_page
> > // find the hugetlb is free
> > mmap a hugetlb file
> > page fault
> > ...
> > hugetlb_fault
> > hugetlb_no_page
> > alloc_huge_page
> > // succeed
> > soft_offline_free_page
> > // set hwpoison flag
> > mmap the hugetlb file
> > page fault
> > ...
> > hugetlb_fault
> > hugetlb_no_page
> > find_lock_page
> > return VM_FAULT_HWPOISON
> > mm_fault_error
> > do_sigbus
> > // kill the process
> >
> >
> > The hugetlb allocation failure comes from the following flow:
> >
> > CPU 0 CPU 1
> >
> > mmap a hugetlb file
> > // reserve all free page but don't fault-in
> > soft offline
> > get_any_page
> > // find the hugetlb is free
> > soft_offline_free_page
> > // set hwpoison flag
> > dissolve_free_huge_page
> > // fail because all free hugepages are reserved
> > page fault
> > ...
> > hugetlb_fault
> > hugetlb_no_page
> > alloc_huge_page
> > ...
> > dequeue_huge_page_node_exact
> > // ignore hwpoisoned hugepage
> > // and finally fail due to no-mem
> >
> > The root cause of this is that current soft-offline code is written
> > based on an assumption that PageHWPoison flag should beset at first to
> > avoid accessing the corrupted data. This makes sense for memory_failure()
> > or hard offline, but does not for soft offline because soft offline is
> > about corrected (not uncorrected) error and is safe from data lost.
> > This patch changes soft offline semantics where it sets PageHWPoison flag
> > only after containment of the error page completes successfully.
>
> Could you please expand on the worklow here please? The code is really
> hard to grasp. I must be missing something because the thing shouldn't
> be really complicated. Either the page is in the free pool and you just
> remove it from the allocator (with hugetlb asking for a new hugeltb page
> to guaratee reserves) or it is used and you just migrate the content to
> a new page (again with the hugetlb reserves consideration). Why should
> PageHWPoison flag ordering make any relevance?

(Considering soft offlining of a free hugepage,)
PageHWPoison is set first before this patch, which is racy with the
hugetlb fault code because it's not protected by hugetlb_lock.

Originally this was written in a similar manner to hard-offline, where
the race is accepted and the PageHWPoison flag is set as soon as possible.
But that turned out to be neither necessary nor correct, because soft
offline is supposed to be less aggressive and failure is OK.

So this patch is suggesting to make soft-offline less aggressive by
moving SetPageHWPoison into the lock.

>
> Do I get it right that the only difference between the hard and soft
> offlining is that hugetlb reserves might break for the former while not
> for the latter

Correct.

> and that the failed migration kills all owners for the
> former while not for latter?

Hard-offline doesn't cause any page migration because the data is already
lost, but yes it can kill the owners.
Soft-offline never kills processes even if it fails (due to migration failure
or some other reason).

I listed below some common points and differences between hard-offline
and soft-offline.

common points
- they are both contained by PageHWPoison flag,
- error is injected via similar interfaces.

differences
- the data on the page is considered lost in hard offline, but is not
in soft offline,
- hard offline likely kills the affected processes, but soft offline
never kills processes,
- soft offline causes page migration, but hard offline does not,
- hard offline prioritizes preventing consumption of broken data while
accepting some races, and soft offline prioritizes not impacting
userspace while accepting failure.

It looks to me that there are more differences than common points.

Thanks,
Naoya Horiguchi

>
> > Reported-by: Xishi Qiu <[email protected]>
> > Suggested-by: Xishi Qiu <[email protected]>
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > changelog v1->v2:
> > - don't use set_hwpoison_free_buddy_page() (not defined yet)
> > - updated comment in soft_offline_huge_page()
> > ---
> > mm/hugetlb.c | 11 +++++------
> > mm/memory-failure.c | 24 ++++++++++++++++++------
> > mm/migrate.c | 2 --
> > 3 files changed, 23 insertions(+), 14 deletions(-)
> >
> > diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> > index 430be42..937c142 100644
> > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c
> > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> > @@ -1479,22 +1479,20 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> > /*
> > * Dissolve a given free hugepage into free buddy pages. This function does
> > * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
> > - * number of free hugepages would be reduced below the number of reserved
> > - * hugepages.
> > + * dissolution fails because a give page is not a free hugepage, or because
> > + * free hugepages are fully reserved.
> > */
> > int dissolve_free_huge_page(struct page *page)
> > {
> > - int rc = 0;
> > + int rc = -EBUSY;
> >
> > spin_lock(&hugetlb_lock);
> > if (PageHuge(page) && !page_count(page)) {
> > struct page *head = compound_head(page);
> > struct hstate *h = page_hstate(head);
> > int nid = page_to_nid(head);
> > - if (h->free_huge_pages - h->resv_huge_pages == 0) {
> > - rc = -EBUSY;
> > + if (h->free_huge_pages - h->resv_huge_pages == 0)
> > goto out;
> > - }
> > /*
> > * Move PageHWPoison flag from head page to the raw error page,
> > * which makes any subpages rather than the error page reusable.
> > @@ -1508,6 +1506,7 @@ int dissolve_free_huge_page(struct page *page)
> > h->free_huge_pages_node[nid]--;
> > h->max_huge_pages--;
> > update_and_free_page(h, head);
> > + rc = 0;
> > }
> > out:
> > spin_unlock(&hugetlb_lock);
> > diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> > index 9d142b9..9b77f85 100644
> > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
> > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> > @@ -1598,8 +1598,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> > if (ret > 0)
> > ret = -EIO;
> > } else {
> > - if (PageHuge(page))
> > - dissolve_free_huge_page(page);
> > + /*
> > + * We set PG_hwpoison only when the migration source hugepage
> > + * was successfully dissolved, because otherwise hwpoisoned
> > + * hugepage remains on free hugepage list. The allocator ignores
> > + * such a hwpoisoned page so it's never allocated, but it could
> > + * kill a process because of no-memory rather than hwpoison.
> > + * Soft-offline never impacts the userspace, so this is
> > + * undesired.
> > + */
> > + ret = dissolve_free_huge_page(page);
> > + if (!ret) {
> > + if (!TestSetPageHWPoison(page))
> > + num_poisoned_pages_inc();
> > + }
> > }
> > return ret;
> > }
> > @@ -1715,13 +1727,13 @@ static int soft_offline_in_use_page(struct page *page, int flags)
> >
> > static void soft_offline_free_page(struct page *page)
> > {
> > + int rc = 0;
> > struct page *head = compound_head(page);
> >
> > - if (!TestSetPageHWPoison(head)) {
> > + if (PageHuge(head))
> > + rc = dissolve_free_huge_page(page);
> > + if (!rc && !TestSetPageHWPoison(page))
> > num_poisoned_pages_inc();
> > - if (PageHuge(head))
> > - dissolve_free_huge_page(page);
> > - }
> > }
> >
> > /**
> > diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> > index 198af42..3ae213b 100644
> > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
> > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> > @@ -1318,8 +1318,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> > out:
> > if (rc != -EAGAIN)
> > putback_active_hugepage(hpage);
> > - if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage))
> > - num_poisoned_pages_inc();
> >
> > /*
> > * If migration was not successful and there's a freeing callback, use
> > --
> > 2.7.0
>
> --
> Michal Hocko
> SUSE Labs
>

2018-07-18 01:29:40

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Tue, Jul 17, 2018 at 01:10:39PM -0700, Mike Kravetz wrote:
> On 07/17/2018 07:27 AM, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> >> There's a race condition between soft offline and hugetlb_fault which
> >> causes unexpected process killing and/or hugetlb allocation failure.
> >>
> >> The process killing is caused by the following flow:
> >>
> >> CPU 0                  CPU 1                  CPU 2
> >>
> >> soft offline
> >>   get_any_page
> >>   // find the hugetlb is free
> >>                        mmap a hugetlb file
> >>                        page fault
> >>                        ...
> >>                          hugetlb_fault
> >>                            hugetlb_no_page
> >>                              alloc_huge_page
> >>                              // succeed
> >>   soft_offline_free_page
> >>   // set hwpoison flag
> >>                                               mmap the hugetlb file
> >>                                               page fault
> >>                                               ...
> >>                                                 hugetlb_fault
> >>                                                   hugetlb_no_page
> >>                                                     find_lock_page
> >>                                                     return VM_FAULT_HWPOISON
> >>                                                 mm_fault_error
> >>                                                   do_sigbus
> >>                                                   // kill the process
> >>
> >>
> >> The hugetlb allocation failure comes from the following flow:
> >>
> >> CPU 0                        CPU 1
> >>
> >> mmap a hugetlb file
> >> // reserve all free page but don't fault-in
> >>                              soft offline
> >>                                get_any_page
> >>                                // find the hugetlb is free
> >>                                soft_offline_free_page
> >>                                // set hwpoison flag
> >>                                  dissolve_free_huge_page
> >>                                  // fail because all free hugepages are reserved
> >> page fault
> >> ...
> >>   hugetlb_fault
> >>     hugetlb_no_page
> >>       alloc_huge_page
> >>         ...
> >>           dequeue_huge_page_node_exact
> >>           // ignore hwpoisoned hugepage
> >>           // and finally fail due to no-mem
> >>
> >> The root cause of this is that the current soft-offline code is written
> >> based on the assumption that the PageHWPoison flag should be set first to
> >> avoid accessing the corrupted data. This makes sense for memory_failure()
> >> or hard offline, but not for soft offline, because soft offline handles
> >> corrected (not uncorrected) errors and is safe from data loss.
> >> This patch changes the soft offline semantics so that the PageHWPoison
> >> flag is set only after containment of the error page completes successfully.
> >
> > Could you please expand on the workflow here? The code is really
> > hard to grasp. I must be missing something, because the thing shouldn't
> > really be complicated. Either the page is in the free pool and you just
> > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > to guarantee reserves), or it is used and you just migrate the content to
> > a new page (again with the hugetlb reserves consideration). Why should
> > PageHWPoison flag ordering have any relevance?
>
> My understanding may not be correct, but just looking at the current code
> for soft_offline_free_page helps me understand:
>
> static void soft_offline_free_page(struct page *page)
> {
>     struct page *head = compound_head(page);
>
>     if (!TestSetPageHWPoison(head)) {
>         num_poisoned_pages_inc();
>         if (PageHuge(head))
>             dissolve_free_huge_page(page);
>     }
> }
>
> The HWPoison flag is set before even checking whether the huge
> page can be dissolved. So, someone could attempt to pull the page
> off the free list (if free) or fault/map it (if already associated with
> a file), which leads to the failures described above. The patches ensure
> that we only set HWPoison after successfully dissolving the page. At least
> that is how I understand it.

Thanks for elaborating, this is correct.

>
> It seems that soft_offline_free_page can be called for in-use pages.
> Certainly, that is the case in the first workflow above. With the
> suggested changes, I think this is OK for huge pages. However, it seems
> that setting HWPoison on an in-use non-huge page could cause issues?

Just after dissolve_free_huge_page() returns, the target page is just a
free buddy page without PageHWPoison set. If this page is allocated
immediately, that's the "migration succeeded, but soft offline failed"
case, so no problem.
Certainly, there is also a race between checking TestSetPageHWPoison and
page allocation; that issue is handled in patch 2/2.

> While looking at the code, I noticed this comment in __get_any_page()
> /*
> * When the target page is a free hugepage, just remove it
> * from free hugepage list.
> */
> Did that apply to some code that was removed? It does not seem to make
> any sense in that routine.

This comment is completely obsolete, I'll remove this one.

Thanks,
Naoya Horiguchi

2018-07-18 01:50:20

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Wed, Jul 18, 2018 at 12:55:29AM +0000, Horiguchi Naoya wrote:
> On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > [...]
> >
> > Could you please expand on the workflow here? The code is really
> > hard to grasp. I must be missing something, because the thing shouldn't
> > really be complicated. Either the page is in the free pool and you just
> > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > to guarantee reserves), or it is used and you just migrate the content to
> > a new page (again with the hugetlb reserves consideration). Why should
> > PageHWPoison flag ordering have any relevance?
>
> (Considering soft-offlining a free hugepage:)
> Before this patch, PageHWPoison is set first, which is racy with the
> hugetlb fault code because it's not protected by hugetlb_lock.
>
> Originally this was written in a similar manner to hard-offline, where
> the race is accepted and the PageHWPoison flag is set as soon as possible.
> But that turned out to be neither necessary nor correct, because soft
> offline is supposed to be less aggressive and failure is acceptable.
>
> So this patch is suggesting to make soft-offline less aggressive
> by moving SetPageHWPoison into the lock.

My apologies, this part of the reasoning was incorrect. What patch 1/2
actually does is transform the issue into the same kind of race that
normal pages already have, which is then solved by patch 2/2. After
patch 1/2, soft offline never sets PageHWPoison on a hugepage.

Thanks,
Naoya Horiguchi

>
> >
> > Do I get it right that the only difference between the hard and soft
> > offlining is that hugetlb reserves might break for the former while not
> > for the latter
>
> Correct.
>
> > and that the failed migration kills all owners for the
> > former while not for latter?
>
> Hard-offline doesn't cause any page migration because the data is already
> lost, but yes, it can kill the owners.
> Soft-offline never kills processes even if it fails (due to migration failure
> or some other reason).
>
> I listed below some common points and differences between hard-offline
> and soft-offline.
>
> common points
> - they are both contained by the PageHWPoison flag,
> - errors are injected via similar interfaces.
>
> differences
> - the data on the page is considered lost in hard offline, but is not
> in soft offline,
> - hard offline likely kills the affected processes, but soft offline
> never kills processes,
> - soft offline causes page migration, but hard offline does not,
> - hard offline prioritizes preventing consumption of broken data while
> accepting some races, and soft offline prioritizes not impacting
> userspace while accepting failure.
>
> Looks to me like there are more differences than common points.

2018-07-18 02:37:59

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On 07/17/2018 06:28 PM, Naoya Horiguchi wrote:
> On Tue, Jul 17, 2018 at 01:10:39PM -0700, Mike Kravetz wrote:
>> It seems that soft_offline_free_page can be called for in-use pages.
>> Certainly, that is the case in the first workflow above. With the
>> suggested changes, I think this is OK for huge pages. However, it seems
>> that setting HWPoison on an in-use non-huge page could cause issues?
>
> Just after dissolve_free_huge_page() returns, the target page is just a
> free buddy page without PageHWPoison set. If this page is allocated
> immediately, that's the "migration succeeded, but soft offline failed"
> case, so no problem.
> Certainly, there is also a race between checking TestSetPageHWPoison and
> page allocation; that issue is handled in patch 2/2.

Yes, the issue of calling soft_offline_free_page() for an 'in use' page
is handled in the new routine set_hwpoison_free_buddy_page() of patch 2/2.

Thanks,
--
Mike Kravetz

2018-07-18 08:52:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > [...]
> >
> > Could you please expand on the workflow here? The code is really
> > hard to grasp. I must be missing something, because the thing shouldn't
> > really be complicated. Either the page is in the free pool and you just
> > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > to guarantee reserves), or it is used and you just migrate the content to
> > a new page (again with the hugetlb reserves consideration). Why should
> > PageHWPoison flag ordering have any relevance?
>
> (Considering soft-offlining a free hugepage:)
> Before this patch, PageHWPoison is set first, which is racy with the
> hugetlb fault code because it's not protected by hugetlb_lock.
>
> Originally this was written in a similar manner to hard-offline, where
> the race is accepted and the PageHWPoison flag is set as soon as possible.
> But that turned out to be neither necessary nor correct, because soft
> offline is supposed to be less aggressive and failure is acceptable.

OK

> So this patch is suggesting to make soft-offline less aggressive by
> moving SetPageHWPoison into the lock.

I guess I still do not understand why we should even care about the
ordering of the HWPoison flag setting. Why can't we simply have the
following code flow? Or maybe we are doing that already and I just do
not follow the code:

soft_offline
  check page_count
  - free
    - normal page: remove from the allocator
    - hugetlb: allocate a new hugetlb page && remove it from the pool
  - used: migrate to a new page && never release the old one

Why do we even need the HWPoison flag here? Everything can be completely
transparent to the application. It shouldn't fail, from what I
understood.

> > Do I get it right that the only difference between the hard and soft
> > offlining is that hugetlb reserves might break for the former while not
> > for the latter
>
> Correct.
>
> > and that the failed migration kills all owners for the
> > former while not for latter?
>
> Hard-offline doesn't cause any page migration because the data is already
> lost, but yes, it can kill the owners.
> Soft-offline never kills processes even if it fails (due to migration failure
> or some other reason).
>
> I listed below some common points and differences between hard-offline
> and soft-offline.
>
> common points
> - they are both contained by the PageHWPoison flag,
> - errors are injected via similar interfaces.
>
> differences
> - the data on the page is considered lost in hard offline, but is not
> in soft offline,
> - hard offline likely kills the affected processes, but soft offline
> never kills processes,
> - soft offline causes page migration, but hard offline does not,
> - hard offline prioritizes preventing consumption of broken data while
> accepting some races, and soft offline prioritizes not impacting
> userspace while accepting failure.
>
> Looks to me like there are more differences than common points.

Thanks for the summary. It certainly helped me
--
Michal Hocko
SUSE Labs

2018-07-19 06:22:12

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > [...]
> > >
> > > Could you please expand on the workflow here? The code is really
> > > hard to grasp. I must be missing something, because the thing shouldn't
> > > really be complicated. Either the page is in the free pool and you just
> > > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > > to guarantee reserves), or it is used and you just migrate the content to
> > > a new page (again with the hugetlb reserves consideration). Why should
> > > PageHWPoison flag ordering have any relevance?
> >
> > (Considering soft-offlining a free hugepage:)
> > Before this patch, PageHWPoison is set first, which is racy with the
> > hugetlb fault code because it's not protected by hugetlb_lock.
> >
> > Originally this was written in a similar manner to hard-offline, where
> > the race is accepted and the PageHWPoison flag is set as soon as possible.
> > But that turned out to be neither necessary nor correct, because soft
> > offline is supposed to be less aggressive and failure is acceptable.
>
> OK
>
> > So this patch is suggesting to make soft-offline less aggressive by
> > moving SetPageHWPoison into the lock.
>
> I guess I still do not understand why we should even care about the
> ordering of the HWPoison flag setting. Why can't we simply have the
> following code flow? Or maybe we are doing that already and I just do
> not follow the code:
>
> soft_offline
>   check page_count
>   - free
>     - normal page: remove from the allocator
>     - hugetlb: allocate a new hugetlb page && remove it from the pool
>   - used: migrate to a new page && never release the old one
>
> Why do we even need the HWPoison flag here? Everything can be completely
> transparent to the application. It shouldn't fail, from what I
> understood.

The PageHWPoison flag is used for the 'remove from the allocator' part,
which looks like this:

static inline
struct page *rmqueue(
...
    do {
        page = NULL;
        if (alloc_flags & ALLOC_HARDER) {
            page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
            if (page)
                trace_mm_page_alloc_zone_locked(page, order, migratetype);
        }
        if (!page)
            page = __rmqueue(zone, order, migratetype);
    } while (page && check_new_pages(page, order));

check_new_pages() returns true if the page taken from the free list
contains a hwpoisoned page, so the allocator iterates another round to
get another page.

There's no function that can be called from outside the allocator to
remove a page from it. So the actual page removal is done at allocation
time, not at error-handling time. That's the reason why we need
PageHWPoison.

Thanks,
Naoya Horiguchi


> [...]

2018-07-19 07:16:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > > [...]
> > > >
> > > > Could you please expand on the workflow here? The code is really
> > > > hard to grasp. I must be missing something, because the thing shouldn't
> > > > really be complicated. Either the page is in the free pool and you just
> > > > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > > > to guarantee reserves), or it is used and you just migrate the content to
> > > > a new page (again with the hugetlb reserves consideration). Why should
> > > > PageHWPoison flag ordering have any relevance?
> > >
> > > (Considering soft-offlining a free hugepage:)
> > > Before this patch, PageHWPoison is set first, which is racy with the
> > > hugetlb fault code because it's not protected by hugetlb_lock.
> > >
> > > Originally this was written in a similar manner to hard-offline, where
> > > the race is accepted and the PageHWPoison flag is set as soon as possible.
> > > But that turned out to be neither necessary nor correct, because soft
> > > offline is supposed to be less aggressive and failure is acceptable.
> >
> > OK
> >
> > > So this patch is suggesting to make soft-offline less aggressive by
> > > moving SetPageHWPoison into the lock.
> >
> > I guess I still do not understand why we should even care about the
> > ordering of the HWPoison flag setting. Why can't we simply have the
> > following code flow? Or maybe we are doing that already and I just do
> > not follow the code:
> >
> > soft_offline
> >   check page_count
> >   - free
> >     - normal page: remove from the allocator
> >     - hugetlb: allocate a new hugetlb page && remove it from the pool
> >   - used: migrate to a new page && never release the old one
> >
> > Why do we even need the HWPoison flag here? Everything can be completely
> > transparent to the application. It shouldn't fail, from what I
> > understood.
>
> The PageHWPoison flag is used for the 'remove from the allocator' part,
> which looks like this:
>
> static inline
> struct page *rmqueue(
> ...
>     do {
>         page = NULL;
>         if (alloc_flags & ALLOC_HARDER) {
>             page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>             if (page)
>                 trace_mm_page_alloc_zone_locked(page, order, migratetype);
>         }
>         if (!page)
>             page = __rmqueue(zone, order, migratetype);
>     } while (page && check_new_pages(page, order));
>
> check_new_pages() returns true if the page taken from the free list
> contains a hwpoisoned page, so the allocator iterates another round to
> get another page.
>
> There's no function that can be called from outside the allocator to
> remove a page from it. So the actual page removal is done at allocation
> time, not at error-handling time. That's the reason why we need
> PageHWPoison.

hwpoison is internal mm functionality, so why can't we simply add a
function that does that? I find the PageHWPoison usage here creating
more complication than real good. Or am I missing something?
--
Michal Hocko
SUSE Labs

2018-07-19 08:12:03

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > > > [...]
> > > > >
> > > > > Could you please expand on the workflow here? The code is really
> > > > > hard to grasp. I must be missing something, because the thing shouldn't
> > > > > really be complicated. Either the page is in the free pool and you just
> > > > > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > > > > to guarantee reserves), or it is used and you just migrate the content to
> > > > > a new page (again with the hugetlb reserves consideration). Why should
> > > > > PageHWPoison flag ordering have any relevance?
> > > >
> > > > (Considering soft offlining free hugepage,)
> > > > PageHWPoison is set at first before this patch, which is racy with
> > > > hugetlb fault code because it's not protected by hugetlb_lock.
> > > >
> > > > Originally this was written in a similar manner to hard-offline, where
> > > > the race is accepted and a PageHWPoison flag is set as soon as possible.
> > > > But actually that was found to be neither necessary nor correct, because soft offline is
> > > > supposed to be less aggressive and failure is OK.
> > >
> > > OK
> > >
> > > > So this patch is suggesting to make soft-offline less aggressive by
> > > > moving SetPageHWPoison into the lock.
> > >
> > > I guess I still do not understand why we should even care about the
> > > ordering of the HWPoison flag setting. Why cannot we simply have the
> > > following code flow? Or maybe we are doing that already I just do not
> > > follow the code
> > >
> > > soft_offline
> > > check page_count
> > > - free - normal page - remove from the allocator
> > > - hugetlb - allocate a new hugetlb page && remove from the pool
> > > - used - migrate to a new page && never release the old one
> > >
> > > Why do we even need HWPoison flag here? Everything can be completely
> > > transparent to the application. It shouldn't fail from what I
> > > understood.
> >
> > The PageHWPoison flag is used for the 'remove from the allocator' part
> > which is like below:
> >
> > static inline
> > struct page *rmqueue(
> > ...
> > do {
> > page = NULL;
> > if (alloc_flags & ALLOC_HARDER) {
> > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > if (page)
> > trace_mm_page_alloc_zone_locked(page, order, migratetype);
> > }
> > if (!page)
> > page = __rmqueue(zone, order, migratetype);
> > } while (page && check_new_pages(page, order));
> >
> > check_new_pages() returns true if the page taken from the free list contains
> > a hwpoison page, so the allocator iterates another round to get
> > another page.
> >
> > There's no function that can be called from outside the allocator to remove
> > a page from the allocator. So the actual page removal is done at allocation time,
> > not at error handling time. That's the reason why we need PageHWPoison.
>
> hwpoison is an internal mm functionality so why cannot we simply add a
> function that would do that?

That's one possible solution.

I know about another downside in current implementation.
If a hwpoison page is found during a high-order page allocation,
all 2^order pages (not only the hwpoison page) are removed from
buddy because of the code quoted above. And these leaked pages
are never returned to the freelist, even by unpoison_memory().
If we have a page removal function which properly splits high-order
free pages into lower-order pages, this problem is avoided.

OTOH PageHWPoison still has a role in reporting errors to userspace.
Without it, unpoison_memory() doesn't work.

Thanks,
Naoya Horiguchi

> I find the PageHWPoison usage here doing
> more complications than real good. Or am I missing something?

2018-07-19 08:28:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote:
> On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
[...]
> > > > Why do we even need HWPoison flag here? Everything can be completely
> > > > transparent to the application. It shouldn't fail from what I
> > > > understood.
> > >
> > > PageHWPoison flag is used to the 'remove from the allocator' part
> > > which is like below:
> > >
> > > static inline
> > > struct page *rmqueue(
> > > ...
> > > do {
> > > page = NULL;
> > > if (alloc_flags & ALLOC_HARDER) {
> > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > > if (page)
> > > trace_mm_page_alloc_zone_locked(page, order, migratetype);
> > > }
> > > if (!page)
> > > page = __rmqueue(zone, order, migratetype);
> > > } while (page && check_new_pages(page, order));
> > >
> > > check_new_pages() returns true if the page taken from free list has
> > > a hwpoison page so that the allocator iterates another round to get
> > > another page.
> > >
> > > There's no function that can be called from outside allocator to remove
> > > a page in allocator. So actual page removal is done at allocation time,
> > > not at error handling time. That's the reason why we need PageHWPoison.
> >
> > hwpoison is an internal mm functionality so why cannot we simply add a
> > function that would do that?
>
> That's one possible solution.

I would much prefer that over adding overhead (albeit small) to
the page allocator directly. HWPoison should be a really rare event so
why should everybody pay the price? I would much rather see that the
poison path pays the additional price.

> I know about another downside in current implementation.
> If a hwpoison page is found during high order page allocation,
> all 2^order pages (not only hwpoison page) are removed from
> buddy because of the above quoted code. And these leaked pages
> are never returned to freelist even with unpoison_memory().
> If we have a page removal function which properly splits high order
> free pages into lower order pages, this problem is avoided.

Even more reason to move to a new scheme.

> OTOH PageHWPoison still has a role to report error to userspace.
> Without it unpoison_memory() doesn't work.

Sure but we do not really need a special page flag for that. We know the
page is not reachable other than via pfn walkers. If you make the page
reserved and note the fact it has been poisoned in the past then you can
emulate the missing functionality.

Btw. do we really need unpoisoning functionality? Who is really using
it, other than some tests? How does the memory become OK again? Don't we
really need to go through physical hotremove & hotadd to clean the
poison status?

Thanks!
--
Michal Hocko
SUSE Labs

2018-07-19 09:24:56

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Thu, Jul 19, 2018 at 10:27:43AM +0200, Michal Hocko wrote:
> On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote:
> > On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> > > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> [...]
> > > > > Why do we even need HWPoison flag here? Everything can be completely
> > > > > transparent to the application. It shouldn't fail from what I
> > > > > understood.
> > > >
> > > > PageHWPoison flag is used to the 'remove from the allocator' part
> > > > which is like below:
> > > >
> > > > static inline
> > > > struct page *rmqueue(
> > > > ...
> > > > do {
> > > > page = NULL;
> > > > if (alloc_flags & ALLOC_HARDER) {
> > > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > > > if (page)
> > > > trace_mm_page_alloc_zone_locked(page, order, migratetype);
> > > > }
> > > > if (!page)
> > > > page = __rmqueue(zone, order, migratetype);
> > > > } while (page && check_new_pages(page, order));
> > > >
> > > > check_new_pages() returns true if the page taken from free list has
> > > > a hwpoison page so that the allocator iterates another round to get
> > > > another page.
> > > >
> > > > There's no function that can be called from outside allocator to remove
> > > > a page in allocator. So actual page removal is done at allocation time,
> > > > not at error handling time. That's the reason why we need PageHWPoison.
> > >
> > > hwpoison is an internal mm functionality so why cannot we simply add a
> > > function that would do that?
> >
> > That's one possible solution.
>
> I would prefer that much more than add an overhead (albeit small) into
> the page allocator directly. HWPoison should be a really rare event so
> why should everybody pay the price? I would much rather see that the
> poison path pays the additional price.

Yes, that's more maintainable.

>
> > I know about another downside in current implementation.
> > If a hwpoison page is found during high order page allocation,
> > all 2^order pages (not only hwpoison page) are removed from
> > buddy because of the above quoted code. And these leaked pages
> > are never returned to freelist even with unpoison_memory().
> > If we have a page removal function which properly splits high order
> > free pages into lower order pages, this problem is avoided.
>
> Even more reason to move to a new scheme.
>
> > OTOH PageHWPoison still has a role to report error to userspace.
> > Without it unpoison_memory() doesn't work.
>
> Sure but we do not really need a special page flag for that. We know the
> page is not reachable other than via pfn walkers. If you make the page
> reserved and note the fact it has been poisoned in the past then you can
> emulate the missing functionality.
>
> Btw. do we really need unpoisoning functionality? Who is really using
> it, other than some tests?

None, as far as I know.

> How does the memory become OK again?

For hard-offlined in-use pages which are assumed to be pinned,
we clear the PageHWPoison flag and unpin the page to return it to buddy.
For other cases, we simply clear the PageHWPoison flag.
Unless the page was already caught by check_new_pages() before the
unpoison, the page is reusable.

Sometimes error handling fails and the error page might end up in an
unexpected state (like an additional refcount/mapcount).
Unpoison just fails on such pages.

> Don't we
> really need to go through physical hotremove & hotadd to clean the
> poison status?

hotremove/hotadd can be a user of unpoison, but I think simply
reinitializing struct pages is easier.

Thanks,
Naoya Horiguchi

2018-07-19 10:34:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages

On Thu 19-07-18 09:22:47, Naoya Horiguchi wrote:
> On Thu, Jul 19, 2018 at 10:27:43AM +0200, Michal Hocko wrote:
> > On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote:
> > > On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> > > > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > > > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > [...]
> > > > > > Why do we even need HWPoison flag here? Everything can be completely
> > > > > > transparent to the application. It shouldn't fail from what I
> > > > > > understood.
> > > > >
> > > > > PageHWPoison flag is used to the 'remove from the allocator' part
> > > > > which is like below:
> > > > >
> > > > > static inline
> > > > > struct page *rmqueue(
> > > > > ...
> > > > > do {
> > > > > page = NULL;
> > > > > if (alloc_flags & ALLOC_HARDER) {
> > > > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > > > > if (page)
> > > > > trace_mm_page_alloc_zone_locked(page, order, migratetype);
> > > > > }
> > > > > if (!page)
> > > > > page = __rmqueue(zone, order, migratetype);
> > > > > } while (page && check_new_pages(page, order));
> > > > >
> > > > > check_new_pages() returns true if the page taken from free list has
> > > > > a hwpoison page so that the allocator iterates another round to get
> > > > > another page.
> > > > >
> > > > > There's no function that can be called from outside allocator to remove
> > > > > a page in allocator. So actual page removal is done at allocation time,
> > > > > not at error handling time. That's the reason why we need PageHWPoison.
> > > >
> > > > hwpoison is an internal mm functionality so why cannot we simply add a
> > > > function that would do that?
> > >
> > > That's one possible solution.
> >
> > I would prefer that much more than add an overhead (albeit small) into
> > the page allocator directly. HWPoison should be a really rare event so
> > why should everybody pay the price? I would much rather see that the
> > poison path pays the additional price.
>
> Yes, that's more maintainable.
>
> >
> > > I know about another downside in current implementation.
> > > If a hwpoison page is found during high order page allocation,
> > > all 2^order pages (not only hwpoison page) are removed from
> > > buddy because of the above quoted code. And these leaked pages
> > > are never returned to freelist even with unpoison_memory().
> > > If we have a page removal function which properly splits high order
> > > free pages into lower order pages, this problem is avoided.
> >
> > Even more reason to move to a new scheme.
> >
> > > OTOH PageHWPoison still has a role to report error to userspace.
> > > Without it unpoison_memory() doesn't work.
> >
> > Sure but we do not really need a special page flag for that. We know the
> > page is not reachable other than via pfn walkers. If you make the page
> > reserved and note the fact it has been poisoned in the past then you can
> > emulate the missing functionality.
> >
> > Btw. do we really need unpoisoning functionality? Who is really using
> > it, other than some tests?
>
> None, as long as I know.

Then why do we keep it?

> > How does the memory become OK again?
>
> For hard-offlined in-use pages which are assumed to be pinned,
> we clear the PageHWPoison flag and unpin the page to return it to buddy.
> For other cases, we simply clear the PageHWPoison flag.
> Unless the page is checked by check_new_pages() before unpoison,
> the page is reusable.

No, I mean how does the memory become OK again. Is there any in-place
repair technique?

> Sometimes error handling fails and the error page might turn into
> unexpected state (like additional refcount/mapcount).
> Unpoison just fails on such pages.
>
> > Don't we
> > really need to go through physical hotremove & hotadd to clean the
> > poison status?
>
> hotremove/hotadd can be a user of unpoison, but I think simply
> reinitializing struct pages is easiler.

Sure, that is what I meant actually. I do not really see any way to make
the memory OK again other than to replace the faulty dimm and put it
back online. The state is not really preserved for that so having a
sticky struct page state doesn't make much sense.

So from what I understood, we only need to track the poison state to
allow hotremoving that memory without stumbling over an unexpected
page or trying to do something with it. All other actions can be done
synchronously (migrate vs. kill users).
--
Michal Hocko
SUSE Labs

2018-08-15 23:27:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

On Tue, 17 Jul 2018 14:32:30 +0900 Naoya Horiguchi <[email protected]> wrote:

> I've updated the patchset based on feedbacks:
>
> - updated comments (from Andrew),
> - moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
> which is necessary to check the return code of set_hwpoison_free_buddy_page(),
> - lkp bot reported a build error when only 1/2 is applied.
>
> > mm/memory-failure.c: In function 'soft_offline_huge_page':
> > >> mm/memory-failure.c:1610:8: error: implicit declaration of function
> > 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
> > [-Werror=implicit-function-declaration]
> > if (set_hwpoison_free_buddy_page(page))
> > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > is_free_buddy_page
> > cc1: some warnings being treated as errors
>
> set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
> in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
> will fix this.
>
> v1: https://lkml.org/lkml/2018/7/12/968
>

Quite a bit of discussion on these two, but no actual acks or
review-by's?

2018-08-22 01:50:17

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

On Wed, Aug 15, 2018 at 03:43:34PM -0700, Andrew Morton wrote:
> On Tue, 17 Jul 2018 14:32:30 +0900 Naoya Horiguchi <[email protected]> wrote:
>
> > I've updated the patchset based on feedbacks:
> >
> > - updated comments (from Andrew),
> > - moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
> > which is necessary to check the return code of set_hwpoison_free_buddy_page(),
> > - lkp bot reported a build error when only 1/2 is applied.
> >
> > > mm/memory-failure.c: In function 'soft_offline_huge_page':
> > > >> mm/memory-failure.c:1610:8: error: implicit declaration of function
> > > 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
> > > [-Werror=implicit-function-declaration]
> > > if (set_hwpoison_free_buddy_page(page))
> > > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > is_free_buddy_page
> > > cc1: some warnings being treated as errors
> >
> > set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
> > in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
> > will fix this.
> >
> > v1: https://lkml.org/lkml/2018/7/12/968
> >
>
> Quite a bit of discussion on these two, but no actual acks or
> review-by's?

Really sorry for the late response.
Xishi provided feedback on the previous version, but no final ack/reviewed-by.
This fix should work for the reported issue, but rewriting soft-offlining
without the PageHWPoison flag would be the better fix (no actual patch yet).
I'm not sure whether this patch should go to mainline immediately.

Thanks,
Naoya Horiguchi

2018-08-22 02:27:04

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

On 08/21/2018 06:37 PM, Naoya Horiguchi wrote:
> On Wed, Aug 15, 2018 at 03:43:34PM -0700, Andrew Morton wrote:
>> On Tue, 17 Jul 2018 14:32:30 +0900 Naoya Horiguchi <[email protected]> wrote:
>>
>>> I've updated the patchset based on feedbacks:
>>>
>>> - updated comments (from Andrew),
>>> - moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
>>> which is necessary to check the return code of set_hwpoison_free_buddy_page(),
>>> - lkp bot reported a build error when only 1/2 is applied.
>>>
>>> > mm/memory-failure.c: In function 'soft_offline_huge_page':
>>> > >> mm/memory-failure.c:1610:8: error: implicit declaration of function
>>> > 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
>>> > [-Werror=implicit-function-declaration]
>>> > if (set_hwpoison_free_buddy_page(page))
>>> > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> > is_free_buddy_page
>>> > cc1: some warnings being treated as errors
>>>
>>> set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
>>> in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
>>> will fix this.
>>>
>>> v1: https://lkml.org/lkml/2018/7/12/968
>>>
>>
>> Quite a bit of discussion on these two, but no actual acks or
>> review-by's?
>
> Really sorry for late response.
> Xishi provided feedback on previous version, but no final ack/reviewed-by.
> This fix should work on the reported issue, but rewriting soft-offlining
> without PageHWPoison flag would be the better fix (no actual patch yet.)
> I'm not sure this patch should go to mainline immediately.

FWIW - The 'migration of huge PMD shared pages' issue I am working on was
originally triggered via soft-offline. While working on the issue, I tried
to exercise huge page soft-offline really hard to recreate the issue and
validate a fix. However, I was more likely to hit the soft-offline race(s)
your patches address. Therefore, I applied your patches to focus my testing
and validation on the migration of huge PMD shared pages issue. That is sort
of a Tested-by :).

Just wanted to point out that it was pretty easy to hit this issue. It
was easier than the issue I am working on. And, the issue I am trying to
address was seen in a real customer environment. So, I would not be
surprised to see this issue in real customer environments as well.

If you (or others) think we should go forward with these patches, I can
spend some time doing a review. Already did a 'quick look' some time back.
--
Mike Kravetz

2018-08-22 08:03:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

On Wed 22-08-18 01:37:48, Naoya Horiguchi wrote:
> On Wed, Aug 15, 2018 at 03:43:34PM -0700, Andrew Morton wrote:
> > On Tue, 17 Jul 2018 14:32:30 +0900 Naoya Horiguchi <[email protected]> wrote:
> >
> > > I've updated the patchset based on feedbacks:
> > >
> > > - updated comments (from Andrew),
> > > - moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
> > > which is necessary to check the return code of set_hwpoison_free_buddy_page(),
> > > - lkp bot reported a build error when only 1/2 is applied.
> > >
> > > > mm/memory-failure.c: In function 'soft_offline_huge_page':
> > > > >> mm/memory-failure.c:1610:8: error: implicit declaration of function
> > > > 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
> > > > [-Werror=implicit-function-declaration]
> > > > if (set_hwpoison_free_buddy_page(page))
> > > > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > is_free_buddy_page
> > > > cc1: some warnings being treated as errors
> > >
> > > set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
> > > in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
> > > will fix this.
> > >
> > > v1: https://lkml.org/lkml/2018/7/12/968
> > >
> >
> > Quite a bit of discussion on these two, but no actual acks or
> > review-by's?
>
> Really sorry for late response.
> Xishi provided feedback on previous version, but no final ack/reviewed-by.
> This fix should work on the reported issue, but rewriting soft-offlining
> without PageHWPoison flag would be the better fix (no actual patch yet.)

If we can go with the latter then I would obviously prefer that. I cannot
promise to work on the patch though. I can help with reviewing of
course.

If this is important enough that people are hitting the issue in normal
workloads then sure, let's go with the simple fix and continue on top of
that.
--
Michal Hocko
SUSE Labs

2018-10-26 08:52:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

On Wed 22-08-18 10:00:25, Michal Hocko wrote:
> On Wed 22-08-18 01:37:48, Naoya Horiguchi wrote:
> > On Wed, Aug 15, 2018 at 03:43:34PM -0700, Andrew Morton wrote:
> > > On Tue, 17 Jul 2018 14:32:30 +0900 Naoya Horiguchi <[email protected]> wrote:
> > >
> > > > I've updated the patchset based on feedbacks:
> > > >
> > > > - updated comments (from Andrew),
> > > > - moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
> > > > which is necessary to check the return code of set_hwpoison_free_buddy_page(),
> > > > - lkp bot reported a build error when only 1/2 is applied.
> > > >
> > > > > mm/memory-failure.c: In function 'soft_offline_huge_page':
> > > > > >> mm/memory-failure.c:1610:8: error: implicit declaration of function
> > > > > 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
> > > > > [-Werror=implicit-function-declaration]
> > > > > if (set_hwpoison_free_buddy_page(page))
> > > > > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > is_free_buddy_page
> > > > > cc1: some warnings being treated as errors
> > > >
> > > > set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
> > > > in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
> > > > will fix this.
> > > >
> > > > v1: https://lkml.org/lkml/2018/7/12/968
> > > >
> > >
> > > Quite a bit of discussion on these two, but no actual acks or
> > > review-by's?
> >
> > Really sorry for late response.
> > Xishi provided feedback on previous version, but no final ack/reviewed-by.
> > This fix should work on the reported issue, but rewriting soft-offlining
> > without PageHWPoison flag would be the better fix (no actual patch yet.)
>
> If we can go with the later the I would obviously prefer that. I cannot
> promise to work on the patch though. I can help with reviewing of
> course.
>
> If this is important enough that people are hitting the issue in normal
> workloads then sure, let's go with the simple fix and continue on top of
> that.

Naoya, did you have any chance to look at this or have any plans to look?
I am willing to review and help with the overall design but I cannot
really promise to work on the code.
--
Michal Hocko
SUSE Labs

2018-10-30 06:58:42

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

On Fri, Oct 26, 2018 at 10:46:36AM +0200, Michal Hocko wrote:
> On Wed 22-08-18 10:00:25, Michal Hocko wrote:
> > On Wed 22-08-18 01:37:48, Naoya Horiguchi wrote:
> > > On Wed, Aug 15, 2018 at 03:43:34PM -0700, Andrew Morton wrote:
> > > > On Tue, 17 Jul 2018 14:32:30 +0900 Naoya Horiguchi <[email protected]> wrote:
> > > >
> > > > > I've updated the patchset based on feedbacks:
> > > > >
> > > > > - updated comments (from Andrew),
> > > > > - moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
> > > > > which is necessary to check the return code of set_hwpoison_free_buddy_page(),
> > > > > - lkp bot reported a build error when only 1/2 is applied.
> > > > >
> > > > > > mm/memory-failure.c: In function 'soft_offline_huge_page':
> > > > > > >> mm/memory-failure.c:1610:8: error: implicit declaration of function
> > > > > > 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
> > > > > > [-Werror=implicit-function-declaration]
> > > > > > if (set_hwpoison_free_buddy_page(page))
> > > > > > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > is_free_buddy_page
> > > > > > cc1: some warnings being treated as errors
> > > > >
> > > > > set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
> > > > > in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
> > > > > will fix this.
> > > > >
> > > > > v1: https://lkml.org/lkml/2018/7/12/968
> > > > >
> > > >
> > > > Quite a bit of discussion on these two, but no actual acks or
> > > > review-by's?
> > >
> > > Really sorry for late response.
> > > Xishi provided feedback on previous version, but no final ack/reviewed-by.
> > > This fix should work on the reported issue, but rewriting soft-offlining
> > > without PageHWPoison flag would be the better fix (no actual patch yet.)
> >
> > If we can go with the later the I would obviously prefer that. I cannot
> > promise to work on the patch though. I can help with reviewing of
> > course.
> >
> > If this is important enough that people are hitting the issue in normal
> > workloads then sure, let's go with the simple fix and continue on top of
> > that.
>
> Naoya, did you have any chance to look at this or have any plans to look?
> I am willing to review and help with the overal design but I cannot
> really promise to work on the code.

I have a draft version of a patch to isolate a page in a buddy-friendly manner
without PageHWPoison flag (that was written weeks ago, but I couldn't finish
because my other project interrupted me ...).
I'll post it after testing, especially after confirming that the hotplug code
properly resets the isolated page.

Thanks,
Naoya Horiguchi

2018-10-30 08:18:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: soft-offline: fix race against page allocation

On Tue 30-10-18 06:54:33, Naoya Horiguchi wrote:
> On Fri, Oct 26, 2018 at 10:46:36AM +0200, Michal Hocko wrote:
> > On Wed 22-08-18 10:00:25, Michal Hocko wrote:
> > > On Wed 22-08-18 01:37:48, Naoya Horiguchi wrote:
> > > > On Wed, Aug 15, 2018 at 03:43:34PM -0700, Andrew Morton wrote:
> > > > > On Tue, 17 Jul 2018 14:32:30 +0900 Naoya Horiguchi <[email protected]> wrote:
> > > > >
> > > > > > I've updated the patchset based on feedbacks:
> > > > > >
> > > > > > - updated comments (from Andrew),
> > > > > > - moved calling set_hwpoison_free_buddy_page() from mm/migrate.c to mm/memory-failure.c,
> > > > > > which is necessary to check the return code of set_hwpoison_free_buddy_page(),
> > > > > > - lkp bot reported a build error when only 1/2 is applied.
> > > > > >
> > > > > > > mm/memory-failure.c: In function 'soft_offline_huge_page':
> > > > > > > >> mm/memory-failure.c:1610:8: error: implicit declaration of function
> > > > > > > 'set_hwpoison_free_buddy_page'; did you mean 'is_free_buddy_page'?
> > > > > > > [-Werror=implicit-function-declaration]
> > > > > > > if (set_hwpoison_free_buddy_page(page))
> > > > > > > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > is_free_buddy_page
> > > > > > > cc1: some warnings being treated as errors
> > > > > >
> > > > > > set_hwpoison_free_buddy_page() is defined in 2/2, so we can't use it
> > > > > > in 1/2. Simply doing s/set_hwpoison_free_buddy_page/!TestSetPageHWPoison/
> > > > > > will fix this.
> > > > > >
> > > > > > v1: https://lkml.org/lkml/2018/7/12/968
> > > > > >
> > > > >
> > > > > Quite a bit of discussion on these two, but no actual acks or
> > > > > review-by's?
> > > >
> > > > Really sorry for late response.
> > > > Xishi provided feedback on previous version, but no final ack/reviewed-by.
> > > > This fix should work on the reported issue, but rewriting soft-offlining
> > > > without PageHWPoison flag would be the better fix (no actual patch yet.)
> > >
> > > If we can go with the later the I would obviously prefer that. I cannot
> > > promise to work on the patch though. I can help with reviewing of
> > > course.
> > >
> > > If this is important enough that people are hitting the issue in normal
> > > workloads then sure, let's go with the simple fix and continue on top of
> > > that.
> >
> > Naoya, did you have any chance to look at this or have any plans to look?
> > I am willing to review and help with the overal design but I cannot
> > really promise to work on the code.
>
> I have a draft version of a patch to isolate a page in buddy-friendly manner
> without PageHWPoison flag (that was written weeks ago, but I couldn't finish
> because my other project interrupted me ...).
> I'll post it after testing, especially confirming that hotplug code properly
> reset the isolated page.

Thanks a lot Naoya. It is highly appreciated!
--
Michal Hocko
SUSE Labs