2022-06-30 03:04:45

by Naoya Horiguchi

Subject: [PATCH v3 0/9] mm, hwpoison: enable 1GB hugepage support (v3)

Here is v3 of "enabling memory error handling on 1GB hugepage" patchset.

I applied the feedback on v2, thank you very much. The overall design
of the patchset is unchanged. Please see individual patches for details
about the changes.

Patch dependency:

- "mm/memory-failure: disable unpoison once hw error happens"
(actually the conflict is not a logical one; adding MF_SW_SIMULATED to
mf_flags conflicts with patch 6/9.)

Previous versions:

- v1: https://lore.kernel.org/linux-mm/[email protected]/T/#u
- v2: https://lore.kernel.org/linux-mm/[email protected]/T/#u

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (9):
mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
mm, hwpoison, hugetlb: support saving mechanism of raw error pages
mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
mm, hwpoison: set PG_hwpoison for busy hugetlb pages
mm, hwpoison: make __page_handle_poison returns int
mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
mm, hwpoison: enable memory error handling on 1GB hugepage

arch/x86/mm/hugetlbpage.c |   8 ++-
include/linux/hugetlb.h   |  18 +++++-
include/linux/mm.h        |   2 +-
include/linux/swapops.h   |   9 +++
include/ras/ras_event.h   |   1 -
mm/hugetlb.c              |  99 ++++++++++++++++++++++---------
mm/memory-failure.c       | 148 ++++++++++++++++++++++++++++++++++++----------
7 files changed, 224 insertions(+), 61 deletions(-)


2022-06-30 03:04:51

by Naoya Horiguchi

Subject: [PATCH v3 8/9] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage

From: Naoya Horiguchi <[email protected]>

Currently, if memory_failure() (modified by a subsequent patch to remove
the blocking code) is called on a page in some 1GB hugepage, memory
error handling fails and the raw error page gets into a leaked state.
The impact is small in production systems (just a single leaked 4kB
page), but it limits testability because unpoison doesn't work for such
a page. We can no longer create a 1GB hugepage on the 1GB physical
address range containing such a leaked page, which is inconvenient when
testing on small systems.
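
For illustration, error injection into a 1GB hugepage can be driven from
userspace roughly like the sketch below (hypothetical test code, not
part of this series; it assumes root, since MADV_HWPOISON needs
CAP_SYS_ADMIN, CONFIG_MEMORY_FAILURE=y, and a free 1GB hugepage in the
pool; the fallback #defines cover older userspace headers):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* 30 == log2(1GB), 26 == MAP_HUGE_SHIFT */
#endif
#ifndef MADV_HWPOISON
#define MADV_HWPOISON	100
#endif

#define GB	(1UL << 30)

int main(void)
{
	/* map and fault in one 1GB hugepage */
	char *p = mmap(NULL, GB, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 'x', GB);

	/* poison one raw 4kB page inside the hugepage; the kernel side
	 * of this ends up in memory_failure() */
	if (madvise(p + GB / 2, 4096, MADV_HWPOISON))
		perror("madvise(MADV_HWPOISON)");

	munmap(p, GB);
	return 0;
}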

When a hwpoisoned page in a 1GB hugepage is handled, it is caught by
the PageHWPoison check in free_pages_prepare(), because the 1GB
hugepage is broken down into raw error pages before reaching this
point:

	if (unlikely(PageHWPoison(page)) && !order) {
		...
		return false;
	}

The page is then not sent to buddy and its refcount is left at 0.

Originally this check was supposed to trigger when the error page is
freed from page_handle_poison() (which is called from soft-offline),
but now we are opening another path into it, so the callers of
__page_handle_poison() need to handle this case by treating a return
value of 0 as success. Then the page refcount for hwpoison is properly
incremented, so unpoison works.
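
In other words, the intended lifecycle is roughly the following (my
summary of the flow, not code quoted from the patch):

/*
 * hard offline:
 *   memory_failure(pfn)
 *     -> the 1GB hugepage is dissolved into raw pages
 *     -> __page_handle_poison(p) returns >= 0 (0 is now also success:
 *        free_pages_prepare() has already kept the page off buddy)
 *     -> page_ref_inc(p): the raw error page is held with refcount 1
 *
 * unpoison (testing only):
 *   unpoison_memory(pfn)
 *     -> drops that reference and clears PG_hwpoison, so the raw page
 *        can be freed and the 1GB range can host a hugepage again
 */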

Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
---
v2 -> v3:
- remove "res = MF_FAILED" in try_memory_failure_hugetlb (by Miaohe)
---
mm/memory-failure.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c7b9129eade..6a9b11807ee6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1046,7 +1046,6 @@ static int me_huge_page(struct page_state *ps, struct page *p)
res = truncate_error_page(hpage, page_to_pfn(p), mapping);
unlock_page(hpage);
} else {
- res = MF_FAILED;
unlock_page(hpage);
/*
* migration entry prevents later access on error hugepage,
@@ -1054,9 +1053,11 @@ static int me_huge_page(struct page_state *ps, struct page *p)
* subpages.
*/
put_page(hpage);
- if (__page_handle_poison(p) > 0) {
+ if (__page_handle_poison(p) >= 0) {
page_ref_inc(p);
res = MF_RECOVERED;
+ } else {
+ res = MF_FAILED;
}
}

@@ -1689,10 +1690,11 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
*/
if (res == 0) {
unlock_page(head);
- res = MF_FAILED;
- if (__page_handle_poison(p) > 0) {
+ if (__page_handle_poison(p) >= 0) {
page_ref_inc(p);
res = MF_RECOVERED;
+ } else {
+ res = MF_FAILED;
}
action_result(pfn, MF_MSG_FREE_HUGE, res);
return res == MF_RECOVERED ? 0 : -EBUSY;
--
2.25.1

2022-06-30 03:04:59

by Naoya Horiguchi

Subject: [PATCH v3 9/9] mm, hwpoison: enable memory error handling on 1GB hugepage

From: Naoya Horiguchi <[email protected]>

Now that the error handling code is prepared, remove the blocking code
and enable memory error handling on 1GB hugepages.

Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
---
include/linux/mm.h | 1 -
include/ras/ras_event.h | 1 -
mm/memory-failure.c | 16 ----------------
3 files changed, 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 044dc5a2e361..9d7e9b5a4d1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3284,7 +3284,6 @@ enum mf_action_page_type {
MF_MSG_DIFFERENT_COMPOUND,
MF_MSG_HUGE,
MF_MSG_FREE_HUGE,
- MF_MSG_NON_PMD_HUGE,
MF_MSG_UNMAP_FAILED,
MF_MSG_DIRTY_SWAPCACHE,
MF_MSG_CLEAN_SWAPCACHE,
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index d0337a41141c..cbd3ddd7c33d 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -360,7 +360,6 @@ TRACE_EVENT(aer_event,
EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
EM ( MF_MSG_HUGE, "huge page" ) \
EM ( MF_MSG_FREE_HUGE, "free huge page" ) \
- EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" ) \
EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" ) \
EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" ) \
EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" ) \
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6a9b11807ee6..6ee5cbe0cbab 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -728,7 +728,6 @@ static const char * const action_page_types[] = {
[MF_MSG_DIFFERENT_COMPOUND] = "different compound page after locking",
[MF_MSG_HUGE] = "huge page",
[MF_MSG_FREE_HUGE] = "free huge page",
- [MF_MSG_NON_PMD_HUGE] = "non-pmd-sized huge page",
[MF_MSG_UNMAP_FAILED] = "unmapping failed page",
[MF_MSG_DIRTY_SWAPCACHE] = "dirty swapcache page",
[MF_MSG_CLEAN_SWAPCACHE] = "clean swapcache page",
@@ -1702,21 +1701,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb

page_flags = head->flags;

- /*
- * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
- * simply disable it. In order to make it work properly, we need
- * make sure that:
- * - conversion of a pud that maps an error hugetlb into hwpoison
- * entry properly works, and
- * - other mm code walking over page table is aware of pud-aligned
- * hwpoison entries.
- */
- if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
- action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
- res = -EBUSY;
- goto out;
- }
-
if (!hwpoison_user_mappings(p, pfn, flags, head)) {
action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
res = -EBUSY;
--
2.25.1

2022-06-30 03:05:38

by Naoya Horiguchi

Subject: [PATCH v3 1/9] mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()

From: Naoya Horiguchi <[email protected]>

I found a weird state of the 1GB hugepage pool, caused by the
following procedure:

- run a process reserving all free 1GB hugepages,
- shrink free 1GB hugepage pool to zero (i.e. writing 0 to
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
- kill the reserving process.

Then all the hugepages are free *and* surplus at the same time:

$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
3
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
3
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
0
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
3

This state is resolved by reserving and allocating the pages and then
freeing them again, so it does not seem to cause a serious problem. But
it is a little surprising (shrinking the pool suddenly fails).

This behavior is caused by the hstate_is_gigantic() check in
return_unused_surplus_pages(). The check was introduced long ago, in
2008, by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"),
when gigantic pages were not supposed to be allocated or freed at
run-time. The kernel can now support runtime allocation/freeing, so
let's also check gigantic_page_runtime_supported().
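
(A sketch of the resulting behavior, derived from the hunks below; the
helper itself is arch-dependent, e.g. powerpc can report that runtime
modification is unsupported:)

/*
 * When gigantic_page_runtime_supported() is false:
 *
 *   return_unused_surplus_pages()      -> still bails out early, so
 *                                         surplus gigantic pages stay
 *                                         in the pool
 *   set_max_huge_pages(count):
 *     count > persistent_huge_pages()  -> fails as before (cannot grow
 *                                         the pool at runtime)
 *     count < persistent_huge_pages()  -> now fails with -EINVAL
 *                                         (cannot shrink it either)
 *
 * When it is true, hstate_is_gigantic() alone no longer blocks
 * return_unused_surplus_pages(), which resolves the free==surplus
 * state shown above.
 */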

Signed-off-by: Naoya Horiguchi <[email protected]>
---
v2 -> v3:
- Fixed typo in patch description,
- add !gigantic_page_runtime_supported() check instead of removing
hstate_is_gigantic() check (suggested by Miaohe and Muchun)
- add a few more !gigantic_page_runtime_supported() checks in
set_max_huge_pages() (by Mike).
---
mm/hugetlb.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a57e1be41401..e67540dbbf88 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2432,8 +2432,7 @@ static void return_unused_surplus_pages(struct hstate *h,
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;

- /* Cannot return gigantic pages currently */
- if (hstate_is_gigantic(h))
+ if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
goto out;

/*
@@ -3318,7 +3317,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* the user tries to allocate gigantic pages but let the user free the
* boottime allocated gigantic pages.
*/
- if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
+ if (hstate_is_gigantic(h) && (!IS_ENABLED(CONFIG_CONTIG_ALLOC) ||
+ !gigantic_page_runtime_supported())) {
if (count > persistent_huge_pages(h)) {
spin_unlock_irq(&hugetlb_lock);
mutex_unlock(&h->resize_lock);
@@ -3366,6 +3366,19 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
goto out;
}

+ /*
+ * We can not decrease gigantic pool size if runtime modification
+ * is not supported.
+ */
+ if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) {
+ if (count < persistent_huge_pages(h)) {
+ spin_unlock_irq(&hugetlb_lock);
+ mutex_unlock(&h->resize_lock);
+ NODEMASK_FREE(node_alloc_noretry);
+ return -EINVAL;
+ }
+ }
+
/*
* Decrease the pool size
* First return free pages to the buddy allocator (being careful
--
2.25.1

2022-06-30 03:05:40

by Naoya Horiguchi

Subject: [PATCH v3 3/9] mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry

From: Naoya Horiguchi <[email protected]>

follow_pud_mask() does not support non-present pud entries now. As far
as I tested on an x86_64 server, follow_pud_mask() still simply returns
no_page_table() for a non-present pud entry due to pud_bad(), so no
severe user-visible effect should happen. But generally we should call
follow_huge_pud() for a non-present pud entry of a 1GB hugetlb page.

Update pud_huge() and follow_huge_pud() to handle non-present pud entries.
The changes are similar to previous work on pmd entries: commit
e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
and commit cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for
non-present hugepage").
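
For readability, the new x86 pud_huge() check below accepts exactly
these cases (my summary, derived from the hunk):

/*
 * pud entry                           _PAGE_PRESENT _PAGE_PSE pud_huge()
 * ----------------------------------------------------------------------
 * pud_none()                                -           -         0
 * present, points to a pmd page table       1           0         0
 * present 1GB hugetlb mapping               1           1         1
 * non-present hugetlb entry                 0         (any)       1
 *   (migration or hwpoison swap entry)
 */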

Signed-off-by: Naoya Horiguchi <[email protected]>
---
v2 -> v3:
- fixed typos in subject and description,
- added comment on pud_huge(),
- added comment about fallback for hwpoisoned entry,
- updated initial check about FOLL_{PIN,GET} flags.
---
arch/x86/mm/hugetlbpage.c | 8 +++++++-
mm/hugetlb.c | 32 ++++++++++++++++++++++++++++++--
2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index a0d023cb4292..9c8193df0187 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -68,9 +68,15 @@ int pmd_huge(pmd_t pmd)
(pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
}

+/*
+ * pud_huge() returns 1 if @pud is hugetlb related entry, that is normal
+ * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
+ * Otherwise, returns 0.
+ */
int pud_huge(pud_t pud)
{
- return !!(pud_val(pud) & _PAGE_PSE);
+ return !pud_none(pud) &&
+ (pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
}
#endif

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 03fdf342f5f2..4c30f3fcfe50 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6963,10 +6963,38 @@ struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags)
{
- if (flags & (FOLL_GET | FOLL_PIN))
+ struct page *page = NULL;
+ spinlock_t *ptl;
+ pte_t pte;
+
+ if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;

- return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+retry:
+ ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
+ if (!pud_huge(*pud))
+ goto out;
+ pte = huge_ptep_get((pte_t *)pud);
+ if (pte_present(pte)) {
+ page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+ page = NULL;
+ goto out;
+ }
+ } else {
+ if (is_hugetlb_entry_migration(pte)) {
+ spin_unlock(ptl);
+ __migration_entry_wait(mm, (pte_t *)pud, ptl);
+ goto retry;
+ }
+ /*
+ * hwpoisoned entry is treated as no_page_table in
+ * follow_page_mask().
+ */
+ }
+out:
+ spin_unlock(ptl);
+ return page;
}

struct page * __weak
--
2.25.1

2022-06-30 03:06:34

by Naoya Horiguchi

Subject: [PATCH v3 7/9] mm, hwpoison: make __page_handle_poison returns int

From: Naoya Horiguchi <[email protected]>

__page_handle_poison() currently returns a bool that shows whether
take_page_off_buddy() has succeeded or not. But we will want its return
value to also distinguish the case of "dissolve has succeeded but
taking off failed". So change the type of the return value.
No functional change: callers still treat only a positive return value
as success here; the next patch relaxes them to also accept 0.

Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
---
v2 -> v3:
- move deleting "res = MF_FAILED" to the later patch. (by Miaohe)
---
mm/memory-failure.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 81e591cf8b4a..6c7b9129eade 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -71,7 +71,13 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);

static bool hw_memory_failure __read_mostly = false;

-static bool __page_handle_poison(struct page *page)
+/*
+ * Return values:
+ * 1: the page is dissolved (if needed) and taken off from buddy,
+ * 0: the page is dissolved (if needed) and not taken off from buddy,
+ * < 0: failed to dissolve.
+ */
+static int __page_handle_poison(struct page *page)
{
int ret;

@@ -81,7 +87,7 @@ static bool __page_handle_poison(struct page *page)
ret = take_page_off_buddy(page);
zone_pcp_enable(page_zone(page));

- return ret > 0;
+ return ret;
}

static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
@@ -91,7 +97,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
* Doing this check for free pages is also fine since dissolve_free_huge_page
* returns 0 for non-hugetlb pages as well.
*/
- if (!__page_handle_poison(page))
+ if (__page_handle_poison(page) <= 0)
/*
* We could fail to take off the target page from buddy
* for example due to racy page allocation, but that's
@@ -1048,7 +1054,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
* subpages.
*/
put_page(hpage);
- if (__page_handle_poison(p)) {
+ if (__page_handle_poison(p) > 0) {
page_ref_inc(p);
res = MF_RECOVERED;
}
@@ -1684,7 +1690,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
if (res == 0) {
unlock_page(head);
res = MF_FAILED;
- if (__page_handle_poison(p)) {
+ if (__page_handle_poison(p) > 0) {
page_ref_inc(p);
res = MF_RECOVERED;
}
--
2.25.1

2022-06-30 03:08:41

by Naoya Horiguchi

Subject: [PATCH v3 6/9] mm, hwpoison: set PG_hwpoison for busy hugetlb pages

From: Naoya Horiguchi <[email protected]>

If memory_failure() fails to grab the page refcount on a hugetlb page
because it's busy, it returns without setting PG_hwpoison on it.
This not only loses a chance of error containment, but also breaks the
rule that action_result() should be called only when memory_failure()
does some handling work (even if that's just setting PG_hwpoison).
This inconsistency could harm code maintainability.

So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.
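
The resulting control flow with the new flag looks like this (my
paraphrase of the hunks below):

/*
 * try_memory_failure_hugetlb()
 *   retry:
 *     __get_huge_page_for_hwpoison() -> -EBUSY (refcount not grabbed)
 *   first attempt:  set MF_NO_RETRY in flags and goto retry
 *   second attempt: __get_huge_page_for_hwpoison() no longer bails out
 *                   on the busy page; it falls through and calls
 *                   hugetlb_set_page_hwpoison(), so PG_hwpoison is set
 *                   even though the refcount could not be taken.
 */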

Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory-failure.c | 8 ++++----
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4346e51484ba..044dc5a2e361 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3233,6 +3233,7 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
+ MF_NO_RETRY = 1 << 6,
};
extern int memory_failure(unsigned long pfn, int flags);
extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dc90bfadb1dd..81e591cf8b4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1617,7 +1617,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
count_increased = true;
} else {
ret = -EBUSY;
- goto out;
+ if (!(flags & MF_NO_RETRY))
+ goto out;
}

if (hugetlb_set_page_hwpoison(head, page)) {
@@ -1644,7 +1645,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
struct page *p = pfn_to_page(pfn);
struct page *head;
unsigned long page_flags;
- bool retry = true;

*hugetlb = 1;
retry:
@@ -1660,8 +1660,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
}
return res;
} else if (res == -EBUSY) {
- if (retry) {
- retry = false;
+ if (!(flags & MF_NO_RETRY)) {
+ flags |= MF_NO_RETRY;
goto retry;
}
action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
--
2.25.1