2023-04-28 00:43:31

by Jiaqi Yan

Subject: [RFC PATCH v1 0/7] PAGE_SIZE Unmapping in Memory Failure Recovery for HugeTLB Pages

Goal
====
Currently, once a byte in a HugeTLB hugepage becomes HWPOISON, the whole
hugepage is unmapped from the page table, because that is the finest
granularity of the mapping.

High granularity mapping (HGM) [1], the functionality to map memory
addresses at finer granularities (in the extreme case, PAGE_SIZE), was
recently proposed upstream and provides the opportunity to handle memory
errors more efficiently: instead of unmapping the whole hugepage, only the
raw HWPOISON subpage needs to be thrown away, while all the healthy
subpages remain available to users.

Idea
====
Today memory failure recovery for HugeTLB pages (hugepages) differs from
that for raw and THP pages. We are only interested in in-use hugepages,
which are handled in these simplified steps:
1. Increment the refcount on the compound head of the hugepage.
2. Insert the raw HWPOISON page to the compound head’s raw_hwp_list
(_hugetlb_hwpoison) if it is not already in the list.
3. Unmap the entire hugepage from HugeTLB’s page table.
4. Kill the processes that are accessing the poisoned hugepage.

HGM can greatly improve this recovery mechanism. Step #3 (unmapping the
entire hugepage) can be replaced by
3.1 Map the entire hugepage at finer granularity, so that the exact
HWPOISON address is mapped by a PAGE_SIZE PTE, and the rest of the
address range is optimally mapped by either smaller P*Ds or PTEs. In
other words, the original HugeTLB PTE is split into smaller P*Ds
and PTEs.
3.2 Only unmap the newly mapped PTE that maps the HWPOISON address.

For shared mappings, the current HGM patches are already a solid basis for
the splitting functionality in step #3.1. This RFC drafts a complete
solution for shared mappings. The splitting-based idea can be applied to
private mappings as well, but additional subtle complexity needs to be
dealt with. We defer the private mapping case as future work.
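
To make the intended behavior concrete, here is a minimal userspace sketch
(not part of this series; it mirrors what the selftest in patch 6 does). It
assumes CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING is enabled, a hugetlbfs
mount at the hypothetical path /mnt/hugetlbfs, a 2M hugepage size, and
CAP_SYS_ADMIN for MADV_HWPOISON; error handling is omitted. Today both
accesses at the end raise SIGBUS; with this series, only the access to the
poisoned raw page does:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
        /* BUS_MCEERR_AR is expected only for the poisoned raw page. */
        printf("SIGBUS: si_code=%d addr=%p\n", info->si_code, info->si_addr);
        _exit(1);
}

int main(void)
{
        const size_t len = 2UL << 20;   /* one 2M hugepage (size is arbitrary) */
        const long pagesz = sysconf(_SC_PAGESIZE);
        struct sigaction sa = { .sa_sigaction = sigbus_handler,
                                .sa_flags = SA_SIGINFO };
        /* Hypothetical hugetlbfs mount point; error handling omitted. */
        int fd = open("/mnt/hugetlbfs/test", O_CREAT | O_RDWR, 0600);
        char *map;

        sigaction(SIGBUS, &sa, NULL);
        ftruncate(fd, len);
        map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memset(map, 1, len);            /* fault the whole hugepage in */

        /* Needs CAP_SYS_ADMIN: simulate a memory error on one raw page. */
        madvise(map + pagesz, pagesz, MADV_HWPOISON);

        /* Healthy raw pages stay mapped and keep their data ... */
        printf("healthy byte: %d\n", map[0]);
        /* ... while touching the poisoned raw page raises BUS_MCEERR_AR. */
        printf("poisoned byte: %d\n", map[pagesz]);
        return 0;
}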

Splitting HugeTLB PTEs (Step #3.1)
==================================
The general process of splitting a present leaf HugeTLB PTE is (see the
sketch at the end of this section):
1. Get and clear the original HugeTLB PTE old_pte.
2. Initialize curr to the start of the address range covered by old_pte.
3. Find the optimal level that curr should be mapped at.
4. Perform an HGM walk on curr with the optimal level found in step 3,
potentially allocating a new PTE at the optimal level.
5. Populate the newly allocated PTE with bits from old_pte, including
dirty, write, and UFFD_WP.
6. Update curr += the newly created PTE size, and repeat from step 3 until
the entire range covered by old_pte is split.

Splitting the hugepage mapping is not meaningful for none-mostly PTEs, i.e.
none PTEs and userfaultfd write protect (UFFD_WP) marker PTEs; those are
handled at page-fault time instead. Migration and HWPOISON PTEs are better
left untouched.
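
For reference, a condensed sketch of the splitting loop, taken from
hugetlb_split_to_shift()/hugetlb_split_to_shift_present_leaf() in patch 1;
locking, rmap/refcount bookkeeping and error handling are omitted, and the
local variable declarations are assumed:

        /* Step 1: detach the original mapping. */
        old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);

        /* Steps 2-6: rebuild [start, end) with finer-grained PTEs. */
        for (curr = start; curr < end; curr += sz) {
                /*
                 * Step 3: largest aligned size; desired_shift (PAGE_SHIFT
                 * for HWPOISON) at the poisoned address @addr.
                 */
                hugetlb_find_shift(vma, curr, end, addr, desired_shift, &shift);
                sz = 1UL << shift;

                /* Step 4: walk (and allocate) down to the chosen level. */
                hugetlb_full_walk_alloc(&new_hpte, vma, curr, sz);

                /* Step 5: copy dirty/write/UFFD_WP bits from old_entry. */
                subpage = hugetlb_find_subpage(h, folio, curr);
                new_entry = make_huge_pte(vma, subpage,
                                          huge_pte_write(old_entry), shift);
                if (huge_pte_dirty(old_entry))
                        new_entry = huge_pte_mkdirty(new_entry);
                if (huge_pte_uffd_wp(old_entry))
                        new_entry = huge_pte_mkuffd_wp(new_entry);
                set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
        }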

Memory Failure Recovery and Unmapping (Step #3.2)
=================================================
A few changes are made in memory_failure and rmap to only unmap raw
HWPOISON pages (see the sketch after this list):
1. As long as HGM is turned on in CONFIG, memory_failure attempts to enable
HGM on each VMA containing the poisoned hugepage.
2. memory_failure attempts to split the HugeTLB PTE so that the poisoned
address is mapped by a PAGE_SIZE PTE, for all the VMAs containing the
poisoned hugepage.
3. get_huge_page_for_hwpoison only returns -EHWPOISON if the raw page is
already in the compound head's raw_hwp_list. This makes unmapping work
correctly when multiple raw pages in the same hugepage become HWPOISON.
4. rmap utilizes the compound head's raw_hwp_list to 1) avoid unmapping raw
pages not in the list, and 2) keep track of whether the raw pages in the
list are already unmapped.
5. The page refcount check in me_huge_page is skipped.
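
For reference, the bookkeeping added in patch 4 boils down to the following
condensed sketch (split_ok is shorthand for hugetlb_split_to_shift()
returning 0, not a real variable in the patches):

        /* try_to_split_huge_mapping(): for each VMA that maps the folio */
        ++hwp_page->nr_vmas_mapped;
        if (split_ok)
                ++hwp_page->nr_expected_unmaps;

        /* try_to_unmap_one(): after hugetlb_remove_rmap() on the raw page */
        ++hwp_page->nr_actual_unmaps;

        /* is_unmapping_successful() */
        unmap_success = (hwp_page->nr_expected_unmaps ==
                         hwp_page->nr_actual_unmaps);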

Between mmap() and Page Fault
=============================
A memory error can occur between the time userspace maps a hugepage and
the time userspace faults in the mapped hugepage. The general idea is to
not create any raw-page-size page table entry for the HWPOISON memory,
while keeping memory in healthy raw pages available to userspace (via
normal fault handling). At the time of hugetlb_no_page (a condensed sketch
follows this list):
- If the entire hugepage doesn't contain any HWPOISON page, the normal
page fault handler continues.
- If the memory address being faulted is within a HWPOISON raw page,
hugetlb_no_page returns VM_FAULT_HWPOISON_LARGE (so that the page fault
handler sends a BUS_MCEERR_AR SIGBUS to the faulting process).
- If the memory address being faulted is within a healthy raw page,
hugetlb_no_page utilizes HGM to create a new HugeTLB PTE whose
hugetlb_pte_size is as large as possible while not mapping any HWPOISON
address. Then the normal page fault handler continues.
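
A condensed sketch of this check (hugetlb_no_page_hwpoison() in patch 5;
pte_range_covers_hwpoison() is shorthand for the raw_hwp_list scan in the
real code, and VM_FAULT_SET_HINDEX is omitted):

        if (likely(!folio_test_hwpoison(folio)))
                return 0;                       /* no HWPOISON: normal fault */

        if (hugetlb_enable_hgm_vma(vma))
                return VM_FAULT_HWPOISON_LARGE; /* no HGM: old behavior */

        while (pte_range_covers_hwpoison(folio, hpte, address)) {
                if (hugetlb_pte_size(hpte) == PAGE_SIZE)
                        return VM_FAULT_HWPOISON; /* faulting raw page poisoned */
                /* Split one level down and re-check the smaller range. */
                if (hugetlb_full_walk_alloc(hpte, vma, address,
                                            hgm_next_size(vma, hpte)))
                        return VM_FAULT_HWPOISON_LARGE;
        }
        return 0;                               /* healthy raw page: continue */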

Failure Handling
================
- If the kernel still fails to allocate a new raw_hwp_page after a retry,
memory_failure returns MF_IGNORED with MF_MSG_UNKNOWN.
- For each VMA that maps the HWPOISON hugepage:
  - If the VMA is not eligible for HGM, the old behavior is taken: unmap
    the entire hugepage from that VMA.
  - If memory_failure fails to enable HGM on the VMA, or fails to split
    the HugeTLB PTE in any VMA that maps the HWPOISON page, the recovery
    returns MF_IGNORED with MF_MSG_UNMAP_FAILED.
  - For a particular VMA, if splitting the HugeTLB PTE fails, the original
    PTE is restored to the page table.

Code Changes
============
The code patches in this RFC are based on the HGM patchset v2 [1] and are
composed of two parts. The first part implements the idea laid out in this
cover letter; the second part tests two major scenarios: HWPOISON on
already-faulted pages and HWPOISON between mmap() and page fault.

Future Changes
==============
There is a pending improvement to hugetlbfs_read_iter. If a hugepage is
found in the page cache and it contains HWPOISON subpages, today the kernel
returns -EIO immediately. With the new split-then-unmap behavior, the
kernel can return to userspace every byte up to the first raw HWPOISON
byte. If userspace asks the read to start within a raw HWPOISON page, the
kernel still has to return -EIO. This improvement and its selftest will be
done in a future patch series.
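
As a sketch of how userspace could consume that behavior once it lands
(assuming only the read semantics described above; not part of this
series):

#include <errno.h>
#include <unistd.h>

/*
 * Sketch only: read from a hugetlbfs fd until EOF or the first HWPOISON
 * byte, assuming the future read semantics described above.
 */
static ssize_t read_until_poison(int fd, char *buf, size_t len)
{
        ssize_t total = 0;

        while ((size_t)total < len) {
                ssize_t n = pread(fd, buf + total, len - total, total);

                if (n > 0)
                        total += n;     /* healthy bytes */
                else if (n == 0)
                        break;          /* EOF */
                else if (errno == EIO)
                        break;          /* next byte is in a HWPOISON page */
                else
                        return -1;      /* unrelated error */
        }
        return total;
}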

[1] https://lore.kernel.org/all/[email protected]/

Jiaqi Yan (7):
hugetlb: add HugeTLB splitting functionality
hugetlb: create PTE level mapping when possible
mm: publish raw_hwp_page in mm.h
mm/memory_failure: unmap raw HWPoison PTEs when possible
hugetlb: only VM_FAULT_HWPOISON_LARGE raw page
selftest/mm: test PAGESIZE unmapping HWPOISON pages
selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

include/linux/hugetlb.h | 14 +
include/linux/mm.h | 36 ++
mm/hugetlb.c | 405 ++++++++++++++++++++++-
mm/memory-failure.c | 206 ++++++++++--
mm/rmap.c | 38 ++-
tools/testing/selftests/mm/hugetlb-hgm.c | 364 ++++++++++++++++++--
6 files changed, 1004 insertions(+), 59 deletions(-)

--
2.40.1.495.gc816e09b53d-goog


2023-04-28 00:43:38

by Jiaqi Yan

Subject: [RFC PATCH v1 2/7] hugetlb: create PTE level mapping when possible

In memory_failure handling, for each VMA that the HWPOISON HugeTLB
page is mapped into, enable HGM if eligible, then split the P*D-mapped
hugepage into smaller PTEs. try_to_unmap still unmaps the entire hugetlb
page, one PTE at a time, at levels smaller than the original P*D.
For example, if a hugepage was originally mapped at PUD size, it will
be split into PMDs and PTEs, and all of these PMDs and PTEs will
be unmapped. The next commit will only unmap the raw HWPOISON PTE.

For VMAs that are not HGM eligible, or where enabling HGM or splitting
the hugepage mapping failed, the hugepage stays mapped by its original
P*D and is unmapped at that P*D.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/hugetlb.h | 5 +++
mm/hugetlb.c | 27 ++++++++++++++++
mm/memory-failure.c | 68 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 100 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d44bf6a794e5..03074b23c396 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1266,6 +1266,7 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
unsigned long end);
int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
unsigned long end);
+int hugetlb_enable_hgm_vma(struct vm_area_struct *vma);
int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
struct hugetlb_pte *hpte, unsigned long addr,
unsigned int desired_shift);
@@ -1295,6 +1296,10 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
{
return -EINVAL;
}
+int hugetlb_enable_hgm_vma(struct vm_area_struct *vma)
+{
+ return -EINVAL;
+}
int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
const struct hugetlb_pte *hpte, unsigned long addr,
unsigned int desired_shift)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d3f3f1c2d293..1419176b7e51 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -8203,6 +8203,33 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
return ret;
}

+int hugetlb_enable_hgm_vma(struct vm_area_struct *vma)
+{
+ if (hugetlb_hgm_enabled(vma))
+ return 0;
+
+ if (!is_vm_hugetlb_page(vma)) {
+ pr_warn("VMA=[%#lx, %#lx) is not HugeTLB\n",
+ vma->vm_start, vma->vm_end);
+ return -EINVAL;
+ }
+
+ if (!hugetlb_hgm_eligible(vma)) {
+ pr_warn("VMA=[%#lx, %#lx) is not HGM eligible\n",
+ vma->vm_start, vma->vm_end);
+ return -EINVAL;
+ }
+
+ hugetlb_unshare_all_pmds(vma);
+
+ /*
+ * TODO: add the ability to tell if HGM is enabled by kernel
+ * (for HWPOISON unmapping) or by userspace (via MADV_SPLIT).
+ */
+ vm_flags_set(vma, VM_HUGETLB_HGM);
+ return 0;
+}
+
/*
* Find the optimal HugeTLB PTE shift that @desired_addr could be mapped at.
*/
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0b37cbc6e8ae..eb5579b6787e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1479,6 +1479,73 @@ static int get_hwpoison_page(struct page *p, unsigned long flags)
return ret;
}

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+/*
+ * For each HGM-eligible VMA that the poisoned page mapped to, create new
+ * HGM mapping for hugepage @folio and make sure @poisoned_page is mapped
+ * by a PAGESIZE level PTE. Caller (hwpoison_user_mappings) must ensure
+ * 1. folio's address space (mapping) is locked in write mode.
+ * 2. folio is locked.
+ */
+static void try_to_split_huge_mapping(struct folio *folio,
+ struct page *poisoned_page)
+{
+ struct address_space *mapping = folio_mapping(folio);
+ pgoff_t pgoff_start;
+ pgoff_t pgoff_end;
+ struct vm_area_struct *vma;
+ unsigned long poisoned_addr;
+ unsigned long head_addr;
+ struct hugetlb_pte hpte;
+
+ if (WARN_ON(!mapping))
+ return;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+ pgoff_start = folio_pgoff(folio);
+ pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
+
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
+ /* Enable HGM on HGM-eligible VMAs. */
+ if (!hugetlb_hgm_eligible(vma))
+ continue;
+
+ i_mmap_assert_locked(vma->vm_file->f_mapping);
+ if (hugetlb_enable_hgm_vma(vma)) {
+ pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
+ vma->vm_start, vma->vm_end);
+ continue;
+ }
+
+ poisoned_addr = vma_address(poisoned_page, vma);
+ head_addr = vma_address(folio_page(folio, 0), vma);
+ /*
+ * Get the hugetlb_pte of the PUD-mapped hugepage first,
+ * then split the PUD entry into PMD + PTE entries.
+ *
+ * Both getting original huge PTE and splitting requires write
+ * lock on vma->vm_file->f_mapping, which caller
+ * (e.g. hwpoison_user_mappings) should have already acquired.
+ */
+ if (hugetlb_full_walk(&hpte, vma, head_addr))
+ continue;
+
+ if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte,
+ poisoned_addr, PAGE_SHIFT)) {
+ pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n",
+ page_to_pfn(poisoned_page), poisoned_addr,
+ vma->vm_start, vma->vm_end);
+ }
+ }
+}
+#else
+static void try_to_split_huge_mapping(struct folio *folio,
+ struct page *poisoned_page)
+{
+}
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
@@ -1555,6 +1622,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
*/
mapping = hugetlb_page_mapping_lock_write(hpage);
if (mapping) {
+ try_to_split_huge_mapping(folio, p);
try_to_unmap(folio, ttu|TTU_RMAP_LOCKED);
i_mmap_unlock_write(mapping);
} else
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:43:43

by Jiaqi Yan

Subject: [RFC PATCH v1 3/7] mm: publish raw_hwp_page in mm.h

raw_hwp_page will be needed by HugeTLB to determine if a raw subpage
in a hugepage is poisoned, and therefore should be unmapped or should
not be faulted in at the PAGE_SIZE PTE level.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/mm.h | 16 ++++++++++++++++
mm/memory-failure.c | 13 -------------
2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d3216b4284a..4496d7bdd3ea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3522,6 +3522,22 @@ enum mf_action_page_type {
*/
extern const struct attribute_group memory_failure_attr_group;

+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Struct raw_hwp_page represents information about "raw error page",
+ * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
+ */
+struct raw_hwp_page {
+ struct llist_node node;
+ struct page *page;
+};
+
+static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+{
+ return (struct llist_head *)&folio->_hugetlb_hwpoison;
+}
+#endif
+
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
extern void clear_huge_page(struct page *page,
unsigned long addr_hint,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eb5579b6787e..48e62d04af17 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1826,19 +1826,6 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
#endif /* CONFIG_FS_DAX */

#ifdef CONFIG_HUGETLB_PAGE
-/*
- * Struct raw_hwp_page represents information about "raw error page",
- * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
- */
-struct raw_hwp_page {
- struct llist_node node;
- struct page *page;
-};
-
-static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
-{
- return (struct llist_head *)&folio->_hugetlb_hwpoison;
-}

static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
{
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:05

by Jiaqi Yan

Subject: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
the raw HWPOISON page (previously split and mapped at PTE size).
If HGM failed to be enabled on an eligible VMA or splitting failed,
try_to_unmap_one fails.

For VMAs that are not HGM eligible, try_to_unmap_one still unmaps
the whole P*D.

When only the raw HWPOISON subpage is unmapped and the other subpages
remain mapped, the old way memory_failure checks whether unmapping
succeeded no longer works. So introduce is_unmapping_successful() to
cover both the existing and the new unmapping behavior.

For the new unmapping behavior, store how many times a raw HWPOISON page
is expected to be unmapped, and how many times it is actually unmapped
in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
unmap_success = (nr_expected_unmaps == nr_actual_unmaps).

The old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
recovery actions again, because recovery used to be done on the entire
hugepage. With the new unmapping behavior, this no longer holds. More
subpages in the hugepage can become corrupted and need to be recovered
(i.e. unmapped) individually. The new folio_set_hugetlb_hwpoison returns
0 after adding a new raw subpage to raw_hwp_list.

Unmapping a raw HWPOISON page requires successfully allocating a
raw_hwp_page in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
may now fail due to OOM.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/mm.h | 20 ++++++-
mm/memory-failure.c | 140 ++++++++++++++++++++++++++++++++++++++------
mm/rmap.c | 38 +++++++++++-
3 files changed, 175 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4496d7bdd3ea..dc192f98cb1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3522,20 +3522,38 @@ enum mf_action_page_type {
*/
extern const struct attribute_group memory_failure_attr_group;

-#ifdef CONFIG_HUGETLB_PAGE
/*
* Struct raw_hwp_page represents information about "raw error page",
* constructing singly linked list from ->_hugetlb_hwpoison field of folio.
+ * @node: the node in folio->_hugetlb_hwpoison list.
+ * @page: the raw HWPOISON page struct.
+ * @nr_vmas_mapped: the number of VMAs that map @page when detected.
+ * @nr_expected_unmaps: if a VMA that maps @page when detected is eligible
+ * for high granularity mapping, @page is expected to be unmapped.
+ * @nr_actual_unmaps: how many times the raw page is actually unmapped.
*/
struct raw_hwp_page {
struct llist_node node;
struct page *page;
+ int nr_vmas_mapped;
+ int nr_expected_unmaps;
+ int nr_actual_unmaps;
};

+#ifdef CONFIG_HUGETLB_PAGE
static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
{
return (struct llist_head *)&folio->_hugetlb_hwpoison;
}
+
+struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
+ struct page *subpage);
+#else
+static inline struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
+ struct page *subpage)
+{
+ return NULL;
+}
#endif

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 48e62d04af17..47b935918ceb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1120,10 +1120,10 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
}

/*
- * Huge pages. Needs work.
- * Issues:
- * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
- * To narrow down kill region to one page, we need to break up pmd.
+ * Huge pages.
+ * - Without HGM: Error on hugepage is contained in hugepage unit (not in
+ * raw page unit).
+ * - With HGM: Kill region is narrowed down to just one page.
*/
static int me_huge_page(struct page_state *ps, struct page *p)
{
@@ -1131,6 +1131,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
struct page *hpage = compound_head(p);
struct address_space *mapping;
bool extra_pins = false;
+ struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(page_folio(p), p);

if (!PageHuge(hpage))
return MF_DELAYED;
@@ -1157,7 +1158,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
}
}

- if (has_extra_refcount(ps, p, extra_pins))
+ if (hwp_page->nr_expected_unmaps == 0 &&
+ has_extra_refcount(ps, p, extra_pins))
res = MF_FAILED;

return res;
@@ -1497,24 +1499,30 @@ static void try_to_split_huge_mapping(struct folio *folio,
unsigned long poisoned_addr;
unsigned long head_addr;
struct hugetlb_pte hpte;
+ struct raw_hwp_page *hwp_page = NULL;

if (WARN_ON(!mapping))
return;

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

+ hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
+ VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
+
pgoff_start = folio_pgoff(folio);
pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;

vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
+ ++hwp_page->nr_vmas_mapped;
+
/* Enable HGM on HGM-eligible VMAs. */
if (!hugetlb_hgm_eligible(vma))
continue;

i_mmap_assert_locked(vma->vm_file->f_mapping);
if (hugetlb_enable_hgm_vma(vma)) {
- pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
- vma->vm_start, vma->vm_end);
+ pr_err("%#lx: failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
+ page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);
continue;
}

@@ -1528,15 +1536,21 @@ static void try_to_split_huge_mapping(struct folio *folio,
* lock on vma->vm_file->f_mapping, which caller
* (e.g. hwpoison_user_mappings) should have already acquired.
*/
- if (hugetlb_full_walk(&hpte, vma, head_addr))
+ if (hugetlb_full_walk(&hpte, vma, head_addr)) {
+ pr_err("%#lx: failed to PT-walk with HGM on eligible VMA=[%#lx, %#lx)\n",
+ page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);
continue;
+ }

if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte,
poisoned_addr, PAGE_SHIFT)) {
- pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n",
+ pr_err("%#lx: Failed to split huge mapping: vaddr=%#lx in VMA=[%#lx, %#lx)\n",
page_to_pfn(poisoned_page), poisoned_addr,
vma->vm_start, vma->vm_end);
+ continue;
}
+
+ ++hwp_page->nr_expected_unmaps;
}
}
#else
@@ -1546,6 +1560,47 @@ static void try_to_split_huge_mapping(struct folio *folio,
}
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

+static bool is_unmapping_successful(struct folio *folio,
+ struct page *poisoned_page)
+{
+ bool unmap_success = false;
+ struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
+
+ if (!folio_test_hugetlb(folio) ||
+ folio_test_anon(folio) ||
+ !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
+ unmap_success = folio_mapped(folio);
+ if (!unmap_success)
+ pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
+ page_to_pfn(poisoned_page),
+ page_mapcount(folio_page(folio, 0)));
+
+ return unmap_success;
+ }
+
+ VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
+
+ /*
+ * Unmapping may not happen for some VMAs:
+ * - HGM-eligible VMA but @poisoned_page is not faulted yet: nothing
+ * needs to be done until page fault handling.
+ * - HGM-non-eligible VMA: mapcount decreases by nr_subpages for each VMA,
+ * but that is not tracked, so we cannot tell if unmapping from it succeeded.
+ */
+ if (hwp_page->nr_vmas_mapped != hwp_page->nr_expected_unmaps)
+ pr_info("%#lx: mapped by %d VMAs but %d unmappings are expected\n",
+ page_to_pfn(poisoned_page), hwp_page->nr_vmas_mapped,
+ hwp_page->nr_expected_unmaps);
+
+ unmap_success = hwp_page->nr_expected_unmaps == hwp_page->nr_actual_unmaps;
+
+ if (!unmap_success)
+ pr_err("%#lx: failed to unmap page (folio_mapcount=%d)\n",
+ page_to_pfn(poisoned_page), folio_mapcount(folio));
+
+ return unmap_success;
+}
+
/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
@@ -1631,10 +1686,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
try_to_unmap(folio, ttu);
}

- unmap_success = !page_mapped(hpage);
- if (!unmap_success)
- pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
- pfn, page_mapcount(hpage));
+ unmap_success = is_unmapping_successful(folio, p);

/*
* try_to_unmap() might put mlocked page in lru cache, so call
@@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);

#ifdef CONFIG_HUGETLB_PAGE

+/*
+ * Given a HWPOISON @subpage as raw page, find its location in @folio's
+ * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
+ */
+struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
+ struct page *subpage)
+{
+ struct llist_node *t, *tnode;
+ struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
+ struct raw_hwp_page *hwp_page = NULL;
+ struct raw_hwp_page *p;
+
+ VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
+
+ llist_for_each_safe(tnode, t, raw_hwp_head->first) {
+ p = container_of(tnode, struct raw_hwp_page, node);
+ if (subpage == p->page) {
+ hwp_page = p;
+ break;
+ }
+ }
+
+ return hwp_page;
+}
+
static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
{
struct llist_head *head;
@@ -1837,6 +1914,9 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
llist_for_each_safe(tnode, t, head->first) {
struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);

+ /* Ideally raw HWPoison pages are fully unmapped if possible. */
+ WARN_ON(p->nr_expected_unmaps != p->nr_actual_unmaps);
+
if (move_flag)
SetPageHWPoison(p->page);
else
@@ -1853,7 +1933,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
struct llist_head *head;
struct raw_hwp_page *raw_hwp;
struct llist_node *t, *tnode;
- int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+ bool has_hwpoison = folio_test_set_hwpoison(folio);
+ bool hgm_enabled = IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING);

/*
* Once the hwpoison hugepage has lost reliable raw error info,
@@ -1873,9 +1954,20 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
if (raw_hwp) {
raw_hwp->page = page;
+ raw_hwp->nr_vmas_mapped = 0;
+ raw_hwp->nr_expected_unmaps = 0;
+ raw_hwp->nr_actual_unmaps = 0;
llist_add(&raw_hwp->node, head);
+ if (hgm_enabled)
+ /*
+ * A new raw poisoned page. Don't return
+ * HWPOISON. Error event will be counted
+ * in action_result().
+ */
+ return 0;
+
/* the first error event will be counted in action_result(). */
- if (ret)
+ if (has_hwpoison)
num_poisoned_pages_inc(page_to_pfn(page));
} else {
/*
@@ -1889,8 +1981,16 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
* used any more, so free it.
*/
__folio_free_raw_hwp(folio, false);
+
+ /*
+ * HGM relies on raw_hwp allocated and inserted to raw_hwp_list.
+ */
+ if (hgm_enabled)
+ return -ENOMEM;
}
- return ret;
+
+ BUG_ON(hgm_enabled);
+ return has_hwpoison ? -EHWPOISON : 0;
}

static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
@@ -1936,6 +2036,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
struct page *page = pfn_to_page(pfn);
struct folio *folio = page_folio(page);
int ret = 2; /* fallback to normal page handling */
+ int set_page_hwpoison = 0;
bool count_increased = false;

if (!folio_test_hugetlb(folio))
@@ -1956,8 +2057,9 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
goto out;
}

- if (folio_set_hugetlb_hwpoison(folio, page)) {
- ret = -EHWPOISON;
+ set_page_hwpoison = folio_set_hugetlb_hwpoison(folio, page);
+ if (set_page_hwpoison) {
+ ret = set_page_hwpoison;
goto out;
}

@@ -2004,7 +2106,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
res = kill_accessing_process(current, folio_pfn(folio), flags);
}
return res;
- } else if (res == -EBUSY) {
+ } else if (res == -EBUSY || res == -ENOMEM) {
if (!(flags & MF_NO_RETRY)) {
flags |= MF_NO_RETRY;
goto retry;
diff --git a/mm/rmap.c b/mm/rmap.c
index d3bc81466902..4cfaa34b001e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1453,6 +1453,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
bool page_poisoned;
+ bool hgm_eligible = hugetlb_hgm_eligible(vma);
+ struct raw_hwp_page *hwp_page;

/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1525,6 +1527,29 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* in the case where the hugetlb page is poisoned.
*/
VM_BUG_ON_FOLIO(!page_poisoned, folio);
+
+ /*
+ * When VMA is not HGM eligible, unmap at hugepage's
+ * original P*D.
+ *
+ * When HGM is eligible:
+ * - if original P*D is split to smaller P*Ds and
+ * PTEs, we skip subpage if it is not raw HWPoison
+ * page, or it was but was already unmapped.
+ * - if original P*D is not split, skip unmapping
+ * and memory_failure result will be MF_IGNORED.
+ */
+ if (hgm_eligible) {
+ if (pvmw.pte_order > 0)
+ continue;
+ hwp_page = find_in_raw_hwp_list(folio, subpage);
+ if (hwp_page == NULL)
+ continue;
+ if (hwp_page->nr_expected_unmaps ==
+ hwp_page->nr_actual_unmaps)
+ continue;
+ }
+
/*
* huge_pmd_unshare may unmap an entire PMD page.
* There is no way of knowing exactly which PMDs may
@@ -1760,12 +1785,19 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- if (folio_test_hugetlb(folio))
+ if (!folio_test_hugetlb(folio))
+ page_remove_rmap(subpage, vma, false);
+ else {
hugetlb_remove_rmap(subpage,
pvmw.pte_order + PAGE_SHIFT,
hstate_vma(vma), vma);
- else
- page_remove_rmap(subpage, vma, false);
+ if (hgm_eligible) {
+ VM_BUG_ON_FOLIO(pvmw.pte_order > 0, folio);
+ VM_BUG_ON_FOLIO(!hwp_page, folio);
+ VM_BUG_ON_FOLIO(subpage != hwp_page->page, folio);
+ ++hwp_page->nr_actual_unmaps;
+ }
+ }

if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:16

by Jiaqi Yan

Subject: [RFC PATCH v1 5/7] hugetlb: only VM_FAULT_HWPOISON_LARGE raw page

Memory raw pages can become HWPOISON between when userspace maps
a hugepage and when userspace faults in the hugepage.

Today when hugetlb faults somewhere in a hugepage containing
HWPOISON raw pages, the result is a VM_FAULT_HWPOISON_LARGE.

This commit teaches the hugetlb page fault handler to return
VM_FAULT_HWPOISON_LARGE only if the faulting address is within a HWPOISON
raw page; otherwise, the fault handler can continue to fault in healthy
raw pages.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/mm.h | 2 +
mm/hugetlb.c | 129 ++++++++++++++++++++++++++++++++++++++++++--
mm/memory-failure.c | 1 +
3 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc192f98cb1d..7caa4530953f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3531,6 +3531,7 @@ extern const struct attribute_group memory_failure_attr_group;
* @nr_expected_unmaps: if a VMA that maps @page when detected is eligible
* for high granularity mapping, @page is expected to be unmapped.
* @nr_actual_unmaps: how many times the raw page is actually unmapped.
+ * @index: index of the poisoned subpage in the folio.
*/
struct raw_hwp_page {
struct llist_node node;
@@ -3538,6 +3539,7 @@ struct raw_hwp_page {
int nr_vmas_mapped;
int nr_expected_unmaps;
int nr_actual_unmaps;
+ unsigned long index;
};

#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1419176b7e51..f8ddf04ae0c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6158,6 +6158,30 @@ static struct folio *hugetlb_try_find_lock_folio(struct address_space *mapping,
return folio;
}

+static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct folio *folio,
+ unsigned long address,
+ struct hugetlb_pte *hpte,
+ unsigned int flags);
+
+#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct folio *folio,
+ unsigned long address,
+ struct hugetlb_pte *hpte,
+ unsigned int flags)
+{
+ if (unlikely(folio_test_hwpoison(folio))) {
+ return VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma)));
+ }
+
+ return 0;
+}
+#endif
+
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
@@ -6287,13 +6311,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
/*
* If memory error occurs between mmap() and fault, some process
* don't have hwpoisoned swap entry for errored virtual address.
- * So we need to block hugepage fault by PG_hwpoison bit check.
+ * So we need to block hugepage fault by hwpoison check:
+ * - without HGM, the check is based on PG_hwpoison
+ * - with HGM, check if the raw page for address is poisoned
*/
- if (unlikely(folio_test_hwpoison(folio))) {
- ret = VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
+ ret = hugetlb_no_page_hwpoison(mm, vma, folio, address, hpte, flags);
+ if (unlikely(ret))
goto backout_unlocked;
- }

/* Check for page in userfault range. */
if (userfaultfd_minor(vma)) {
@@ -8426,6 +8450,11 @@ int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
* the allocated PTEs created before splitting fails.
*/

+ /*
+ * For none and UFFD_WP marker PTEs, given try_to_unmap_one doesn't
+ * unmap them, delay the splitting until page fault happens. See the
+ * hugetlb_no_page_hwpoison check in hugetlb_no_page.
+ */
if (unlikely(huge_pte_none_mostly(old_entry))) {
ret = -EAGAIN;
goto skip;
@@ -8479,6 +8508,96 @@ int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}

+/*
+ * Given a hugetlb PTE, if we want to split it into the next smaller level
+ * of PTEs, return what size we should use to do the HGM walk with allocations.
+ * If the given hugetlb PTE is already at PAGE_SIZE, return -EINVAL.
+ */
+static int hgm_next_size(struct vm_area_struct *vma, struct hugetlb_pte *hpte)
+{
+ struct hstate *h = hstate_vma(vma), *tmp_h;
+ unsigned int shift;
+ unsigned long curr_size = hugetlb_pte_size(hpte);
+ unsigned long next_size;
+
+ for_each_hgm_shift(h, tmp_h, shift) {
+ next_size = 1UL << shift;
+ if (next_size < curr_size)
+ return next_size;
+ }
+
+ return -EINVAL;
+}
+
+/*
+ * Check if address is in the range of a HWPOISON raw page.
+ * During the check, the hugetlb PTE may be split into smaller hugetlb PTEs.
+ */
+static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct folio *folio,
+ unsigned long address,
+ struct hugetlb_pte *hpte,
+ unsigned int flags)
+{
+ unsigned long range_start, range_end;
+ unsigned long start_index, end_index;
+ unsigned long folio_start = vma_address(folio_page(folio, 0), vma);
+ struct llist_node *t, *tnode;
+ struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
+ struct raw_hwp_page *p = NULL;
+ bool contain_hwpoison = false;
+ int hgm_size;
+ int hgm_ret = 0;
+
+ if (likely(!folio_test_hwpoison(folio)))
+ return 0;
+
+ if (hugetlb_enable_hgm_vma(vma))
+ return VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma)));
+
+recheck:
+ range_start = address & hugetlb_pte_mask(hpte);
+ range_end = range_start + hugetlb_pte_size(hpte);
+ start_index = (range_start - folio_start) / PAGE_SIZE;
+ end_index = start_index + hugetlb_pte_size(hpte) / PAGE_SIZE;
+
+ contain_hwpoison = false;
+ llist_for_each_safe(tnode, t, raw_hwp_head->first) {
+ p = container_of(tnode, struct raw_hwp_page, node);
+ if (start_index <= p->index && p->index < end_index) {
+ contain_hwpoison = true;
+ break;
+ }
+ }
+
+ if (!contain_hwpoison)
+ return 0;
+
+ if (hugetlb_pte_size(hpte) == PAGE_SIZE)
+ return VM_FAULT_HWPOISON;
+
+ /*
+ * hugetlb_fault already ensured hugetlb_vma_lock_read.
+ * We also checked hugetlb_pte_size(hpte) != PAGE_SIZE,
+ * so hgm_size must be something meaningful to HGM.
+ */
+ hgm_size = hgm_next_size(vma, hpte);
+ VM_BUG_ON(hgm_size == -EINVAL);
+ hgm_ret = hugetlb_full_walk_alloc(hpte, vma, address, hgm_size);
+ if (hgm_ret) {
+ WARN_ON_ONCE(hgm_ret);
+ /*
+ * When splitting using HGM fails, return like
+ * HGM is not eligible or enabled.
+ */
+ return VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma)));
+ }
+ goto recheck;
+}
+
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b935918ceb..9093ba53feed 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1957,6 +1957,7 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
raw_hwp->nr_vmas_mapped = 0;
raw_hwp->nr_expected_unmaps = 0;
raw_hwp->nr_actual_unmaps = 0;
+ raw_hwp->index = folio_page_idx(folio, page);
llist_add(&raw_hwp->node, head);
if (hgm_enabled)
/*
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:17

by Jiaqi Yan

Subject: [RFC PATCH v1 1/7] hugetlb: add HugeTLB splitting functionality

The new function, hugetlb_split_to_shift, optimally splits the page
table to map a particular address at a particular granularity. This
is useful for punching a hole in the mapping and for mapping (and
unmapping) small sections of a HugeTLB page.

Splitting applies to present leaf HugeTLB PTEs only. None HugeTLB PTEs
and other non-present HugeTLB PTE types are not supported, as they are
better left untouched:
* None PTEs
* Migration PTEs
* HWPOISON PTEs
* UFFD writeprotect PTEs

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/hugetlb.h | 9 ++
mm/hugetlb.c | 249 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 258 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 742e7f2cb170..d44bf6a794e5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1266,6 +1266,9 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
unsigned long end);
int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
unsigned long end);
+int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned int desired_shift);
#else
static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
@@ -1292,6 +1295,12 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
{
return -EINVAL;
}
+int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+ const struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned int desired_shift)
+{
+ return -EINVAL;
+}
#endif

static inline
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df4c17164abb..d3f3f1c2d293 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -8203,6 +8203,255 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
return ret;
}

+/*
+ * Find the optimal HugeTLB PTE shift that @desired_addr could be mapped at.
+ */
+static int hugetlb_find_shift(struct vm_area_struct *vma,
+ unsigned long curr,
+ unsigned long end,
+ unsigned long desired_addr,
+ unsigned long desired_shift,
+ unsigned int *shift_found)
+{
+ struct hstate *h = hstate_vma(vma);
+ struct hstate *tmp_h;
+ unsigned int shift;
+ unsigned long sz;
+
+ for_each_hgm_shift(h, tmp_h, shift) {
+ sz = 1UL << shift;
+ /* This sz is not aligned or too large. */
+ if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+ continue;
+ /*
+ * When desired_addr is in [curr, curr + sz),
+ * we want shift to be as close to desired_shift
+ * as possible.
+ */
+ if (curr <= desired_addr && desired_addr < curr + sz
+ && shift > desired_shift)
+ continue;
+
+ *shift_found = shift;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+/*
+ * Given a particular address @addr and it is a present leaf HugeTLB PTE,
+ * split it so that the PTE that maps @addr is at @desired_shift.
+ */
+static int hugetlb_split_to_shift_present_leaf(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t old_entry,
+ unsigned long start,
+ unsigned long end,
+ unsigned long addr,
+ unsigned int orig_shift,
+ unsigned int desired_shift)
+{
+ bool old_entry_dirty;
+ bool old_entry_write;
+ bool old_entry_uffd_wp;
+ pte_t new_entry;
+ unsigned long curr;
+ unsigned long sz;
+ unsigned int shift;
+ int ret = 0;
+ struct hugetlb_pte new_hpte;
+ struct page *subpage = NULL;
+ struct folio *folio = page_folio(compound_head(pte_page(old_entry)));
+ struct hstate *h = hstate_vma(vma);
+ spinlock_t *ptl;
+
+ /* Unmap original unsplit hugepage per huge_ptep_get_and_clear. */
+ hugetlb_remove_rmap(folio_page(folio, 0), orig_shift, h, vma);
+
+ old_entry_dirty = huge_pte_dirty(old_entry);
+ old_entry_write = huge_pte_write(old_entry);
+ old_entry_uffd_wp = huge_pte_uffd_wp(old_entry);
+
+ for (curr = start; curr < end; curr += sz) {
+ ret = hugetlb_find_shift(vma, curr, end, addr,
+ desired_shift, &shift);
+
+ /* Unable to find a shift that works */
+ if (WARN_ON(ret))
+ goto abort;
+
+ /*
+ * Do HGM full walk and allocate new page table structures
+ * to continue to walk to the level we want.
+ */
+ sz = 1UL << shift;
+ ret = hugetlb_full_walk_alloc(&new_hpte, vma, curr, sz);
+ if (WARN_ON(ret))
+ goto abort;
+
+ BUG_ON(hugetlb_pte_size(&new_hpte) > sz);
+ /*
+ * When hugetlb_pte_size(new_hpte) is smaller than sz, advance
+ * curr by hugetlb_pte_size(new_hpte) to avoid skipping over
+ * some PTEs.
+ */
+ if (hugetlb_pte_size(&new_hpte) < sz)
+ sz = hugetlb_pte_size(&new_hpte);
+
+ subpage = hugetlb_find_subpage(h, folio, curr);
+ /*
+ * Creating a new (finer granularity) PT entry and
+ * populate it with old_entry's bits.
+ */
+ new_entry = make_huge_pte(vma, subpage,
+ huge_pte_write(old_entry), shift);
+ if (old_entry_dirty)
+ new_entry = huge_pte_mkdirty(new_entry);
+ if (old_entry_write)
+ new_entry = huge_pte_mkwrite(new_entry);
+ if (old_entry_uffd_wp)
+ new_entry = huge_pte_mkuffd_wp(new_entry);
+ ptl = hugetlb_pte_lock(&new_hpte);
+ set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
+ spin_unlock(ptl);
+ /* Increment ref/mapcount per set_huge_pte_at(). */
+ hugetlb_add_file_rmap(subpage, shift, h, vma);
+ folio_get(folio);
+ }
+ /*
+ * This refcount decrement is for the huge_ptep_get_and_clear
+ * on the hpte BEFORE splitting, for the same reason as
+ * hugetlb_remove_rmap(), but we cannot do it at that time.
+ * Now that splitting succeeded, the refcount can be decremented.
+ */
+ folio_put(folio);
+ return 0;
+abort:
+ /*
+ * Restore mapcount on the unsplit hugepage. No need to restore
+ * refcount as we won't folio_put() until splitting succeeded.
+ */
+ hugetlb_add_file_rmap(folio_page(folio, 0), orig_shift, h, vma);
+ return ret;
+}
+
+/*
+ * Given a particular address @addr, split the HugeTLB PTE that currently
+ * maps it so that, for the given @addr, the PTE that maps it is @desired_shift.
+ * The splitting is always done optimally.
+ *
+ * Example: given a HugeTLB 1G page mapped from VA 0 to 1G, if caller calls
+ * this API with addr=0 and desired_shift=PAGE_SHIFT, we will change the page
+ * table as follows:
+ * 1. The original PUD will be split into 512 2M PMDs first
+ * 2. The 1st PMD will further be split into 512 4K PTEs
+ *
+ * Callers are required to hold locks on the file mapping within vma.
+ */
+int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned int desired_shift)
+{
+ unsigned long start, end;
+ unsigned long desired_sz = 1UL << desired_shift;
+ int ret;
+ pte_t old_entry;
+ struct mmu_gather tlb;
+ struct mmu_notifier_range range;
+ spinlock_t *ptl;
+
+ BUG_ON(!hpte->ptep);
+
+ start = addr & hugetlb_pte_mask(hpte);
+ end = start + hugetlb_pte_size(hpte);
+ BUG_ON(!IS_ALIGNED(start, desired_sz));
+ BUG_ON(!IS_ALIGNED(end, desired_sz));
+ BUG_ON(addr < start || end <= addr);
+
+ if (hpte->shift == desired_shift)
+ return 0;
+
+ /*
+ * Non none-mostly hugetlb PTEs must be present leaf-level PTE,
+ * i.e. not split before.
+ */
+ ptl = hugetlb_pte_lock(hpte);
+ BUG_ON(!huge_pte_none_mostly(huge_ptep_get(hpte->ptep)) &&
+ !hugetlb_pte_present_leaf(hpte, huge_ptep_get(hpte->ptep)));
+
+ i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start, end);
+ mmu_notifier_invalidate_range_start(&range);
+
+ /*
+ * Get and clear the PTE. We will allocate new page table structures
+ * when walking the page table.
+ */
+ old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
+ spin_unlock(ptl);
+
+ /*
+ * From now on, any failure exit needs to go through "skip" to
+ * put old_entry back. If any form of hugetlb_split_to_shift_xxx
+ * is invoked, it also needs to go through "abort" to get rid of
+ * the allocated PTEs created before splitting fails.
+ */
+
+ if (unlikely(huge_pte_none_mostly(old_entry))) {
+ ret = -EAGAIN;
+ goto skip;
+ }
+ if (unlikely(!pte_present(old_entry))) {
+ if (is_hugetlb_entry_migration(old_entry))
+ ret = -EBUSY;
+ else if (is_hugetlb_entry_hwpoisoned(old_entry))
+ ret = -EHWPOISON;
+ else {
+ WARN_ONCE(1, "Unexpected case of non-present HugeTLB PTE\n");
+ ret = -EINVAL;
+ }
+ goto skip;
+ }
+
+ if (!hugetlb_pte_present_leaf(hpte, old_entry)) {
+ WARN_ONCE(1, "HugeTLB present PTE is not leaf\n");
+ ret = -EAGAIN;
+ goto skip;
+ }
+ /* From now on old_entry is present leaf entry. */
+ ret = hugetlb_split_to_shift_present_leaf(mm, vma, old_entry,
+ start, end, addr,
+ hpte->shift,
+ desired_shift);
+ if (ret)
+ goto abort;
+
+ /* Splitting done, new page table entries successfully setup. */
+ mmu_notifier_invalidate_range_end(&range);
+ return 0;
+abort:
+ /* Splitting failed, restoring to the original page table state. */
+ tlb_gather_mmu(&tlb, mm);
+ /* Decrement mapcount for all the split PTEs. */
+ __unmap_hugepage_range(&tlb, vma, start, end, NULL, ZAP_FLAG_DROP_MARKER);
+ /*
+ * Free any newly allocated page table entries.
+ * Ok if no new entries allocated at all.
+ */
+ hugetlb_free_pgd_range(&tlb, start, end, start, end);
+ /* Decrement refcount for all the split PTEs. */
+ tlb_finish_mmu(&tlb);
+skip:
+ /* Restore the old entry. */
+ ptl = hugetlb_pte_lock(hpte);
+ set_huge_pte_at(mm, start, hpte->ptep, old_entry);
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(&range);
+ return ret;
+}
+
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:45

by Jiaqi Yan

Subject: [RFC PATCH v1 6/7] selftest/mm: test PAGESIZE unmapping HWPOISON pages

After injecting memory errors to byte addresses inside a HugeTLB page,
the updated test checks that
1. only a raw page is unmapped, and userspace gets the correct SIGBUS
from the kernel.
2. other subpages in the same hugepage are still mapped and their data
is not corrupted.

Signed-off-by: Jiaqi Yan <[email protected]>
---
tools/testing/selftests/mm/hugetlb-hgm.c | 194 +++++++++++++++++++----
1 file changed, 167 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selftests/mm/hugetlb-hgm.c
index c0ba6ad44005..bc9529986b66 100644
--- a/tools/testing/selftests/mm/hugetlb-hgm.c
+++ b/tools/testing/selftests/mm/hugetlb-hgm.c
@@ -39,6 +39,10 @@
#define MADV_SPLIT 26
#endif

+#ifndef NUM_HWPOISON_PAGES
+#define NUM_HWPOISON_PAGES 3UL
+#endif
+
#define PREFIX " ... "
#define ERROR_PREFIX " !!! "

@@ -241,6 +245,9 @@ static int test_sigbus(char *addr, bool poison)
sigbus_addr, addr);
else if (poison && !was_mceerr)
printf(ERROR_PREFIX "didn't get an MCEERR?\n");
+ else if (!poison && was_mceerr)
+ printf(ERROR_PREFIX "got BUS_MCEERR_AR sigbus on expected healthy address: %p\n",
+ sigbus_addr);
else
ret = 0;
out:
@@ -272,43 +279,176 @@ static int read_event_from_uffd(int *uffd, pthread_t *pthread)
return 0;
}

-static int test_sigbus_range(char *primary_map, size_t len, bool hwpoison)
+struct range_exclude_pages {
+ /* Starting address of the buffer. */
+ char *mapping;
+ /* Length of the buffer in bytes. */
+ size_t length;
+ /* The value that each byte in buffer should equal to. */
+ char value;
+ /*
+ * PAGESIZE aligned addresses excluded from the checking,
+ * e.g. if PAGE_SIZE=4k, for each addr in excludes,
+ * skips checking on [addr, addr + 4096).
+ */
+ unsigned long excluded[NUM_HWPOISON_PAGES];
+};
+
+static int check_range_exclude_pages(struct range_exclude_pages *range)
+{
+ const unsigned long pagesize = getpagesize();
+ unsigned long excluded_index;
+ unsigned long page_index;
+ bool should_skip;
+ size_t i = 0;
+ size_t j = 0;
+
+ while (i < range->length) {
+ page_index = ((unsigned long)(range->mapping + i)) / pagesize;
+ should_skip = false;
+ for (j = 0; j < NUM_HWPOISON_PAGES; ++j) {
+ excluded_index = range->excluded[j] / pagesize;
+ if (page_index == excluded_index) {
+ should_skip = true;
+ break;
+ }
+ }
+ if (should_skip) {
+ printf(PREFIX "skip excluded addr range [%#lx, %#lx)\n",
+ (unsigned long)(range->mapping + i),
+ (unsigned long)(range->mapping + i + pagesize));
+ i += pagesize;
+ continue;
+ }
+ if (range->mapping[i] != range->value) {
+ printf(ERROR_PREFIX "mismatch at %p (%d != %d)\n",
+ &range->mapping[i], range->mapping[i], range->value);
+ return -1;
+ }
+ ++i;
+ }
+
+ return 0;
+}
+
+enum test_status verify_raw_pages(char *map, size_t len,
+ unsigned long excluded[NUM_HWPOISON_PAGES])
{
const unsigned long pagesize = getpagesize();
- const int num_checks = 512;
- unsigned long bytes_per_check = len/num_checks;
- int i;
+ unsigned long size, offset, value;
+ size_t j = 0;
+
+ for (size = len / 2, offset = 0, value = 1; size > pagesize;
+ offset += size, size /= 2, ++value) {
+ struct range_exclude_pages range = {
+ .mapping = map + offset,
+ .length = size,
+ .value = value,
+ };
+ for (j = 0; j < NUM_HWPOISON_PAGES; ++j)
+ range.excluded[j] = excluded[j];
+
+ printf(PREFIX "checking non-poisoned range [%p, %p) "
+ "(len=%#lx) per-byte value=%lu\n",
+ range.mapping, range.mapping + range.length,
+ range.length, value);
+ if (check_range_exclude_pages(&range))
+ return TEST_FAILED;
+
+ printf(PREFIX PREFIX "good\n");
+ }

- printf(PREFIX "checking that we can't access "
- "(%d addresses within %p -> %p)\n",
- num_checks, primary_map, primary_map + len);
+ return TEST_PASSED;
+}

- if (pagesize > bytes_per_check)
- bytes_per_check = pagesize;
+static int read_hwpoison_pages(unsigned long *nr_hwp_pages)
+{
+ const unsigned long pagesize = getpagesize();
+ char buffer[256] = {0};
+ char *cmd = "cat /proc/meminfo | grep -i HardwareCorrupted | grep -o '[0-9]*'";
+ FILE *cmdfile = popen(cmd, "r");

- for (i = 0; i < len; i += bytes_per_check)
- if (test_sigbus(primary_map + i, hwpoison) < 0)
- return 1;
- /* check very last byte, because we left it unmapped */
- if (test_sigbus(primary_map + len - 1, hwpoison))
- return 1;
+ if (!(fgets(buffer, sizeof(buffer), cmdfile))) {
+ perror("failed to read HardwareCorrupted from /proc/meminfo\n");
+ return -1;
+ }
+ pclose(cmdfile);
+ *nr_hwp_pages = atoll(buffer) * 1024 / pagesize;

return 0;
}

-static enum test_status test_hwpoison(char *primary_map, size_t len)
+static enum test_status test_hwpoison_one_raw_page(char *hwpoison_addr)
{
- printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len);
- if (madvise(primary_map, len, MADV_HWPOISON) < 0) {
+ const unsigned long pagesize = getpagesize();
+
+ printf(PREFIX "poisoning [%p, %p) (len=%#lx)\n",
+ hwpoison_addr, hwpoison_addr + pagesize, pagesize);
+ if (madvise(hwpoison_addr, pagesize, MADV_HWPOISON) < 0) {
perror(ERROR_PREFIX "MADV_HWPOISON failed");
return TEST_SKIPPED;
}

- return test_sigbus_range(primary_map, len, true)
- ? TEST_FAILED : TEST_PASSED;
+ printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx)\n",
+ hwpoison_addr, hwpoison_addr + pagesize, pagesize);
+ if (test_sigbus(hwpoison_addr, true) < 0)
+ return TEST_FAILED;
+
+ return TEST_PASSED;
}

-static int test_fork(int uffd, char *primary_map, size_t len)
+static enum test_status test_hwpoison_present(char *map, size_t len,
+ bool already_injected)
+{
+ const unsigned long pagesize = getpagesize();
+ const unsigned long hwpoison_next = 128;
+ unsigned long nr_hwpoison_pages_before, nr_hwpoison_pages_after;
+ enum test_status ret;
+ size_t i;
+ char *hwpoison_addr = map;
+ unsigned long hwpoison_addrs[NUM_HWPOISON_PAGES];
+
+ if (hwpoison_next * (NUM_HWPOISON_PAGES - 1) >= (len / pagesize)) {
+ printf(ERROR_PREFIX "max hwpoison_addr out of range");
+ return TEST_SKIPPED;
+ }
+
+ for (i = 0; i < NUM_HWPOISON_PAGES; ++i) {
+ hwpoison_addrs[i] = (unsigned long)hwpoison_addr;
+ hwpoison_addr += hwpoison_next * pagesize;
+ }
+
+ if (already_injected)
+ return verify_raw_pages(map, len, hwpoison_addrs);
+
+ if (read_hwpoison_pages(&nr_hwpoison_pages_before)) {
+ printf(ERROR_PREFIX "check #HWPOISON pages\n");
+ return TEST_SKIPPED;
+ }
+ printf(PREFIX "Before injections, #HWPOISON pages = %ld\n", nr_hwpoison_pages_before);
+
+ for (i = 0; i < NUM_HWPOISON_PAGES; ++i) {
+ ret = test_hwpoison_one_raw_page((char *)hwpoison_addrs[i]);
+ if (ret != TEST_PASSED)
+ return ret;
+ }
+
+ if (read_hwpoison_pages(&nr_hwpoison_pages_after)) {
+ printf(ERROR_PREFIX "check #HWPOISON pages\n");
+ return TEST_SKIPPED;
+ }
+ printf(PREFIX "After injections, #HWPOISON pages = %ld\n", nr_hwpoison_pages_after);
+
+ if (nr_hwpoison_pages_after - nr_hwpoison_pages_before != NUM_HWPOISON_PAGES) {
+ printf(ERROR_PREFIX "delta #HWPOISON pages != %ld",
+ NUM_HWPOISON_PAGES);
+ return TEST_FAILED;
+ }
+
+ return verify_raw_pages(map, len, hwpoison_addrs);
+}
+
+int test_fork(int uffd, char *primary_map, size_t len)
{
int status;
int ret = 0;
@@ -360,7 +500,6 @@ static int test_fork(int uffd, char *primary_map, size_t len)

pthread_join(uffd_thd, NULL);
return ret;
-
}

static int uffd_register(int uffd, char *primary_map, unsigned long len,
@@ -394,6 +533,7 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type)
bool uffd_wp = type == TEST_UFFDWP;
bool verify = type == TEST_DEFAULT;
int register_args;
+ enum test_status hwp_status = TEST_SKIPPED;

if (ftruncate(fd, len) < 0) {
perror(ERROR_PREFIX "ftruncate failed");
@@ -489,10 +629,10 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type)
* mapping.
*/
if (hwpoison) {
- enum test_status new_status = test_hwpoison(primary_map, len);
-
- if (new_status != TEST_PASSED) {
- status = new_status;
+ /* test_hwpoison can fail with TEST_SKIPPED. */
+ hwp_status = test_hwpoison_present(primary_map, len, false);
+ if (hwp_status != TEST_PASSED) {
+ status = hwp_status;
goto done;
}
}
@@ -539,7 +679,7 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type)
/*
* Verify that memory is still poisoned.
*/
- if (hwpoison && test_sigbus_range(primary_map, len, true))
+ if (hwpoison && test_hwpoison_present(primary_map, len, true))
goto done;

status = TEST_PASSED;
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:46:09

by Jiaqi Yan

Subject: [RFC PATCH v1 7/7] selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

For a not-yet-faulted hugepage containing a HWPOISON raw page, test that
1. only the HWPOISON raw page is not faulted in, and a BUS_MCEERR_AR
SIGBUS is sent to userspace.
2. healthy raw pages are faulted in as normal. Since the hugepage
has been write-protected via UFFD, a non-BUS_MCEERR_AR SIGBUS is
sent to userspace.

Signed-off-by: Jiaqi Yan <[email protected]>
---
tools/testing/selftests/mm/hugetlb-hgm.c | 170 +++++++++++++++++++++++
1 file changed, 170 insertions(+)

diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selftests/mm/hugetlb-hgm.c
index bc9529986b66..81ee2d99fea8 100644
--- a/tools/testing/selftests/mm/hugetlb-hgm.c
+++ b/tools/testing/selftests/mm/hugetlb-hgm.c
@@ -515,6 +515,169 @@ static int uffd_register(int uffd, char *primary_map, unsigned long len,
return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

+static int setup_present_map(char *present_map, size_t len)
+{
+ size_t offset = 0;
+ unsigned char iter = 0;
+ unsigned long pagesize = getpagesize();
+ uint64_t size;
+
+ for (size = len/2; size >= pagesize;
+ offset += size, size /= 2) {
+ iter++;
+ memset(present_map + offset, iter, size);
+ }
+ return 0;
+}
+
+static enum test_status test_hwpoison_absent_uffd_wp(int fd, size_t hugepagesize, size_t len)
+{
+ int uffd;
+ char *absent_map, *present_map;
+ struct uffdio_api api;
+ int register_args;
+ struct sigaction new, old;
+ enum test_status status = TEST_SKIPPED;
+ const unsigned long pagesize = getpagesize();
+ const unsigned long hwpoison_index = 128;
+ char *hwpoison_addr;
+
+ if (hwpoison_index >= (len / pagesize)) {
+ printf(ERROR_PREFIX "hwpoison_index out of range");
+ return TEST_FAILED;
+ }
+
+ if (ftruncate(fd, len) < 0) {
+ perror(ERROR_PREFIX "ftruncate failed");
+ return TEST_FAILED;
+ }
+
+ uffd = userfaultfd(O_CLOEXEC);
+ if (uffd < 0) {
+ perror(ERROR_PREFIX "uffd not created");
+ return TEST_FAILED;
+ }
+
+ absent_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (absent_map == MAP_FAILED) {
+ perror(ERROR_PREFIX "mmap for ABSENT mapping failed");
+ goto close_uffd;
+ }
+ printf(PREFIX "ABSENT mapping: %p\n", absent_map);
+
+ api.api = UFFD_API;
+ api.features = UFFD_FEATURE_SIGBUS | UFFD_FEATURE_EXACT_ADDRESS |
+ UFFD_FEATURE_EVENT_FORK;
+ if (ioctl(uffd, UFFDIO_API, &api) == -1) {
+ perror(ERROR_PREFIX "UFFDIO_API failed");
+ goto unmap_absent;
+ }
+
+ /*
+ * Register with UFFDIO_REGISTER_MODE_WP to have UFFD WP bit on
+ * the HugeTLB page table entry.
+ */
+ register_args = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP;
+ if (uffd_register(uffd, absent_map, len, register_args)) {
+ perror(ERROR_PREFIX "UFFDIO_REGISTER failed");
+ goto unmap_absent;
+ }
+
+ new.sa_sigaction = &sigbus_handler;
+ new.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGBUS, &new, &old) < 0) {
+ perror(ERROR_PREFIX "could not setup SIGBUS handler");
+ goto unmap_absent;
+ }
+
+ /*
+ * Set WP markers on the absent huge mapping. With HGM enabled in
+ * the kernel config, memory_failure will enable HGM in the kernel,
+ * so there is no need to enable HGM from userspace.
+ */
+ if (userfaultfd_writeprotect(uffd, absent_map, len, true) < 0) {
+ status = TEST_FAILED;
+ goto unmap_absent;
+ }
+
+ status = TEST_PASSED;
+
+ /*
+ * With MAP_SHARED hugetlb memory, we can inject a memory error into
+ * the not-yet-faulted mapping (absent_map) by injecting the error
+ * into an already faulted mapping (present_map).
+ */
+ present_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (present_map == MAP_FAILED) {
+ perror(ERROR_PREFIX "mmap for PRESENT mapping failed");
+ status = TEST_FAILED;
+ goto unmap_absent;
+ }
+ printf(PREFIX "PRESENT mapping: %p\n", present_map);
+ setup_present_map(present_map, len);
+
+ hwpoison_addr = present_map + hwpoison_index * pagesize;
+ if (madvise(hwpoison_addr, pagesize, MADV_HWPOISON)) {
+ perror(ERROR_PREFIX "MADV_HWPOISON a page in PRESENT mapping failed");
+ status = TEST_FAILED;
+ goto unmap_present;
+ }
+
+ printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx) in PRESENT mapping\n",
+ hwpoison_addr, hwpoison_addr + pagesize, pagesize);
+ if (test_sigbus(hwpoison_addr, true) < 0) {
+ status = TEST_FAILED;
+ goto done;
+ }
+ printf(PREFIX "checking healthy pages in PRESENT mapping\n");
+ unsigned long hwpoison_addrs[] = {
+ (unsigned long)hwpoison_addr,
+ (unsigned long)hwpoison_addr,
+ (unsigned long)hwpoison_addr
+ };
+ status = verify_raw_pages(present_map, len, hwpoison_addrs);
+ if (status != TEST_PASSED) {
+ printf(ERROR_PREFIX "checking healthy pages failed\n");
+ goto done;
+ }
+
+ for (int i = 0; i < len; i += pagesize) {
+ if (i == hwpoison_index * pagesize) {
+ printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx) in ABSENT mapping\n",
+ absent_map + i, absent_map + i + pagesize, pagesize);
+ if (test_sigbus(absent_map + i, true) < 0) {
+ status = TEST_FAILED;
+ break;
+ }
+ } else {
+ /*
+ * With UFFD_FEATURE_SIGBUS, we should get a SIGBUS for
+ * every not faulted (non present) page/byte.
+ */
+ if (test_sigbus(absent_map + i, false) < 0) {
+ printf(PREFIX "checking healthy range [%p, %p) (len=%#lx) in ABSENT mapping failed\n",
+ absent_map + i, absent_map + i + pagesize, pagesize);
+ status = TEST_FAILED;
+ break;
+ }
+ }
+ }
+done:
+ if (ftruncate(fd, 0) < 0) {
+ perror(ERROR_PREFIX "ftruncate back to 0 failed");
+ status = TEST_FAILED;
+ }
+unmap_present:
+ printf(PREFIX "Unmap PRESENT mapping=%p\n", present_map);
+ munmap(present_map, len);
+unmap_absent:
+ printf(PREFIX "Unmap ABSENT mapping=%p\n", absent_map);
+ munmap(absent_map, len);
+close_uffd:
+ printf(PREFIX "Close UFFD\n");
+ close(uffd);
+ return status;
+}
+
enum test_type {
TEST_DEFAULT,
TEST_UFFDWP,
@@ -744,6 +907,13 @@ int main(void)
printf("HGM hwpoison test: %s\n", status_to_str(status));
if (status == TEST_FAILED)
ret = -1;
+
+ printf("HGM hwpoison UFFD-WP marker test...\n");
+ status = test_hwpoison_absent_uffd_wp(fd, hugepagesize, len);
+ printf("HGM hwpoison UFFD-WP marker test: %s\n",
+ status_to_str(status));
+ if (status == TEST_FAILED)
+ ret = -1;
close:
close(fd);

--
2.40.1.495.gc816e09b53d-goog

Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> the raw HWPOISON page (previously split and mapped at PTE size).
> If HGM failed to be enabled on eligible VMA or splitting failed,
> try_to_unmap_one fails.
>
> For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> the whole P*D.
>
> When only the raw HWPOISON subpage is unmapped but others keep mapped,
> the old way in memory_failure to check if unmapping successful doesn't
> work. So introduce is_unmapping_successful() to cover both existing and
> new unmapping behavior.
>
> For the new unmapping behavior, store how many times a raw HWPOISON page
> is expected to be unmapped, and how many times it is actually unmapped
> in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> unmap_success = (nr_expected_unamps == nr_actual_unmaps).
>
> Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> recovery actions again because recovery used to be done on the entire
> hugepage. With the new unmapping behavior, this doesn't hold. More
> subpages in the hugepage can become corrupted, and needs to be recovered
> (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> 0 after adding a new raw subpage to raw_hwp_list.
>
> Unmapping raw HWPOISON page requires allocating raw_hwp_page
> successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> now may fail due to OOM.
>
> Signed-off-by: Jiaqi Yan <[email protected]>
> ---
...

> @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
>
> #ifdef CONFIG_HUGETLB_PAGE
>
> +/*
> + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> + */
> +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage)
> +{
> + struct llist_node *t, *tnode;
> + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> + struct raw_hwp_page *hwp_page = NULL;
> + struct raw_hwp_page *p;
> +
> + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);

I'm testing the series (on top of v6.2-rc4 + HGM v2 patchset) and found the
following error triggered by this VM_BUG_ON_PAGE(). The testcase is just to
inject hwpoison on an anonymous page (it's not hugetlb-related one).

[ 790.610985] ===> testcase 'mm/hwpoison/base/backend-anonymous_error-hard-offline_access-avoid.auto3' start
[ 793.304927] page:000000006743177b refcount:1 mapcount:0 mapping:0000000000000000 index:0x700000000 pfn:0x14d739
[ 793.309322] memcg:ffff8a30c50b6000
[ 793.310934] anon flags: 0x57ffffe08a001d(locked|uptodate|dirty|lru|mappedtodisk|swapbacked|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
[ 793.316665] raw: 0057ffffe08a001d ffffe93cc5353c88 ffffe93cc5685fc8 ffff8a30c91878f1
[ 793.320211] raw: 0000000700000000 0000000000000000 00000001ffffffff ffff8a30c50b6000
[ 793.323665] page dumped because: VM_BUG_ON_PAGE(PageHWPoison(subpage))
[ 793.326764] ------------[ cut here ]------------
[ 793.329080] kernel BUG at mm/memory-failure.c:1894!
[ 793.331895] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 793.334854] CPU: 4 PID: 2644 Comm: mceinj.sh Tainted: G E N 6.2.0-rc4-v6.2-rc2-230529-1404+ #63
[ 793.340710] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[ 793.345875] RIP: 0010:hwpoison_user_mappings+0x654/0x780
[ 793.349066] Code: ef 89 de e8 6e bc f8 ff 48 8b 7c 24 20 48 83 c7 58 e8 10 bb d9 ff e9 5f fb ff ff 48 c7 c6 80 ce 4c b1 4c 89 ef e8 1c 38 f6 ff <0f> 0b 48 c7 c6 7b c8 4c b1 4c 89 ef e8 0b 38 f6 ff 0f 0b 8b 45 58
[ 793.359732] RSP: 0018:ffffa3ff85ed3d28 EFLAGS: 00010296
[ 793.362367] RAX: 000000000000003a RBX: 0000000000000018 RCX: 0000000000000000
[ 793.365763] RDX: 0000000000000001 RSI: ffffffffb14ac451 RDI: 00000000ffffffff
[ 793.368698] RBP: ffffe93cc535ce40 R08: 0000000000000000 R09: ffffa3ff85ed3ba0
[ 793.370837] R10: 0000000000000003 R11: ffffffffb1d3ed28 R12: 000000000014d739
[ 793.372903] R13: ffffe93cc535ce40 R14: ffffe93cc535ce40 R15: ffffe93cc535ce40
[ 793.374931] FS: 00007f6ccc42a740(0000) GS:ffff8a31bbc00000(0000) knlGS:0000000000000000
[ 793.377136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 793.378656] CR2: 0000561aad6474b2 CR3: 00000001492d4005 CR4: 0000000000170ee0
[ 793.380514] DR0: ffffffffb28ed7d0 DR1: ffffffffb28ed7d1 DR2: ffffffffb28ed7d2
[ 793.382296] DR3: ffffffffb28ed7d3 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 793.384028] Call Trace:
[ 793.384655] <TASK>
[ 793.385210] ? __lru_add_drain_all+0x164/0x1f0
[ 793.386316] memory_failure+0x352/0xaa0
[ 793.387249] ? __pfx_bpf_lsm_capable+0x10/0x10
[ 793.388323] ? __pfx_security_capable+0x10/0x10
[ 793.389350] hard_offline_page_store+0x46/0x80
[ 793.390397] kernfs_fop_write_iter+0x11e/0x200
[ 793.391441] vfs_write+0x1e4/0x3a0
[ 793.392221] ksys_write+0x53/0xd0
[ 793.392976] do_syscall_64+0x3a/0x90
[ 793.393790] entry_SYSCALL_64_after_hwframe+0x72/0xdc

I'm wondering how this code path is called, one possible path is like this:

hwpoison_user_mappings
if PageHuge(hpage) && !PageAnon(hpage)
try_to_split_huge_mapping()
find_in_raw_hwp_list
VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)

but this looks unlikely because the precheck "PageHuge(hpage) && !PageAnon(hpage)" is
false for anonymous pages.

Another possible code path is:

hwpoison_user_mappings
if PageHuge(hpage) && !PageAnon(hpage)
...
else
try_to_unmap
rmap_walk
rmap_walk_anon
try_to_unmap_one
if folio_test_hugetlb
if hgm_eligible
find_in_raw_hwp_list
VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)

but this looks also unlikely because of checking folio_test_hugetlb and hgm_eligible
(I think both are false in this testcase.)
Maybe I miss something (and I'll dig this more), but let me share the issue.

Thanks,
Naoya Horiguchi

2023-05-30 21:43:56

by Jiaqi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Mon, May 29, 2023 at 7:25 PM HORIGUCHI NAOYA(堀口 直也)
<[email protected]> wrote:
>
> On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> > When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> > the raw HWPOISON page (previously split and mapped at PTE size).
> > If HGM failed to be enabled on eligible VMA or splitting failed,
> > try_to_unmap_one fails.
> >
> > For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> > the whole P*D.
> >
> > When only the raw HWPOISON subpage is unmapped but others keep mapped,
> > the old way in memory_failure to check if unmapping successful doesn't
> > work. So introduce is_unmapping_successful() to cover both existing and
> > new unmapping behavior.
> >
> > For the new unmapping behavior, store how many times a raw HWPOISON page
> > is expected to be unmapped, and how many times it is actually unmapped
> > in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> > from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> > unmap_success = (nr_expected_unamps == nr_actual_unmaps).
> >
> > Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> > raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> > recovery actions again because recovery used to be done on the entire
> > hugepage. With the new unmapping behavior, this doesn't hold. More
> > subpages in the hugepage can become corrupted, and needs to be recovered
> > (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> > 0 after adding a new raw subpage to raw_hwp_list.
> >
> > Unmapping raw HWPOISON page requires allocating raw_hwp_page
> > successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> > now may fail due to OOM.
> >
> > Signed-off-by: Jiaqi Yan <[email protected]>
> > ---
> ...
>
> > @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> >
> > #ifdef CONFIG_HUGETLB_PAGE
> >
> > +/*
> > + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> > + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> > + */
> > +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> > + struct page *subpage)
> > +{
> > + struct llist_node *t, *tnode;
> > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> > + struct raw_hwp_page *hwp_page = NULL;
> > + struct raw_hwp_page *p;
> > +
> > + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
>
> I'm testing the series (on top of v6.2-rc4 + HGM v2 patchset) and found the
> following error triggered by this VM_BUG_ON_PAGE(). The testcase is just to
> inject hwpoison on an anonymous page (it's not hugetlb-related one).

Thanks for reporting this problem, Naoya!

My mistake: this assertion was meant to check "!PageHWPoison(subpage)", to
make sure the caller of find_in_raw_hwp_list already knows that subpage is
hw-corrupted.
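
A sketch of the intended assertion (not a posted fix yet):

    /* The caller must already know that @subpage is HWPOISON. */
    VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);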

>
> [ 790.610985] ===> testcase 'mm/hwpoison/base/backend-anonymous_error-hard-offline_access-avoid.auto3' start
> [ 793.304927] page:000000006743177b refcount:1 mapcount:0 mapping:0000000000000000 index:0x700000000 pfn:0x14d739
> [ 793.309322] memcg:ffff8a30c50b6000
> [ 793.310934] anon flags: 0x57ffffe08a001d(locked|uptodate|dirty|lru|mappedtodisk|swapbacked|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
> [ 793.316665] raw: 0057ffffe08a001d ffffe93cc5353c88 ffffe93cc5685fc8 ffff8a30c91878f1
> [ 793.320211] raw: 0000000700000000 0000000000000000 00000001ffffffff ffff8a30c50b6000
> [ 793.323665] page dumped because: VM_BUG_ON_PAGE(PageHWPoison(subpage))
> [ 793.326764] ------------[ cut here ]------------
> [ 793.329080] kernel BUG at mm/memory-failure.c:1894!
> [ 793.331895] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [ 793.334854] CPU: 4 PID: 2644 Comm: mceinj.sh Tainted: G E N 6.2.0-rc4-v6.2-rc2-230529-1404+ #63
> [ 793.340710] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
> [ 793.345875] RIP: 0010:hwpoison_user_mappings+0x654/0x780
> [ 793.349066] Code: ef 89 de e8 6e bc f8 ff 48 8b 7c 24 20 48 83 c7 58 e8 10 bb d9 ff e9 5f fb ff ff 48 c7 c6 80 ce 4c b1 4c 89 ef e8 1c 38 f6 ff <0f> 0b 48 c7 c6 7b c8 4c b1 4c 89 ef e8 0b 38 f6 ff 0f 0b 8b 45 58
> [ 793.359732] RSP: 0018:ffffa3ff85ed3d28 EFLAGS: 00010296
> [ 793.362367] RAX: 000000000000003a RBX: 0000000000000018 RCX: 0000000000000000
> [ 793.365763] RDX: 0000000000000001 RSI: ffffffffb14ac451 RDI: 00000000ffffffff
> [ 793.368698] RBP: ffffe93cc535ce40 R08: 0000000000000000 R09: ffffa3ff85ed3ba0
> [ 793.370837] R10: 0000000000000003 R11: ffffffffb1d3ed28 R12: 000000000014d739
> [ 793.372903] R13: ffffe93cc535ce40 R14: ffffe93cc535ce40 R15: ffffe93cc535ce40
> [ 793.374931] FS: 00007f6ccc42a740(0000) GS:ffff8a31bbc00000(0000) knlGS:0000000000000000
> [ 793.377136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 793.378656] CR2: 0000561aad6474b2 CR3: 00000001492d4005 CR4: 0000000000170ee0
> [ 793.380514] DR0: ffffffffb28ed7d0 DR1: ffffffffb28ed7d1 DR2: ffffffffb28ed7d2
> [ 793.382296] DR3: ffffffffb28ed7d3 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [ 793.384028] Call Trace:
> [ 793.384655] <TASK>
> [ 793.385210] ? __lru_add_drain_all+0x164/0x1f0
> [ 793.386316] memory_failure+0x352/0xaa0
> [ 793.387249] ? __pfx_bpf_lsm_capable+0x10/0x10
> [ 793.388323] ? __pfx_security_capable+0x10/0x10
> [ 793.389350] hard_offline_page_store+0x46/0x80
> [ 793.390397] kernfs_fop_write_iter+0x11e/0x200
> [ 793.391441] vfs_write+0x1e4/0x3a0
> [ 793.392221] ksys_write+0x53/0xd0
> [ 793.392976] do_syscall_64+0x3a/0x90
> [ 793.393790] entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> I'm wondering how this code path is called, one possible path is like this:
>
> hwpoison_user_mappings
> if PageHuge(hpage) && !PageAnon(hpage)
> try_to_split_huge_mapping()
> find_in_raw_hwp_list
> VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
>
> but this looks unlikely because the precheck "PageHuge(hpage) && !PageAnon(hpage)" is
> false for anonymous pages.
>
> Another possible code path is:
>
> hwpoison_user_mappings
> if PageHuge(hpage) && !PageAnon(hpage)
> ...
> else
> try_to_unmap
> rmap_walk
> rmap_walk_anon
> try_to_unmap_one
> if folio_test_hugetlb
> if hgm_eligible
> find_in_raw_hwp_list
> VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
>
> but this looks also unlikely because of checking folio_test_hugetlb and hgm_eligible
> (I think both are false in this testcase.)
> Maybe I miss something (and I'll dig this more), but let me share the issue.

I bet it is in "is_unmapping_successful". So another problem with this
patch is that "is_unmapping_successful" should only call
find_in_raw_hwp_list after it has handled the non-hugetlb and non-shared
mapping cases, i.e.:

struct raw_hwp_page *hwp_page = NULL;

if (!folio_test_hugetlb(folio) ||
    folio_test_anon(folio) ||
    !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
	...
}

hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
VM_BUG_ON_PAGE(!hwp_page, poisoned_page);

I will make sure these two issues get fixed up in follow-up revisions.

>
> Thanks,
> Naoya Horiguchi

2023-05-30 22:22:54

by Jiaqi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Tue, May 30, 2023 at 2:31 PM Jiaqi Yan <[email protected]> wrote:
>
> On Mon, May 29, 2023 at 7:25 PM HORIGUCHI NAOYA(堀口 直也)
> <[email protected]> wrote:
> >
> > On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> > > When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> > > the raw HWPOISON page (previously split and mapped at PTE size).
> > > If HGM failed to be enabled on eligible VMA or splitting failed,
> > > try_to_unmap_one fails.
> > >
> > > For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> > > the whole P*D.
> > >
> > > When only the raw HWPOISON subpage is unmapped but others keep mapped,
> > > the old way in memory_failure to check if unmapping successful doesn't
> > > work. So introduce is_unmapping_successful() to cover both existing and
> > > new unmapping behavior.
> > >
> > > For the new unmapping behavior, store how many times a raw HWPOISON page
> > > is expected to be unmapped, and how many times it is actually unmapped
> > > in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> > > from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> > > unmap_success = (nr_expected_unamps == nr_actual_unmaps).
> > >
> > > Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> > > raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> > > recovery actions again because recovery used to be done on the entire
> > > hugepage. With the new unmapping behavior, this doesn't hold. More
> > > subpages in the hugepage can become corrupted, and needs to be recovered
> > > (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> > > 0 after adding a new raw subpage to raw_hwp_list.
> > >
> > > Unmapping raw HWPOISON page requires allocating raw_hwp_page
> > > successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> > > now may fail due to OOM.
> > >
> > > Signed-off-by: Jiaqi Yan <[email protected]>
> > > ---
> > ...
> >
> > > @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> > >
> > > #ifdef CONFIG_HUGETLB_PAGE
> > >
> > > +/*
> > > + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> > > + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> > > + */
> > > +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,

BTW, per our discussion here[1], this routine will probably reuse what
comes out of the refactored routine.

It should be safe for try_to_unmap_one to hold a raw_hwp_page returned
from find_in_raw_hwp_list as long as raw_hwp_list is protected by
mf_mutex.

[1] https://lore.kernel.org/linux-mm/CACw3F53+Hg4CgFoPj3LLSiURzWfa2egWLO-=12GzfhsNC3XTvQ@mail.gmail.com/T/#m9966de1007b80eb8bd2c2ce0a9db13624cd2652e

> > > + struct page *subpage)
> > > +{
> > > + struct llist_node *t, *tnode;
> > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> > > + struct raw_hwp_page *hwp_page = NULL;
> > > + struct raw_hwp_page *p;
> > > +
> > > + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
> >
> > I'm testing the series (on top of v6.2-rc4 + HGM v2 patchset) and found the
> > following error triggered by this VM_BUG_ON_PAGE(). The testcase is just to
> > inject hwpoison on an anonymous page (it's not hugetlb-related one).
>
> Thanks for reporting this problem, Naoya!
>
> My mistake, this assertion meant to be "if !PageHWPoison(subpage)", to
> make sure the caller of find_in_raw_hwp_list is sure that subpage is
> hw corrupted.
>
> >
> > [ 790.610985] ===> testcase 'mm/hwpoison/base/backend-anonymous_error-hard-offline_access-avoid.auto3' start
> > [ 793.304927] page:000000006743177b refcount:1 mapcount:0 mapping:0000000000000000 index:0x700000000 pfn:0x14d739
> > [ 793.309322] memcg:ffff8a30c50b6000
> > [ 793.310934] anon flags: 0x57ffffe08a001d(locked|uptodate|dirty|lru|mappedtodisk|swapbacked|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
> > [ 793.316665] raw: 0057ffffe08a001d ffffe93cc5353c88 ffffe93cc5685fc8 ffff8a30c91878f1
> > [ 793.320211] raw: 0000000700000000 0000000000000000 00000001ffffffff ffff8a30c50b6000
> > [ 793.323665] page dumped because: VM_BUG_ON_PAGE(PageHWPoison(subpage))
> > [ 793.326764] ------------[ cut here ]------------
> > [ 793.329080] kernel BUG at mm/memory-failure.c:1894!
> > [ 793.331895] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [ 793.334854] CPU: 4 PID: 2644 Comm: mceinj.sh Tainted: G E N 6.2.0-rc4-v6.2-rc2-230529-1404+ #63
> > [ 793.340710] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
> > [ 793.345875] RIP: 0010:hwpoison_user_mappings+0x654/0x780
> > [ 793.349066] Code: ef 89 de e8 6e bc f8 ff 48 8b 7c 24 20 48 83 c7 58 e8 10 bb d9 ff e9 5f fb ff ff 48 c7 c6 80 ce 4c b1 4c 89 ef e8 1c 38 f6 ff <0f> 0b 48 c7 c6 7b c8 4c b1 4c 89 ef e8 0b 38 f6 ff 0f 0b 8b 45 58
> > [ 793.359732] RSP: 0018:ffffa3ff85ed3d28 EFLAGS: 00010296
> > [ 793.362367] RAX: 000000000000003a RBX: 0000000000000018 RCX: 0000000000000000
> > [ 793.365763] RDX: 0000000000000001 RSI: ffffffffb14ac451 RDI: 00000000ffffffff
> > [ 793.368698] RBP: ffffe93cc535ce40 R08: 0000000000000000 R09: ffffa3ff85ed3ba0
> > [ 793.370837] R10: 0000000000000003 R11: ffffffffb1d3ed28 R12: 000000000014d739
> > [ 793.372903] R13: ffffe93cc535ce40 R14: ffffe93cc535ce40 R15: ffffe93cc535ce40
> > [ 793.374931] FS: 00007f6ccc42a740(0000) GS:ffff8a31bbc00000(0000) knlGS:0000000000000000
> > [ 793.377136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 793.378656] CR2: 0000561aad6474b2 CR3: 00000001492d4005 CR4: 0000000000170ee0
> > [ 793.380514] DR0: ffffffffb28ed7d0 DR1: ffffffffb28ed7d1 DR2: ffffffffb28ed7d2
> > [ 793.382296] DR3: ffffffffb28ed7d3 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> > [ 793.384028] Call Trace:
> > [ 793.384655] <TASK>
> > [ 793.385210] ? __lru_add_drain_all+0x164/0x1f0
> > [ 793.386316] memory_failure+0x352/0xaa0
> > [ 793.387249] ? __pfx_bpf_lsm_capable+0x10/0x10
> > [ 793.388323] ? __pfx_security_capable+0x10/0x10
> > [ 793.389350] hard_offline_page_store+0x46/0x80
> > [ 793.390397] kernfs_fop_write_iter+0x11e/0x200
> > [ 793.391441] vfs_write+0x1e4/0x3a0
> > [ 793.392221] ksys_write+0x53/0xd0
> > [ 793.392976] do_syscall_64+0x3a/0x90
> > [ 793.393790] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >
> > I'm wondering how this code path is called, one possible path is like this:
> >
> > hwpoison_user_mappings
> > if PageHuge(hpage) && !PageAnon(hpage)
> > try_to_split_huge_mapping()
> > find_in_raw_hwp_list
> > VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
> >
> > but this looks unlikely because the precheck "PageHuge(hpage) && !PageAnon(hpage)" is
> > false for anonymous pages.
> >
> > Another possible code path is:
> >
> > hwpoison_user_mappings
> > if PageHuge(hpage) && !PageAnon(hpage)
> > ...
> > else
> > try_to_unmap
> > rmap_walk
> > rmap_walk_anon
> > try_to_unmap_one
> > if folio_test_hugetlb
> > if hgm_eligible
> > find_in_raw_hwp_list
> > VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
> >
> > but this looks also unlikely because of checking folio_test_hugetlb and hgm_eligible
> > (I think both are false in this testcase.)
> > Maybe I miss something (and I'll dig this more), but let me share the issue.
>
> I bet it is in "is_unmapping_successful". So another problem with this
> patch is, "is_unmapping_successful" should only calls
> find_in_raw_hwp_list after it handles non hugetlb and non shared
> mapping, i.e.:
>
> struct raw_hwp_page *hwp_page = NULL;
>
> if (!folio_test_hugetlb(folio) ||
> folio_test_anon(folio) ||
> !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
> ...
> }
>
> hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
> VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
>
> I will make sure these two issues get fixed up in follow-up revisions.
>
> >
> > Thanks,
> > Naoya Horiguchi

Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> the raw HWPOISON page (previously split and mapped at PTE size).
> If HGM failed to be enabled on eligible VMA or splitting failed,
> try_to_unmap_one fails.
>
> For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> the whole P*D.
>
> When only the raw HWPOISON subpage is unmapped but others keep mapped,
> the old way in memory_failure to check if unmapping successful doesn't
> work. So introduce is_unmapping_successful() to cover both existing and
> new unmapping behavior.
>
> For the new unmapping behavior, store how many times a raw HWPOISON page
> is expected to be unmapped, and how many times it is actually unmapped
> in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> unmap_success = (nr_expected_unamps == nr_actual_unmaps).

s/nr_expected_unamps/nr_expected_unmaps/

>
> Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> recovery actions again because recovery used to be done on the entire
> hugepage. With the new unmapping behavior, this doesn't hold. More
> subpages in the hugepage can become corrupted, and needs to be recovered
> (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> 0 after adding a new raw subpage to raw_hwp_list.
>
> Unmapping raw HWPOISON page requires allocating raw_hwp_page
> successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> now may fail due to OOM.
>
> Signed-off-by: Jiaqi Yan <[email protected]>
> ---
> include/linux/mm.h | 20 ++++++-
> mm/memory-failure.c | 140 ++++++++++++++++++++++++++++++++++++++------
> mm/rmap.c | 38 +++++++++++-
> 3 files changed, 175 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4496d7bdd3ea..dc192f98cb1d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3522,20 +3522,38 @@ enum mf_action_page_type {
> */
> extern const struct attribute_group memory_failure_attr_group;
>
> -#ifdef CONFIG_HUGETLB_PAGE
> /*
> * Struct raw_hwp_page represents information about "raw error page",
> * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
> + * @node: the node in folio->_hugetlb_hwpoison list.
> + * @page: the raw HWPOISON page struct.
> + * @nr_vmas_mapped: the number of VMAs that map @page when detected.
> + * @nr_expected_unmaps: if a VMA that maps @page when detected is eligible
> + * for high granularity mapping, @page is expected to be unmapped.
> + * @nr_actual_unmaps: how many times the raw page is actually unmapped.
> */
> struct raw_hwp_page {
> struct llist_node node;
> struct page *page;
> + int nr_vmas_mapped;
> + int nr_expected_unmaps;
> + int nr_actual_unmaps;
> };
>
> +#ifdef CONFIG_HUGETLB_PAGE
> static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
> {
> return (struct llist_head *)&folio->_hugetlb_hwpoison;
> }
> +
> +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage);
> +#else
> +static inline struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage)
> +{
> + return NULL;
> +}
> #endif
>
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 48e62d04af17..47b935918ceb 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1120,10 +1120,10 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
> }
>
> /*
> - * Huge pages. Needs work.
> - * Issues:
> - * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
> - * To narrow down kill region to one page, we need to break up pmd.
> + * Huge pages.
> + * - Without HGM: Error on hugepage is contained in hugepage unit (not in
> + * raw page unit).
> + * - With HGM: Kill region is narrowed down to just one page.
> */
> static int me_huge_page(struct page_state *ps, struct page *p)
> {
> @@ -1131,6 +1131,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> struct page *hpage = compound_head(p);
> struct address_space *mapping;
> bool extra_pins = false;
> + struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(page_folio(p), p);
>
> if (!PageHuge(hpage))
> return MF_DELAYED;
> @@ -1157,7 +1158,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> }
> }
>
> - if (has_extra_refcount(ps, p, extra_pins))
> + if (hwp_page->nr_expected_unmaps == 0 &&

This additional check seems unclear to me, so could you add a comment about
the intention?

I think that "hwp_page->nr_expected_unmaps == 0" means no mapping is split.
This is OK because it just keeps the current behavior (where no HGM happens).
But thinking about the "partial" case where some VMAs use HGM and some do not,
the has_extra_refcount() check still seems meaningful, so maybe a more
restricted check like "hwp_page->nr_expected_unmaps < hwp_page->nr_vmas_mapped"
could be better, as sketched below.

> + has_extra_refcount(ps, p, extra_pins))
> res = MF_FAILED;
>
> return res;
> @@ -1497,24 +1499,30 @@ static void try_to_split_huge_mapping(struct folio *folio,
> unsigned long poisoned_addr;
> unsigned long head_addr;
> struct hugetlb_pte hpte;
> + struct raw_hwp_page *hwp_page = NULL;
>
> if (WARN_ON(!mapping))
> return;
>
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>
> + hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
> + VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
> +
> pgoff_start = folio_pgoff(folio);
> pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
>
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
> + ++hwp_page->nr_vmas_mapped;
> +
> /* Enable HGM on HGM-eligible VMAs. */
> if (!hugetlb_hgm_eligible(vma))
> continue;
>
> i_mmap_assert_locked(vma->vm_file->f_mapping);
> if (hugetlb_enable_hgm_vma(vma)) {
> - pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
> - vma->vm_start, vma->vm_end);
> + pr_err("%#lx: failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
> + page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);

page_to_pfn(poisoned_page) is called multiple times in this function, so its
result could be stored in a local variable.
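
For example (illustrative):

    unsigned long pfn = page_to_pfn(poisoned_page);
    ...
    pr_err("%#lx: failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
           pfn, vma->vm_start, vma->vm_end);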

> continue;
> }
>
> @@ -1528,15 +1536,21 @@ static void try_to_split_huge_mapping(struct folio *folio,
> * lock on vma->vm_file->f_mapping, which caller
> * (e.g. hwpoison_user_mappings) should already acquired.
> */
> - if (hugetlb_full_walk(&hpte, vma, head_addr))
> + if (hugetlb_full_walk(&hpte, vma, head_addr)) {
> + pr_err("%#lx: failed to PT-walk with HGM on eligible VMA=[%#lx, %#lx)\n",
> + page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);
> continue;
> + }
>
> if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte,
> poisoned_addr, PAGE_SHIFT)) {
> - pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n",
> + pr_err("%#lx: Failed to split huge mapping: vaddr=%#lx in VMA=[%#lx, %#lx)\n",
> page_to_pfn(poisoned_page), poisoned_addr,
> vma->vm_start, vma->vm_end);
> + continue;
> }
> +
> + ++hwp_page->nr_expected_unmaps;
> }
> }
> #else
> @@ -1546,6 +1560,47 @@ static void try_to_split_huge_mapping(struct folio *folio,
> }
> #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
> +static bool is_unmapping_successful(struct folio *folio,
> + struct page *poisoned_page)
> +{
> + bool unmap_success = false;
> + struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(folio, poisoned_page);

As a small optimization, this find_in_raw_hwp_list() can be called after the
if-block below, because hwp_page is used only after it.

> +
> + if (!folio_test_hugetlb(folio) ||
> + folio_test_anon(folio) ||
> + !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
> + unmap_success = folio_mapped(folio);

Maybe you mean "unmap_success = !folio_mapped(folio)" ?

> + if (!unmap_success)
> + pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
> + page_to_pfn(poisoned_page),
> + page_mapcount(folio_page(folio, 0)));
> +
> + return unmap_success;
> + }
> +
> + VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
> +
> + /*
> + * Unmapping may not happen for some VMA:
> + * - HGM-eligible VMA but @poisoned_page is not faulted yet: nothing
> + * needs to be done at this point yet until page fault handling.
> + * - HGM-non-eliggible VMA: mapcount decreases by nr_subpages for each VMA,

s/eliggible/eligible/

> + * but not tracked so cannot tell if successfully unmapped from such VMA.
> + */
> + if (hwp_page->nr_vmas_mapped != hwp_page->nr_expected_unmaps)
> + pr_info("%#lx: mapped by %d VMAs but %d unmappings are expected\n",
> + page_to_pfn(poisoned_page), hwp_page->nr_vmas_mapped,
> + hwp_page->nr_expected_unmaps);
> +
> + unmap_success = hwp_page->nr_expected_unmaps == hwp_page->nr_actual_unmaps;
> +
> + if (!unmap_success)
> + pr_err("%#lx: failed to unmap page (folio_mapcount=%d)\n",
> + page_to_pfn(poisoned_page), folio_mapcount(folio));
> +
> + return unmap_success;
> +}
> +
> /*
> * Do all that is necessary to remove user space mappings. Unmap
> * the pages and send SIGBUS to the processes if the data was dirty.
> @@ -1631,10 +1686,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
> try_to_unmap(folio, ttu);
> }
>
> - unmap_success = !page_mapped(hpage);
> - if (!unmap_success)
> - pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
> - pfn, page_mapcount(hpage));
> + unmap_success = is_unmapping_successful(folio, p);
>
> /*
> * try_to_unmap() might put mlocked page in lru cache, so call
> @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
>
> #ifdef CONFIG_HUGETLB_PAGE
>
> +/*
> + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> + */
> +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage)
> +{
> + struct llist_node *t, *tnode;
> + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> + struct raw_hwp_page *hwp_page = NULL;
> + struct raw_hwp_page *p;
> +
> + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
> +
> + llist_for_each_safe(tnode, t, raw_hwp_head->first) {
> + p = container_of(tnode, struct raw_hwp_page, node);
> + if (subpage == p->page) {
> + hwp_page = p;
> + break;
> + }
> + }
> +
> + return hwp_page;
> +}
> +
> static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> {
> struct llist_head *head;
> @@ -1837,6 +1914,9 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> llist_for_each_safe(tnode, t, head->first) {
> struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
>
> + /* Ideally raw HWPoison pages are fully unmapped if possible. */
> + WARN_ON(p->nr_expected_unmaps != p->nr_actual_unmaps);
> +
> if (move_flag)
> SetPageHWPoison(p->page);
> else
> @@ -1853,7 +1933,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> struct llist_head *head;
> struct raw_hwp_page *raw_hwp;
> struct llist_node *t, *tnode;
> - int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> + bool has_hwpoison = folio_test_set_hwpoison(folio);
> + bool hgm_enabled = IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING);
>
> /*
> * Once the hwpoison hugepage has lost reliable raw error info,
> @@ -1873,9 +1954,20 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
> if (raw_hwp) {
> raw_hwp->page = page;
> + raw_hwp->nr_vmas_mapped = 0;
> + raw_hwp->nr_expected_unmaps = 0;
> + raw_hwp->nr_actual_unmaps = 0;
> llist_add(&raw_hwp->node, head);
> + if (hgm_enabled)
> + /*
> + * A new raw poisoned page. Don't return
> + * HWPOISON. Error event will be counted

(Sorry nitpicking ...)
"Don't return -EHWPOISON" ?

> + * in action_result().
> + */
> + return 0;
> +
> /* the first error event will be counted in action_result(). */
> - if (ret)
> + if (has_hwpoison)
> num_poisoned_pages_inc(page_to_pfn(page));
> } else {
> /*
> @@ -1889,8 +1981,16 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> * used any more, so free it.
> */
> __folio_free_raw_hwp(folio, false);
> +
> + /*
> + * HGM relies on raw_hwp allocated and inserted to raw_hwp_list.
> + */
> + if (hgm_enabled)
> + return -ENOMEM;
> }
> - return ret;
> +
> + BUG_ON(hgm_enabled);
> + return has_hwpoison ? -EHWPOISON : 0;
> }
>
> static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> @@ -1936,6 +2036,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> struct page *page = pfn_to_page(pfn);
> struct folio *folio = page_folio(page);
> int ret = 2; /* fallback to normal page handling */
> + int set_page_hwpoison = 0;
> bool count_increased = false;
>
> if (!folio_test_hugetlb(folio))
> @@ -1956,8 +2057,9 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> goto out;
> }
>
> - if (folio_set_hugetlb_hwpoison(folio, page)) {
> - ret = -EHWPOISON;
> + set_page_hwpoison = folio_set_hugetlb_hwpoison(folio, page);
> + if (set_page_hwpoison) {
> + ret = set_page_hwpoison;

This adds a new error return code, so could you update the comment on this function?

Thanks,
Naoya Horiguchi

> goto out;
> }
>
> @@ -2004,7 +2106,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> res = kill_accessing_process(current, folio_pfn(folio), flags);
> }
> return res;
> - } else if (res == -EBUSY) {
> + } else if (res == -EBUSY || res == -ENOMEM) {
> if (!(flags & MF_NO_RETRY)) {
> flags |= MF_NO_RETRY;
> goto retry;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d3bc81466902..4cfaa34b001e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1453,6 +1453,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> struct mmu_notifier_range range;
> enum ttu_flags flags = (enum ttu_flags)(long)arg;
> bool page_poisoned;
> + bool hgm_eligible = hugetlb_hgm_eligible(vma);
> + struct raw_hwp_page *hwp_page;
>
> /*
> * When racing against e.g. zap_pte_range() on another cpu,
> @@ -1525,6 +1527,29 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * in the case where the hugetlb page is poisoned.
> */
> VM_BUG_ON_FOLIO(!page_poisoned, folio);
> +
> + /*
> + * When VMA is not HGM eligible, unmap at hugepage's
> + * original P*D.
> + *
> + * When HGM is eligible:
> + * - if original P*D is split to smaller P*Ds and
> + * PTEs, we skip subpage if it is not raw HWPoison
> + * page, or it was but was already unmapped.
> + * - if original P*D is not split, skip unmapping
> + * and memory_failure result will be MF_IGNORED.
> + */
> + if (hgm_eligible) {
> + if (pvmw.pte_order > 0)
> + continue;
> + hwp_page = find_in_raw_hwp_list(folio, subpage);
> + if (hwp_page == NULL)
> + continue;
> + if (hwp_page->nr_expected_unmaps ==
> + hwp_page->nr_actual_unmaps)
> + continue;
> + }
> +
> /*
> * huge_pmd_unshare may unmap an entire PMD page.
> * There is no way of knowing exactly which PMDs may
> @@ -1760,12 +1785,19 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> *
> * See Documentation/mm/mmu_notifier.rst
> */
> - if (folio_test_hugetlb(folio))
> + if (!folio_test_hugetlb(folio))
> + page_remove_rmap(subpage, vma, false);
> + else {
> hugetlb_remove_rmap(subpage,
> pvmw.pte_order + PAGE_SHIFT,
> hstate_vma(vma), vma);
> - else
> - page_remove_rmap(subpage, vma, false);
> + if (hgm_eligible) {
> + VM_BUG_ON_FOLIO(pvmw.pte_order > 0, folio);
> + VM_BUG_ON_FOLIO(!hwp_page, folio);
> + VM_BUG_ON_FOLIO(subpage != hwp_page->page, folio);
> + ++hwp_page->nr_actual_unmaps;
> + }
> + }
>
> if (vma->vm_flags & VM_LOCKED)
> mlock_drain_local();
> --
> 2.40.1.495.gc816e09b53d-goog
>
>
>