2023-02-17 08:55:12

by David Stevens

Subject: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

From: David Stevens <[email protected]>

Fix two races in khugepaged+shmem that cause issues with userfaultfd and
lseek, respectively.

v3 -> v4:
- Base changes on mm-everything (fba720cb4dc0)
- Add patch to refactor error handling control flow in collapse_file
- Rebase userfaultfd patch with no significant logic changes
- Different approach for fixing lseek race
v2 -> v3:
- Use XA_RETRY_ENTRY to synchronize with reads from the page cache
under the RCU read lock in userfaultfd fix
- Add patch to fix lseek race
v1 -> v2:
- Different approach for userfaultfd fix

David Stevens (3):
mm/khugepaged: refactor collapse_file control flow
mm/khugepaged: skip shmem with userfaultfd
mm/khugepaged: maintain page cache uptodate flag

include/trace/events/huge_memory.h | 3 +-
mm/khugepaged.c | 263 ++++++++++++++++-------------
2 files changed, 144 insertions(+), 122 deletions(-)

--
2.39.2.637.g21b0678d19-goog



2023-02-17 08:55:24

by David Stevens

Subject: [PATCH v4 1/3] mm/khugepaged: refactor collapse_file control flow

From: David Stevens <[email protected]>

Add a rollback label to deal with failure, instead of continuously
checking for RESULT_SUCCESS, to make it easier to add more failure
cases. The refactoring also allows the collapse_file tracepoint to
include hpage on success (instead of NULL).
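The control-flow change can be illustrated with a toy sketch (this is not kernel code; the names and failure cases are hypothetical stand-ins for the SCAN_* results in collapse_file):

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's scan_result values. */
enum result { SUCCEED, FAIL_NOMEM, FAIL_COPY_MC };

/* Sketch of the refactored shape: each failure jumps to one rollback
 * label that undoes partial state, instead of the old pattern of
 * wrapping every later stage in "if (result == SUCCEED)". */
static enum result collapse(int fail_at)
{
    enum result result = SUCCEED;
    char *scratch = malloc(16);   /* partial state to undo on failure */

    if (!scratch)
        return FAIL_NOMEM;

    if (fail_at == 1) {           /* e.g. slot reservation fails */
        result = FAIL_NOMEM;
        goto rollback;
    }
    if (fail_at == 2) {           /* e.g. copy hits a poisoned page */
        result = FAIL_COPY_MC;
        goto rollback;
    }

    free(scratch);                /* success path commits and returns */
    return result;

rollback:
    free(scratch);                /* one place to roll back everything */
    return result;
}
```

Adding a new failure case then only requires setting result and jumping to rollback, which is what the later patches in this series rely on.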

Signed-off-by: David Stevens <[email protected]>
---
mm/khugepaged.c | 223 ++++++++++++++++++++++++------------------------
1 file changed, 110 insertions(+), 113 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8dbc39896811..6a3d6d2e25e0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1885,6 +1885,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
if (result != SCAN_SUCCEED)
goto out;

+ __SetPageLocked(hpage);
+ if (is_shmem)
+ __SetPageSwapBacked(hpage);
+ hpage->index = start;
+ hpage->mapping = mapping;
+
/*
* Ensure we have slots for all the pages in the range. This is
* almost certainly a no-op because most of the pages must be present
@@ -1897,16 +1903,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
xas_unlock_irq(&xas);
if (!xas_nomem(&xas, GFP_KERNEL)) {
result = SCAN_FAIL;
- goto out;
+ goto rollback;
}
} while (1);

- __SetPageLocked(hpage);
- if (is_shmem)
- __SetPageSwapBacked(hpage);
- hpage->index = start;
- hpage->mapping = mapping;
-
/*
* At this point the hpage is locked and not up-to-date.
* It's safe to insert it into the page cache, because nobody would
@@ -2123,131 +2123,128 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
*/
try_to_unmap_flush();

- if (result == SCAN_SUCCEED) {
- /*
- * Replacing old pages with new one has succeeded, now we
- * attempt to copy the contents.
- */
- index = start;
- list_for_each_entry(page, &pagelist, lru) {
- while (index < page->index) {
- clear_highpage(hpage + (index % HPAGE_PMD_NR));
- index++;
- }
- if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
- page) > 0) {
- result = SCAN_COPY_MC;
- break;
- }
- index++;
- }
- while (result == SCAN_SUCCEED && index < end) {
+ if (result != SCAN_SUCCEED)
+ goto rollback;
+
+ /*
+ * Replacing old pages with new one has succeeded, now we
+ * attempt to copy the contents.
+ */
+ index = start;
+ list_for_each_entry(page, &pagelist, lru) {
+ while (index < page->index) {
clear_highpage(hpage + (index % HPAGE_PMD_NR));
index++;
}
+ if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
+ page) > 0) {
+ result = SCAN_COPY_MC;
+ goto rollback;
+ }
+ index++;
+ }
+ while (index < end) {
+ clear_highpage(hpage + (index % HPAGE_PMD_NR));
+ index++;
}

- if (result == SCAN_SUCCEED) {
- /*
- * Copying old pages to huge one has succeeded, now we
- * need to free the old pages.
- */
- list_for_each_entry_safe(page, tmp, &pagelist, lru) {
- list_del(&page->lru);
- page->mapping = NULL;
- page_ref_unfreeze(page, 1);
- ClearPageActive(page);
- ClearPageUnevictable(page);
- unlock_page(page);
- put_page(page);
- }
+ /*
+ * Copying old pages to huge one has succeeded, now we
+ * need to free the old pages.
+ */
+ list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+ list_del(&page->lru);
+ page->mapping = NULL;
+ page_ref_unfreeze(page, 1);
+ ClearPageActive(page);
+ ClearPageUnevictable(page);
+ unlock_page(page);
+ put_page(page);
+ }

- xas_lock_irq(&xas);
- if (is_shmem)
- __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
- else
- __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
+ xas_lock_irq(&xas);
+ if (is_shmem)
+ __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
+ else
+ __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
+
+ if (nr_none) {
+ __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
+ /* nr_none is always 0 for non-shmem. */
+ __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
+ }
+ /* Join all the small entries into a single multi-index entry. */
+ xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+ xas_store(&xas, hpage);
+ xas_unlock_irq(&xas);

- if (nr_none) {
- __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
- /* nr_none is always 0 for non-shmem. */
- __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
- }
- /* Join all the small entries into a single multi-index entry. */
- xas_set_order(&xas, start, HPAGE_PMD_ORDER);
- xas_store(&xas, hpage);
- xas_unlock_irq(&xas);
+ folio = page_folio(hpage);
+ folio_mark_uptodate(folio);
+ folio_ref_add(folio, HPAGE_PMD_NR - 1);

- folio = page_folio(hpage);
- folio_mark_uptodate(folio);
- folio_ref_add(folio, HPAGE_PMD_NR - 1);
+ if (is_shmem)
+ folio_mark_dirty(folio);
+ folio_add_lru(folio);

- if (is_shmem)
- folio_mark_dirty(folio);
- folio_add_lru(folio);
+ /*
+ * Remove pte page tables, so we can re-fault the page as huge.
+ */
+ result = retract_page_tables(mapping, start, mm, addr, hpage,
+ cc);
+ unlock_page(hpage);
+ goto out;
+
+rollback:
+ /* Something went wrong: roll back page cache changes */
+ xas_lock_irq(&xas);
+ if (nr_none) {
+ mapping->nrpages -= nr_none;
+ shmem_uncharge(mapping->host, nr_none);
+ }

- /*
- * Remove pte page tables, so we can re-fault the page as huge.
- */
- result = retract_page_tables(mapping, start, mm, addr, hpage,
- cc);
- unlock_page(hpage);
- hpage = NULL;
- } else {
- /* Something went wrong: roll back page cache changes */
- xas_lock_irq(&xas);
- if (nr_none) {
- mapping->nrpages -= nr_none;
- shmem_uncharge(mapping->host, nr_none);
+ xas_set(&xas, start);
+ xas_for_each(&xas, page, end - 1) {
+ page = list_first_entry_or_null(&pagelist,
+ struct page, lru);
+ if (!page || xas.xa_index < page->index) {
+ if (!nr_none)
+ break;
+ nr_none--;
+ /* Put holes back where they were */
+ xas_store(&xas, NULL);
+ continue;
}

- xas_set(&xas, start);
- xas_for_each(&xas, page, end - 1) {
- page = list_first_entry_or_null(&pagelist,
- struct page, lru);
- if (!page || xas.xa_index < page->index) {
- if (!nr_none)
- break;
- nr_none--;
- /* Put holes back where they were */
- xas_store(&xas, NULL);
- continue;
- }
+ VM_BUG_ON_PAGE(page->index != xas.xa_index, page);

- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
+ /* Unfreeze the page. */
+ list_del(&page->lru);
+ page_ref_unfreeze(page, 2);
+ xas_store(&xas, page);
+ xas_pause(&xas);
+ xas_unlock_irq(&xas);
+ unlock_page(page);
+ putback_lru_page(page);
+ xas_lock_irq(&xas);
+ }
+ VM_BUG_ON(nr_none);
+ /*
+ * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
+ * This undo is not needed unless failure is due to SCAN_COPY_MC.
+ */
+ if (!is_shmem && result == SCAN_COPY_MC)
+ filemap_nr_thps_dec(mapping);

- /* Unfreeze the page. */
- list_del(&page->lru);
- page_ref_unfreeze(page, 2);
- xas_store(&xas, page);
- xas_pause(&xas);
- xas_unlock_irq(&xas);
- unlock_page(page);
- putback_lru_page(page);
- xas_lock_irq(&xas);
- }
- VM_BUG_ON(nr_none);
- /*
- * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
- * This undo is not needed unless failure is due to SCAN_COPY_MC.
- */
- if (!is_shmem && result == SCAN_COPY_MC)
- filemap_nr_thps_dec(mapping);
+ xas_unlock_irq(&xas);

- xas_unlock_irq(&xas);
+ hpage->mapping = NULL;

- hpage->mapping = NULL;
- }
+ unlock_page(hpage);
+ mem_cgroup_uncharge(page_folio(hpage));
+ put_page(hpage);

- if (hpage)
- unlock_page(hpage);
out:
VM_BUG_ON(!list_empty(&pagelist));
- if (hpage) {
- mem_cgroup_uncharge(page_folio(hpage));
- put_page(hpage);
- }
-
trace_mm_khugepaged_collapse_file(mm, hpage, index, is_shmem, addr, file, nr, result);
return result;
}
--
2.39.2.637.g21b0678d19-goog


2023-02-17 08:55:30

by David Stevens

Subject: [PATCH v4 2/3] mm/khugepaged: skip shmem with userfaultfd

From: David Stevens <[email protected]>

Make sure that collapse_file respects any userfaultfds registered with
MODE_MISSING. If userspace has any such userfaultfds registered, then
for any page which it knows to be missing, it may expect a
UFFD_EVENT_PAGEFAULT. This means collapse_file needs to take care when
collapsing a shmem range that would replace missing pages with a THP, to
avoid breaking userfaultfd.

Synchronization when checking for userfaultfds in collapse_file is
tricky because the mmap locks can't be used to prevent races with the
registration of new userfaultfds. Instead, we provide synchronization by
ensuring that userspace cannot observe the fact that pages are missing
before we check for userfaultfds. Although this allows registration of a
userfaultfd to race with collapse_file, it ensures that userspace cannot
observe any pages transition from missing to present after such a race
occurs. This makes such a race indistinguishable from the collapse
occurring immediately before the userfaultfd registration.

The first step to provide this synchronization is to stop filling gaps
during the loop iterating over the target range, since the page cache
lock can be dropped during that loop. The second step is to fill the
gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
time, to avoid races with accesses to the page cache that only take the
RCU read lock.

The fact that we don't fill holes during the initial iteration means
that collapse_file now has to handle faults occurring during the
collapse. This is done by re-validating the number of missing pages
after acquiring the page cache lock for the final time.
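The re-validation step can be sketched as follows (an illustrative model only, not kernel code: the page cache range is a plain array, NULL models a missing page, and RETRY stands in for XA_RETRY_ENTRY):

```c
#include <stddef.h>

#define NR 8
#define RETRY ((void *)-1)   /* stand-in for XA_RETRY_ENTRY */

/* Under the (simulated) final page cache lock, fill every remaining
 * hole with a retry sentinel and re-count the holes. If the count no
 * longer matches what the earlier lock-dropping scan observed, a fault
 * has filled a page in the meantime and the collapse must be aborted.
 * The sentinels also hide the holes from readers that only take the
 * RCU read lock, until the collapse either commits or rolls back. */
static int revalidate_holes(void *cache[NR], int nr_none_expected)
{
    int nr_none_check = 0;
    int i;

    for (i = 0; i < NR; i++) {
        if (!cache[i]) {
            cache[i] = RETRY;
            nr_none_check++;
        }
    }
    return nr_none_check == nr_none_expected;  /* 0 => a fault raced us */
}
```

In the real patch this corresponds to the nr_none vs. nr_none_check comparison that produces SCAN_PAGE_FILLED, followed by the userfaultfd_missing() check under i_mmap_lock_read.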

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
return EBUSY if there are any missing pages (instead of succeeding on
shmem and returning EINVAL on anonymous memory). There is also now a
window during MADV_COLLAPSE where a fault on a missing page will cause
the syscall to fail with EAGAIN.

The fact that intermediate page cache state can no longer be observed
before the rollback of a failed collapse is also technically a
userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
is exceedingly unlikely that anything relies on being able to observe
that transient state.

Signed-off-by: David Stevens <[email protected]>
---
include/trace/events/huge_memory.h | 3 +-
mm/khugepaged.c | 92 +++++++++++++++++++++++-------
2 files changed, 73 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 46cce509957b..7ee85fff89a3 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -37,7 +37,8 @@
EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
EM( SCAN_TRUNCATED, "truncated") \
EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \
- EMe(SCAN_COPY_MC, "copy_poisoned_page") \
+ EM( SCAN_COPY_MC, "copy_poisoned_page") \
+ EMe(SCAN_PAGE_FILLED, "page_filled") \

#undef EM
#undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6a3d6d2e25e0..1c37f9151345 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -57,6 +57,7 @@ enum scan_result {
SCAN_TRUNCATED,
SCAN_PAGE_HAS_PRIVATE,
SCAN_COPY_MC,
+ SCAN_PAGE_FILLED,
};

#define CREATE_TRACE_POINTS
@@ -1851,8 +1852,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
* - allocate and lock a new huge page;
* - scan page cache replacing old pages with the new one
* + swap/gup in pages if necessary;
- * + fill in gaps;
* + keep old pages around in case rollback is required;
+ * - finalize updates to the page cache;
* - if replacing succeeds:
* + copy data over;
* + free old pages;
@@ -1930,13 +1931,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
result = SCAN_TRUNCATED;
goto xa_locked;
}
- xas_set(&xas, index);
+ xas_set(&xas, index + 1);
}
if (!shmem_charge(mapping->host, 1)) {
result = SCAN_FAIL;
goto xa_locked;
}
- xas_store(&xas, hpage);
nr_none++;
continue;
}
@@ -2148,21 +2148,57 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
index++;
}

- /*
- * Copying old pages to huge one has succeeded, now we
- * need to free the old pages.
- */
- list_for_each_entry_safe(page, tmp, &pagelist, lru) {
- list_del(&page->lru);
- page->mapping = NULL;
- page_ref_unfreeze(page, 1);
- ClearPageActive(page);
- ClearPageUnevictable(page);
- unlock_page(page);
- put_page(page);
+ if (nr_none) {
+ struct vm_area_struct *vma;
+ int nr_none_check = 0;
+
+ i_mmap_lock_read(mapping);
+ xas_lock_irq(&xas);
+
+ xas_set(&xas, start);
+ for (index = start; index < end; index++) {
+ if (!xas_next(&xas)) {
+ xas_store(&xas, XA_RETRY_ENTRY);
+ nr_none_check++;
+ }
+ }
+
+ if (nr_none != nr_none_check) {
+ result = SCAN_PAGE_FILLED;
+ goto immap_locked;
+ }
+
+ /*
+ * If userspace observed a missing page in a VMA with an armed
+ * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
+ * that page, so we need to roll back to avoid suppressing such
+ * an event. Any userfaultfds armed after this point will not be
+ * able to observe any missing pages due to the previously
+ * inserted retry entries.
+ */
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
+ if (userfaultfd_missing(vma)) {
+ result = SCAN_EXCEED_NONE_PTE;
+ goto immap_locked;
+ }
+ }
+
+immap_locked:
+ i_mmap_unlock_read(mapping);
+ if (result != SCAN_SUCCEED) {
+ xas_set(&xas, start);
+ for (index = start; index < end; index++) {
+ if (xas_next(&xas) == XA_RETRY_ENTRY)
+ xas_store(&xas, NULL);
+ }
+
+ xas_unlock_irq(&xas);
+ goto rollback;
+ }
+ } else {
+ xas_lock_irq(&xas);
}

- xas_lock_irq(&xas);
if (is_shmem)
__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
else
@@ -2192,6 +2228,20 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
result = retract_page_tables(mapping, start, mm, addr, hpage,
cc);
unlock_page(hpage);
+
+ /*
+ * The collapse has succeeded, so free the old pages.
+ */
+ list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+ list_del(&page->lru);
+ page->mapping = NULL;
+ page_ref_unfreeze(page, 1);
+ ClearPageActive(page);
+ ClearPageUnevictable(page);
+ unlock_page(page);
+ put_page(page);
+ }
+
goto out;

rollback:
@@ -2203,15 +2253,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
}

xas_set(&xas, start);
- xas_for_each(&xas, page, end - 1) {
+ end = index;
+ for (index = start; index < end; index++) {
+ xas_next(&xas);
page = list_first_entry_or_null(&pagelist,
struct page, lru);
if (!page || xas.xa_index < page->index) {
- if (!nr_none)
- break;
nr_none--;
- /* Put holes back where they were */
- xas_store(&xas, NULL);
continue;
}

@@ -2730,12 +2778,14 @@ static int madvise_collapse_errno(enum scan_result r)
case SCAN_ALLOC_HUGE_PAGE_FAIL:
return -ENOMEM;
case SCAN_CGROUP_CHARGE_FAIL:
+ case SCAN_EXCEED_NONE_PTE:
return -EBUSY;
/* Resource temporary unavailable - trying again might succeed */
case SCAN_PAGE_COUNT:
case SCAN_PAGE_LOCK:
case SCAN_PAGE_LRU:
case SCAN_DEL_PAGE_LRU:
+ case SCAN_PAGE_FILLED:
return -EAGAIN;
/*
* Other: Trying again likely not to succeed / error intrinsic to
--
2.39.2.637.g21b0678d19-goog


2023-02-17 08:55:34

by David Stevens

Subject: [PATCH v4 3/3] mm/khugepaged: maintain page cache uptodate flag

From: David Stevens <[email protected]>

Make sure that collapse_file doesn't interfere with checking the
uptodate flag in the page cache by only inserting hpage into the page
cache after it has been updated and marked uptodate. This is achieved by
simply not replacing present pages with hpage when iterating over the
target range. The present pages are already locked, so replacing them
with the locked hpage before the collapse is finalized is unnecessary.

This fixes a race where folio_seek_hole_data would mistake hpage for
a fallocated but unwritten page. This race is visible to userspace via
data temporarily disappearing from SEEK_DATA/SEEK_HOLE.
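The ordering invariant can be shown with a toy model (not kernel code; single slot, no locking, hypothetical names) in which the huge page is published to the page cache only after being marked uptodate, so a SEEK_DATA-style scan never observes a present-but-not-uptodate entry:

```c
#include <stddef.h>

struct page { int uptodate; };

static struct page *slot;   /* models one page-cache slot */

/* Models folio_seek_hole_data: a present but !uptodate page is treated
 * like a hole (a fallocated but unwritten page), which is exactly the
 * misreading the patch prevents. */
static int seek_data_sees_data(void)
{
    return slot != NULL && slot->uptodate;
}

/* The fixed ordering: mark uptodate first, then insert. Inserting
 * first (the old order) would open a window where the scan above
 * mistakes real data for a hole. */
static void collapse_publish(struct page *hpage)
{
    hpage->uptodate = 1;    /* folio_mark_uptodate() */
    slot = hpage;           /* xas_store() of the multi-index entry */
}
```

Before the patch, the equivalent of slot = hpage happened early in collapse_file, while the uptodate bit was only set at the end.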

Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: David Stevens <[email protected]>
---
mm/khugepaged.c | 50 ++++++++++++-------------------------------------
1 file changed, 12 insertions(+), 38 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1c37f9151345..e08cf7c5ebdf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1908,12 +1908,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
}
} while (1);

- /*
- * At this point the hpage is locked and not up-to-date.
- * It's safe to insert it into the page cache, because nobody would
- * be able to map it or use it in another way until we unlock it.
- */
-
xas_set(&xas, start);
for (index = start; index < end; index++) {
page = xas_next(&xas);
@@ -2082,13 +2076,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
}

/*
- * Add the page to the list to be able to undo the collapse if
- * something go wrong.
+ * Accumulate the pages that are being collapsed.
*/
list_add_tail(&page->lru, &pagelist);
-
- /* Finally, replace with the new page. */
- xas_store(&xas, hpage);
continue;
out_unlock:
unlock_page(page);
@@ -2127,8 +2117,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
goto rollback;

/*
- * Replacing old pages with new one has succeeded, now we
- * attempt to copy the contents.
+ * The old pages are locked, so they won't change anymore.
*/
index = start;
list_for_each_entry(page, &pagelist, lru) {
@@ -2209,11 +2198,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
/* nr_none is always 0 for non-shmem. */
__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
}
- /* Join all the small entries into a single multi-index entry. */
- xas_set_order(&xas, start, HPAGE_PMD_ORDER);
- xas_store(&xas, hpage);
- xas_unlock_irq(&xas);

+ /*
+ * Mark hpage as uptodate before inserting it into the page cache so
+ * that it isn't mistaken for an fallocated but unwritten page.
+ */
folio = page_folio(hpage);
folio_mark_uptodate(folio);
folio_ref_add(folio, HPAGE_PMD_NR - 1);
@@ -2222,6 +2211,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
folio_mark_dirty(folio);
folio_add_lru(folio);

+ /* Join all the small entries into a single multi-index entry. */
+ xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+ xas_store(&xas, hpage);
+ xas_unlock_irq(&xas);
+
/*
* Remove pte page tables, so we can re-fault the page as huge.
*/
@@ -2246,36 +2240,18 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,

rollback:
/* Something went wrong: roll back page cache changes */
- xas_lock_irq(&xas);
if (nr_none) {
mapping->nrpages -= nr_none;
shmem_uncharge(mapping->host, nr_none);
}

- xas_set(&xas, start);
- end = index;
- for (index = start; index < end; index++) {
- xas_next(&xas);
- page = list_first_entry_or_null(&pagelist,
- struct page, lru);
- if (!page || xas.xa_index < page->index) {
- nr_none--;
- continue;
- }
-
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
-
+ list_for_each_entry_safe(page, tmp, &pagelist, lru) {
/* Unfreeze the page. */
list_del(&page->lru);
page_ref_unfreeze(page, 2);
- xas_store(&xas, page);
- xas_pause(&xas);
- xas_unlock_irq(&xas);
unlock_page(page);
putback_lru_page(page);
- xas_lock_irq(&xas);
}
- VM_BUG_ON(nr_none);
/*
* Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
* This undo is not needed unless failure is due to SCAN_COPY_MC.
@@ -2283,8 +2259,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
if (!is_shmem && result == SCAN_COPY_MC)
filemap_nr_thps_dec(mapping);

- xas_unlock_irq(&xas);
-
hpage->mapping = NULL;

unlock_page(hpage);
--
2.39.2.637.g21b0678d19-goog


2023-02-17 10:37:27

by Miko Larsson

Subject: Re: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

On Fri, 2023-02-17 at 17:54 +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Fix two races in khugepaged+shmem that cause issues with userfaultfd
> and
> lseek, respectively.
>
> v3 -> v4:
>  - Base changes on mm-everything (fba720cb4dc0)
>  - Add patch to refactor error handling control flow in collapse_file
>  - Rebase userfaultfd patch with no significant logic changes
>  - Different approach for fixing lseek race
> v2 -> v3:
>  - Use XA_RETRY_ENTRY to synchronize with reads from the page cache
>    under the RCU read lock in userfaultfd fix
>  - Add patch to fix lseek race
> v1 -> v2:
>  - Different approach for userfaultfd fix
>
> David Stevens (3):
>   mm/khugepaged: refactor collapse_file control flow
>   mm/khugepaged: skip shmem with userfaultfd
>   mm/khugepaged: maintain page cache uptodate flag
>
>  include/trace/events/huge_memory.h |   3 +-
>  mm/khugepaged.c                    | 263 ++++++++++++++++-------------
>  2 files changed, 144 insertions(+), 122 deletions(-)
>

Might want to Cc this to the stable mailing list.
--
~miko

2023-02-17 23:46:57

by Yang Shi

Subject: Re: [PATCH v4 1/3] mm/khugepaged: refactor collapse_file control flow

On Fri, Feb 17, 2023 at 12:55 AM David Stevens <[email protected]> wrote:
>
> From: David Stevens <[email protected]>
>
> Add a rollback label to deal with failure, instead of continuously
> checking for RESULT_SUCCESS, to make it easier to add more failure
> cases. The refactoring also allows the collapse_file tracepoint to
> include hpage on success (instead of NULL).
>
> Signed-off-by: David Stevens <[email protected]>

The refactor looks good to me. Reviewed-by: Yang Shi <[email protected]>

> ---
> mm/khugepaged.c | 223 ++++++++++++++++++++++++------------------------
> 1 file changed, 110 insertions(+), 113 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8dbc39896811..6a3d6d2e25e0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1885,6 +1885,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> if (result != SCAN_SUCCEED)
> goto out;
>
> + __SetPageLocked(hpage);
> + if (is_shmem)
> + __SetPageSwapBacked(hpage);
> + hpage->index = start;
> + hpage->mapping = mapping;
> +
> /*
> * Ensure we have slots for all the pages in the range. This is
> * almost certainly a no-op because most of the pages must be present
> @@ -1897,16 +1903,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> xas_unlock_irq(&xas);
> if (!xas_nomem(&xas, GFP_KERNEL)) {
> result = SCAN_FAIL;
> - goto out;
> + goto rollback;
> }
> } while (1);
>
> - __SetPageLocked(hpage);
> - if (is_shmem)
> - __SetPageSwapBacked(hpage);
> - hpage->index = start;
> - hpage->mapping = mapping;
> -
> /*
> * At this point the hpage is locked and not up-to-date.
> * It's safe to insert it into the page cache, because nobody would
> @@ -2123,131 +2123,128 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> */
> try_to_unmap_flush();
>
> - if (result == SCAN_SUCCEED) {
> - /*
> - * Replacing old pages with new one has succeeded, now we
> - * attempt to copy the contents.
> - */
> - index = start;
> - list_for_each_entry(page, &pagelist, lru) {
> - while (index < page->index) {
> - clear_highpage(hpage + (index % HPAGE_PMD_NR));
> - index++;
> - }
> - if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> - page) > 0) {
> - result = SCAN_COPY_MC;
> - break;
> - }
> - index++;
> - }
> - while (result == SCAN_SUCCEED && index < end) {
> + if (result != SCAN_SUCCEED)
> + goto rollback;
> +
> + /*
> + * Replacing old pages with new one has succeeded, now we
> + * attempt to copy the contents.
> + */
> + index = start;
> + list_for_each_entry(page, &pagelist, lru) {
> + while (index < page->index) {
> clear_highpage(hpage + (index % HPAGE_PMD_NR));
> index++;
> }
> + if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> + page) > 0) {
> + result = SCAN_COPY_MC;
> + goto rollback;
> + }
> + index++;
> + }
> + while (index < end) {
> + clear_highpage(hpage + (index % HPAGE_PMD_NR));
> + index++;
> }
>
> - if (result == SCAN_SUCCEED) {
> - /*
> - * Copying old pages to huge one has succeeded, now we
> - * need to free the old pages.
> - */
> - list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> - list_del(&page->lru);
> - page->mapping = NULL;
> - page_ref_unfreeze(page, 1);
> - ClearPageActive(page);
> - ClearPageUnevictable(page);
> - unlock_page(page);
> - put_page(page);
> - }
> + /*
> + * Copying old pages to huge one has succeeded, now we
> + * need to free the old pages.
> + */
> + list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> + list_del(&page->lru);
> + page->mapping = NULL;
> + page_ref_unfreeze(page, 1);
> + ClearPageActive(page);
> + ClearPageUnevictable(page);
> + unlock_page(page);
> + put_page(page);
> + }
>
> - xas_lock_irq(&xas);
> - if (is_shmem)
> - __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> - else
> - __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> + xas_lock_irq(&xas);
> + if (is_shmem)
> + __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> + else
> + __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> +
> + if (nr_none) {
> + __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> + /* nr_none is always 0 for non-shmem. */
> + __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> + }
> + /* Join all the small entries into a single multi-index entry. */
> + xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> + xas_store(&xas, hpage);
> + xas_unlock_irq(&xas);
>
> - if (nr_none) {
> - __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> - /* nr_none is always 0 for non-shmem. */
> - __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> - }
> - /* Join all the small entries into a single multi-index entry. */
> - xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> - xas_store(&xas, hpage);
> - xas_unlock_irq(&xas);
> + folio = page_folio(hpage);
> + folio_mark_uptodate(folio);
> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>
> - folio = page_folio(hpage);
> - folio_mark_uptodate(folio);
> - folio_ref_add(folio, HPAGE_PMD_NR - 1);
> + if (is_shmem)
> + folio_mark_dirty(folio);
> + folio_add_lru(folio);
>
> - if (is_shmem)
> - folio_mark_dirty(folio);
> - folio_add_lru(folio);
> + /*
> + * Remove pte page tables, so we can re-fault the page as huge.
> + */
> + result = retract_page_tables(mapping, start, mm, addr, hpage,
> + cc);
> + unlock_page(hpage);
> + goto out;
> +
> +rollback:
> + /* Something went wrong: roll back page cache changes */
> + xas_lock_irq(&xas);
> + if (nr_none) {
> + mapping->nrpages -= nr_none;
> + shmem_uncharge(mapping->host, nr_none);
> + }
>
> - /*
> - * Remove pte page tables, so we can re-fault the page as huge.
> - */
> - result = retract_page_tables(mapping, start, mm, addr, hpage,
> - cc);
> - unlock_page(hpage);
> - hpage = NULL;
> - } else {
> - /* Something went wrong: roll back page cache changes */
> - xas_lock_irq(&xas);
> - if (nr_none) {
> - mapping->nrpages -= nr_none;
> - shmem_uncharge(mapping->host, nr_none);
> + xas_set(&xas, start);
> + xas_for_each(&xas, page, end - 1) {
> + page = list_first_entry_or_null(&pagelist,
> + struct page, lru);
> + if (!page || xas.xa_index < page->index) {
> + if (!nr_none)
> + break;
> + nr_none--;
> + /* Put holes back where they were */
> + xas_store(&xas, NULL);
> + continue;
> }
>
> - xas_set(&xas, start);
> - xas_for_each(&xas, page, end - 1) {
> - page = list_first_entry_or_null(&pagelist,
> - struct page, lru);
> - if (!page || xas.xa_index < page->index) {
> - if (!nr_none)
> - break;
> - nr_none--;
> - /* Put holes back where they were */
> - xas_store(&xas, NULL);
> - continue;
> - }
> + VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>
> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> + /* Unfreeze the page. */
> + list_del(&page->lru);
> + page_ref_unfreeze(page, 2);
> + xas_store(&xas, page);
> + xas_pause(&xas);
> + xas_unlock_irq(&xas);
> + unlock_page(page);
> + putback_lru_page(page);
> + xas_lock_irq(&xas);
> + }
> + VM_BUG_ON(nr_none);
> + /*
> + * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> + * This undo is not needed unless failure is due to SCAN_COPY_MC.
> + */
> + if (!is_shmem && result == SCAN_COPY_MC)
> + filemap_nr_thps_dec(mapping);
>
> - /* Unfreeze the page. */
> - list_del(&page->lru);
> - page_ref_unfreeze(page, 2);
> - xas_store(&xas, page);
> - xas_pause(&xas);
> - xas_unlock_irq(&xas);
> - unlock_page(page);
> - putback_lru_page(page);
> - xas_lock_irq(&xas);
> - }
> - VM_BUG_ON(nr_none);
> - /*
> - * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> - * This undo is not needed unless failure is due to SCAN_COPY_MC.
> - */
> - if (!is_shmem && result == SCAN_COPY_MC)
> - filemap_nr_thps_dec(mapping);
> + xas_unlock_irq(&xas);
>
> - xas_unlock_irq(&xas);
> + hpage->mapping = NULL;
>
> - hpage->mapping = NULL;
> - }
> + unlock_page(hpage);
> + mem_cgroup_uncharge(page_folio(hpage));
> + put_page(hpage);
>
> - if (hpage)
> - unlock_page(hpage);
> out:
> VM_BUG_ON(!list_empty(&pagelist));
> - if (hpage) {
> - mem_cgroup_uncharge(page_folio(hpage));
> - put_page(hpage);
> - }
> -
> trace_mm_khugepaged_collapse_file(mm, hpage, index, is_shmem, addr, file, nr, result);
> return result;
> }
> --
> 2.39.2.637.g21b0678d19-goog
>

2023-02-21 21:55:03

by Peter Xu

Subject: Re: [PATCH v4 1/3] mm/khugepaged: refactor collapse_file control flow

On Fri, Feb 17, 2023 at 05:54:37PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Add a rollback label to deal with failure, instead of continuously
> checking for RESULT_SUCCESS, to make it easier to add more failure
> cases. The refactoring also allows the collapse_file tracepoint to
> include hpage on success (instead of NULL).
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> mm/khugepaged.c | 223 ++++++++++++++++++++++++------------------------
> 1 file changed, 110 insertions(+), 113 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8dbc39896811..6a3d6d2e25e0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1885,6 +1885,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> if (result != SCAN_SUCCEED)
> goto out;
>
> + __SetPageLocked(hpage);
> + if (is_shmem)
> + __SetPageSwapBacked(hpage);
> + hpage->index = start;
> + hpage->mapping = mapping;
> +
> /*
> * Ensure we have slots for all the pages in the range. This is
> * almost certainly a no-op because most of the pages must be present
> @@ -1897,16 +1903,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> xas_unlock_irq(&xas);
> if (!xas_nomem(&xas, GFP_KERNEL)) {
> result = SCAN_FAIL;
> - goto out;
> + goto rollback;
> }
> } while (1);
>
> - __SetPageLocked(hpage);
> - if (is_shmem)
> - __SetPageSwapBacked(hpage);
> - hpage->index = start;
> - hpage->mapping = mapping;
> -
> /*
> * At this point the hpage is locked and not up-to-date.
> * It's safe to insert it into the page cache, because nobody would
> @@ -2123,131 +2123,128 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> */
> try_to_unmap_flush();
>
> - if (result == SCAN_SUCCEED) {
> - /*
> - * Replacing old pages with new one has succeeded, now we
> - * attempt to copy the contents.
> - */
> - index = start;
> - list_for_each_entry(page, &pagelist, lru) {
> - while (index < page->index) {
> - clear_highpage(hpage + (index % HPAGE_PMD_NR));
> - index++;
> - }
> - if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> - page) > 0) {
> - result = SCAN_COPY_MC;
> - break;
> - }
> - index++;
> - }
> - while (result == SCAN_SUCCEED && index < end) {
> + if (result != SCAN_SUCCEED)
> + goto rollback;
> +
> + /*
> + * Replacing old pages with new one has succeeded, now we
> + * attempt to copy the contents.
> + */
> + index = start;
> + list_for_each_entry(page, &pagelist, lru) {
> + while (index < page->index) {
> clear_highpage(hpage + (index % HPAGE_PMD_NR));
> index++;
> }
> + if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> + page) > 0) {
> + result = SCAN_COPY_MC;
> + goto rollback;
> + }
> + index++;
> + }
> + while (index < end) {
> + clear_highpage(hpage + (index % HPAGE_PMD_NR));
> + index++;
> }
>
> - if (result == SCAN_SUCCEED) {
> - /*
> - * Copying old pages to huge one has succeeded, now we
> - * need to free the old pages.
> - */
> - list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> - list_del(&page->lru);
> - page->mapping = NULL;
> - page_ref_unfreeze(page, 1);
> - ClearPageActive(page);
> - ClearPageUnevictable(page);
> - unlock_page(page);
> - put_page(page);
> - }
> + /*
> + * Copying old pages to huge one has succeeded, now we
> + * need to free the old pages.
> + */
> + list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> + list_del(&page->lru);
> + page->mapping = NULL;
> + page_ref_unfreeze(page, 1);
> + ClearPageActive(page);
> + ClearPageUnevictable(page);
> + unlock_page(page);
> + put_page(page);
> + }
>
> - xas_lock_irq(&xas);
> - if (is_shmem)
> - __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> - else
> - __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> + xas_lock_irq(&xas);
> + if (is_shmem)
> + __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> + else
> + __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> +
> + if (nr_none) {
> + __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> + /* nr_none is always 0 for non-shmem. */
> + __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> + }
> + /* Join all the small entries into a single multi-index entry. */
> + xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> + xas_store(&xas, hpage);
> + xas_unlock_irq(&xas);
>
> - if (nr_none) {
> - __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> - /* nr_none is always 0 for non-shmem. */
> - __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> - }
> - /* Join all the small entries into a single multi-index entry. */
> - xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> - xas_store(&xas, hpage);
> - xas_unlock_irq(&xas);
> + folio = page_folio(hpage);
> + folio_mark_uptodate(folio);
> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>
> - folio = page_folio(hpage);
> - folio_mark_uptodate(folio);
> - folio_ref_add(folio, HPAGE_PMD_NR - 1);
> + if (is_shmem)
> + folio_mark_dirty(folio);
> + folio_add_lru(folio);
>
> - if (is_shmem)
> - folio_mark_dirty(folio);
> - folio_add_lru(folio);
> + /*
> + * Remove pte page tables, so we can re-fault the page as huge.
> + */
> + result = retract_page_tables(mapping, start, mm, addr, hpage,
> + cc);
> + unlock_page(hpage);
> + goto out;
> +
> +rollback:
> + /* Something went wrong: roll back page cache changes */
> + xas_lock_irq(&xas);
> + if (nr_none) {
> + mapping->nrpages -= nr_none;
> + shmem_uncharge(mapping->host, nr_none);
> + }
>
> - /*
> - * Remove pte page tables, so we can re-fault the page as huge.
> - */
> - result = retract_page_tables(mapping, start, mm, addr, hpage,
> - cc);
> - unlock_page(hpage);
> - hpage = NULL;
> - } else {
> - /* Something went wrong: roll back page cache changes */
> - xas_lock_irq(&xas);
> - if (nr_none) {
> - mapping->nrpages -= nr_none;
> - shmem_uncharge(mapping->host, nr_none);
> + xas_set(&xas, start);
> + xas_for_each(&xas, page, end - 1) {
> + page = list_first_entry_or_null(&pagelist,
> + struct page, lru);
> + if (!page || xas.xa_index < page->index) {
> + if (!nr_none)
> + break;
> + nr_none--;
> + /* Put holes back where they were */
> + xas_store(&xas, NULL);
> + continue;
> }
>
> - xas_set(&xas, start);
> - xas_for_each(&xas, page, end - 1) {
> - page = list_first_entry_or_null(&pagelist,
> - struct page, lru);
> - if (!page || xas.xa_index < page->index) {
> - if (!nr_none)
> - break;
> - nr_none--;
> - /* Put holes back where they were */
> - xas_store(&xas, NULL);
> - continue;
> - }
> + VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>
> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> + /* Unfreeze the page. */
> + list_del(&page->lru);
> + page_ref_unfreeze(page, 2);
> + xas_store(&xas, page);
> + xas_pause(&xas);
> + xas_unlock_irq(&xas);
> + unlock_page(page);
> + putback_lru_page(page);
> + xas_lock_irq(&xas);
> + }
> + VM_BUG_ON(nr_none);
> + /*
> + * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> + * This undo is not needed unless failure is due to SCAN_COPY_MC.
> + */
> + if (!is_shmem && result == SCAN_COPY_MC)
> + filemap_nr_thps_dec(mapping);
>
> - /* Unfreeze the page. */
> - list_del(&page->lru);
> - page_ref_unfreeze(page, 2);
> - xas_store(&xas, page);
> - xas_pause(&xas);
> - xas_unlock_irq(&xas);
> - unlock_page(page);
> - putback_lru_page(page);
> - xas_lock_irq(&xas);
> - }
> - VM_BUG_ON(nr_none);
> - /*
> - * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> - * This undo is not needed unless failure is due to SCAN_COPY_MC.
> - */
> - if (!is_shmem && result == SCAN_COPY_MC)
> - filemap_nr_thps_dec(mapping);
> + xas_unlock_irq(&xas);
>
> - xas_unlock_irq(&xas);
> + hpage->mapping = NULL;
>
> - hpage->mapping = NULL;
> - }
> + unlock_page(hpage);
> + mem_cgroup_uncharge(page_folio(hpage));
> + put_page(hpage);
>
> - if (hpage)
> - unlock_page(hpage);
> out:
> VM_BUG_ON(!list_empty(&pagelist));
> - if (hpage) {
> - mem_cgroup_uncharge(page_folio(hpage));
> - put_page(hpage);
> - }

Moving this chunk seems wrong, as it can leak the huge page if
alloc_charge_hpage() failed on mem charging, iiuc.

And keeping it seems wrong too, because if the mem charge failed we'll
uncharge even though it was never charged. But that's a pre-existing
issue, not something introduced by this patch.

Posted a patch for this:

https://lore.kernel.org/all/[email protected]/

I _think_ this patch will make sense after rebasing to that fix if that's
correct, but please review it and double check.

Thanks,

> -
> trace_mm_khugepaged_collapse_file(mm, hpage, index, is_shmem, addr, file, nr, result);
> return result;
> }
> --
> 2.39.2.637.g21b0678d19-goog
>

--
Peter Xu


2023-02-21 22:13:45

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v4 2/3] mm/khugepaged: skip shmem with userfaultfd

On Fri, Feb 17, 2023 at 05:54:38PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
> collapsing a shmem range would result in replacing an empty page with a
> THP, to avoid breaking userfaultfd.
>
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race
> occurs. This makes such a race indistinguishable to the collapse
> occurring immediately before the userfaultfd registration.
>
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop. The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
>
> The fact that we don't fill holes during the initial iteration means
> that collapse_file now has to handle faults occurring during the
> collapse. This is done by re-validating the number of missing pages
> after acquiring the page cache lock for the final time.
>
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
>
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
>
> Signed-off-by: David Stevens <[email protected]>

It'll be great to have another pair of eyes on this, but... AFAICT this
works for us. Thanks David, this is better than what I suggested.

Acked-by: Peter Xu <[email protected]>

--
Peter Xu


2023-02-21 22:19:15

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] mm/khugepaged: maintain page cache uptodate flag

On Fri, Feb 17, 2023 at 05:54:39PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range. The present pages are already locked, so replacing them
> with the locked hpage before the collapse is finalized is unnecessary.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> a fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <[email protected]>

Per my knowledge, this one looks all correct too.

Acked-by: Peter Xu <[email protected]>

So at least to me the whole set looks mostly good, except that patch 1
needs some confirmation.

Thanks,

--
Peter Xu


2023-02-21 22:28:45

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] mm/khugepaged: refactor collapse_file control flow

On Tue, Feb 21, 2023 at 1:54 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Feb 17, 2023 at 05:54:37PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Add a rollback label to deal with failure, instead of continuously
> > checking for RESULT_SUCCESS, to make it easier to add more failure
> > cases. The refactoring also allows the collapse_file tracepoint to
> > include hpage on success (instead of NULL).
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > mm/khugepaged.c | 223 ++++++++++++++++++++++++------------------------
> > 1 file changed, 110 insertions(+), 113 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 8dbc39896811..6a3d6d2e25e0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1885,6 +1885,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > if (result != SCAN_SUCCEED)
> > goto out;
> >
> > + __SetPageLocked(hpage);
> > + if (is_shmem)
> > + __SetPageSwapBacked(hpage);
> > + hpage->index = start;
> > + hpage->mapping = mapping;
> > +
> > /*
> > * Ensure we have slots for all the pages in the range. This is
> > * almost certainly a no-op because most of the pages must be present
> > @@ -1897,16 +1903,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > xas_unlock_irq(&xas);
> > if (!xas_nomem(&xas, GFP_KERNEL)) {
> > result = SCAN_FAIL;
> > - goto out;
> > + goto rollback;
> > }
> > } while (1);
> >
> > - __SetPageLocked(hpage);
> > - if (is_shmem)
> > - __SetPageSwapBacked(hpage);
> > - hpage->index = start;
> > - hpage->mapping = mapping;
> > -
> > /*
> > * At this point the hpage is locked and not up-to-date.
> > * It's safe to insert it into the page cache, because nobody would
> > @@ -2123,131 +2123,128 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > */
> > try_to_unmap_flush();
> >
> > - if (result == SCAN_SUCCEED) {
> > - /*
> > - * Replacing old pages with new one has succeeded, now we
> > - * attempt to copy the contents.
> > - */
> > - index = start;
> > - list_for_each_entry(page, &pagelist, lru) {
> > - while (index < page->index) {
> > - clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > - index++;
> > - }
> > - if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> > - page) > 0) {
> > - result = SCAN_COPY_MC;
> > - break;
> > - }
> > - index++;
> > - }
> > - while (result == SCAN_SUCCEED && index < end) {
> > + if (result != SCAN_SUCCEED)
> > + goto rollback;
> > +
> > + /*
> > + * Replacing old pages with new one has succeeded, now we
> > + * attempt to copy the contents.
> > + */
> > + index = start;
> > + list_for_each_entry(page, &pagelist, lru) {
> > + while (index < page->index) {
> > clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > index++;
> > }
> > + if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> > + page) > 0) {
> > + result = SCAN_COPY_MC;
> > + goto rollback;
> > + }
> > + index++;
> > + }
> > + while (index < end) {
> > + clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > + index++;
> > }
> >
> > - if (result == SCAN_SUCCEED) {
> > - /*
> > - * Copying old pages to huge one has succeeded, now we
> > - * need to free the old pages.
> > - */
> > - list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > - list_del(&page->lru);
> > - page->mapping = NULL;
> > - page_ref_unfreeze(page, 1);
> > - ClearPageActive(page);
> > - ClearPageUnevictable(page);
> > - unlock_page(page);
> > - put_page(page);
> > - }
> > + /*
> > + * Copying old pages to huge one has succeeded, now we
> > + * need to free the old pages.
> > + */
> > + list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > + list_del(&page->lru);
> > + page->mapping = NULL;
> > + page_ref_unfreeze(page, 1);
> > + ClearPageActive(page);
> > + ClearPageUnevictable(page);
> > + unlock_page(page);
> > + put_page(page);
> > + }
> >
> > - xas_lock_irq(&xas);
> > - if (is_shmem)
> > - __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > - else
> > - __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > + xas_lock_irq(&xas);
> > + if (is_shmem)
> > + __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > + else
> > + __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > +
> > + if (nr_none) {
> > + __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > + /* nr_none is always 0 for non-shmem. */
> > + __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> > + }
> > + /* Join all the small entries into a single multi-index entry. */
> > + xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > + xas_store(&xas, hpage);
> > + xas_unlock_irq(&xas);
> >
> > - if (nr_none) {
> > - __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > - /* nr_none is always 0 for non-shmem. */
> > - __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> > - }
> > - /* Join all the small entries into a single multi-index entry. */
> > - xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > - xas_store(&xas, hpage);
> > - xas_unlock_irq(&xas);
> > + folio = page_folio(hpage);
> > + folio_mark_uptodate(folio);
> > + folio_ref_add(folio, HPAGE_PMD_NR - 1);
> >
> > - folio = page_folio(hpage);
> > - folio_mark_uptodate(folio);
> > - folio_ref_add(folio, HPAGE_PMD_NR - 1);
> > + if (is_shmem)
> > + folio_mark_dirty(folio);
> > + folio_add_lru(folio);
> >
> > - if (is_shmem)
> > - folio_mark_dirty(folio);
> > - folio_add_lru(folio);
> > + /*
> > + * Remove pte page tables, so we can re-fault the page as huge.
> > + */
> > + result = retract_page_tables(mapping, start, mm, addr, hpage,
> > + cc);
> > + unlock_page(hpage);
> > + goto out;
> > +
> > +rollback:
> > + /* Something went wrong: roll back page cache changes */
> > + xas_lock_irq(&xas);
> > + if (nr_none) {
> > + mapping->nrpages -= nr_none;
> > + shmem_uncharge(mapping->host, nr_none);
> > + }
> >
> > - /*
> > - * Remove pte page tables, so we can re-fault the page as huge.
> > - */
> > - result = retract_page_tables(mapping, start, mm, addr, hpage,
> > - cc);
> > - unlock_page(hpage);
> > - hpage = NULL;
> > - } else {
> > - /* Something went wrong: roll back page cache changes */
> > - xas_lock_irq(&xas);
> > - if (nr_none) {
> > - mapping->nrpages -= nr_none;
> > - shmem_uncharge(mapping->host, nr_none);
> > + xas_set(&xas, start);
> > + xas_for_each(&xas, page, end - 1) {
> > + page = list_first_entry_or_null(&pagelist,
> > + struct page, lru);
> > + if (!page || xas.xa_index < page->index) {
> > + if (!nr_none)
> > + break;
> > + nr_none--;
> > + /* Put holes back where they were */
> > + xas_store(&xas, NULL);
> > + continue;
> > }
> >
> > - xas_set(&xas, start);
> > - xas_for_each(&xas, page, end - 1) {
> > - page = list_first_entry_or_null(&pagelist,
> > - struct page, lru);
> > - if (!page || xas.xa_index < page->index) {
> > - if (!nr_none)
> > - break;
> > - nr_none--;
> > - /* Put holes back where they were */
> > - xas_store(&xas, NULL);
> > - continue;
> > - }
> > + VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> >
> > - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > + /* Unfreeze the page. */
> > + list_del(&page->lru);
> > + page_ref_unfreeze(page, 2);
> > + xas_store(&xas, page);
> > + xas_pause(&xas);
> > + xas_unlock_irq(&xas);
> > + unlock_page(page);
> > + putback_lru_page(page);
> > + xas_lock_irq(&xas);
> > + }
> > + VM_BUG_ON(nr_none);
> > + /*
> > + * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> > + * This undo is not needed unless failure is due to SCAN_COPY_MC.
> > + */
> > + if (!is_shmem && result == SCAN_COPY_MC)
> > + filemap_nr_thps_dec(mapping);
> >
> > - /* Unfreeze the page. */
> > - list_del(&page->lru);
> > - page_ref_unfreeze(page, 2);
> > - xas_store(&xas, page);
> > - xas_pause(&xas);
> > - xas_unlock_irq(&xas);
> > - unlock_page(page);
> > - putback_lru_page(page);
> > - xas_lock_irq(&xas);
> > - }
> > - VM_BUG_ON(nr_none);
> > - /*
> > - * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> > - * This undo is not needed unless failure is due to SCAN_COPY_MC.
> > - */
> > - if (!is_shmem && result == SCAN_COPY_MC)
> > - filemap_nr_thps_dec(mapping);
> > + xas_unlock_irq(&xas);
> >
> > - xas_unlock_irq(&xas);
> > + hpage->mapping = NULL;
> >
> > - hpage->mapping = NULL;
> > - }
> > + unlock_page(hpage);
> > + mem_cgroup_uncharge(page_folio(hpage));
> > + put_page(hpage);
> >
> > - if (hpage)
> > - unlock_page(hpage);
> > out:
> > VM_BUG_ON(!list_empty(&pagelist));
> > - if (hpage) {
> > - mem_cgroup_uncharge(page_folio(hpage));
> > - put_page(hpage);
> > - }
>
> Moving this chunk seems wrong, as it can leak the huge page if
> alloc_charge_hpage() failed on mem charging, iiuc.

Yeah, good catch.

>
> And keeping it seems wrong too, because if the mem charge failed we'll
> uncharge even though it was never charged. But that's a pre-existing
> issue, not something introduced by this patch.

We should be able to just check the return value. For example:

if (result == SCAN_CGROUP_CHARGE_FAIL)
put_page(hpage);

>
> Posted a patch for this:
>
> https://lore.kernel.org/all/[email protected]/
>
> I _think_ this patch will make sense after rebasing to that fix if that's
> correct, but please review it and double check.

It is ok too.


>
> Thanks,
>
> > -
> > trace_mm_khugepaged_collapse_file(mm, hpage, index, is_shmem, addr, file, nr, result);
> > return result;
> > }
> > --
> > 2.39.2.637.g21b0678d19-goog
> >
>
> --
> Peter Xu
>

2023-02-22 04:08:40

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] mm/khugepaged: refactor collapse_file control flow

On Wed, Feb 22, 2023 at 6:54 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Feb 17, 2023 at 05:54:37PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Add a rollback label to deal with failure, instead of continuously
> > checking for RESULT_SUCCESS, to make it easier to add more failure
> > cases. The refactoring also allows the collapse_file tracepoint to
> > include hpage on success (instead of NULL).
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > mm/khugepaged.c | 223 ++++++++++++++++++++++++------------------------
> > 1 file changed, 110 insertions(+), 113 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 8dbc39896811..6a3d6d2e25e0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1885,6 +1885,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > if (result != SCAN_SUCCEED)
> > goto out;
> >
> > + __SetPageLocked(hpage);
> > + if (is_shmem)
> > + __SetPageSwapBacked(hpage);
> > + hpage->index = start;
> > + hpage->mapping = mapping;
> > +
> > /*
> > * Ensure we have slots for all the pages in the range. This is
> > * almost certainly a no-op because most of the pages must be present
> > @@ -1897,16 +1903,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > xas_unlock_irq(&xas);
> > if (!xas_nomem(&xas, GFP_KERNEL)) {
> > result = SCAN_FAIL;
> > - goto out;
> > + goto rollback;
> > }
> > } while (1);
> >
> > - __SetPageLocked(hpage);
> > - if (is_shmem)
> > - __SetPageSwapBacked(hpage);
> > - hpage->index = start;
> > - hpage->mapping = mapping;
> > -
> > /*
> > * At this point the hpage is locked and not up-to-date.
> > * It's safe to insert it into the page cache, because nobody would
> > @@ -2123,131 +2123,128 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > */
> > try_to_unmap_flush();
> >
> > - if (result == SCAN_SUCCEED) {
> > - /*
> > - * Replacing old pages with new one has succeeded, now we
> > - * attempt to copy the contents.
> > - */
> > - index = start;
> > - list_for_each_entry(page, &pagelist, lru) {
> > - while (index < page->index) {
> > - clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > - index++;
> > - }
> > - if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> > - page) > 0) {
> > - result = SCAN_COPY_MC;
> > - break;
> > - }
> > - index++;
> > - }
> > - while (result == SCAN_SUCCEED && index < end) {
> > + if (result != SCAN_SUCCEED)
> > + goto rollback;
> > +
> > + /*
> > + * Replacing old pages with new one has succeeded, now we
> > + * attempt to copy the contents.
> > + */
> > + index = start;
> > + list_for_each_entry(page, &pagelist, lru) {
> > + while (index < page->index) {
> > clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > index++;
> > }
> > + if (copy_mc_page(hpage + (page->index % HPAGE_PMD_NR),
> > + page) > 0) {
> > + result = SCAN_COPY_MC;
> > + goto rollback;
> > + }
> > + index++;
> > + }
> > + while (index < end) {
> > + clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > + index++;
> > }
> >
> > - if (result == SCAN_SUCCEED) {
> > - /*
> > - * Copying old pages to huge one has succeeded, now we
> > - * need to free the old pages.
> > - */
> > - list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > - list_del(&page->lru);
> > - page->mapping = NULL;
> > - page_ref_unfreeze(page, 1);
> > - ClearPageActive(page);
> > - ClearPageUnevictable(page);
> > - unlock_page(page);
> > - put_page(page);
> > - }
> > + /*
> > + * Copying old pages to huge one has succeeded, now we
> > + * need to free the old pages.
> > + */
> > + list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > + list_del(&page->lru);
> > + page->mapping = NULL;
> > + page_ref_unfreeze(page, 1);
> > + ClearPageActive(page);
> > + ClearPageUnevictable(page);
> > + unlock_page(page);
> > + put_page(page);
> > + }
> >
> > - xas_lock_irq(&xas);
> > - if (is_shmem)
> > - __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > - else
> > - __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > + xas_lock_irq(&xas);
> > + if (is_shmem)
> > + __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > + else
> > + __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > +
> > + if (nr_none) {
> > + __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > + /* nr_none is always 0 for non-shmem. */
> > + __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> > + }
> > + /* Join all the small entries into a single multi-index entry. */
> > + xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > + xas_store(&xas, hpage);
> > + xas_unlock_irq(&xas);
> >
> > - if (nr_none) {
> > - __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > - /* nr_none is always 0 for non-shmem. */
> > - __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> > - }
> > - /* Join all the small entries into a single multi-index entry. */
> > - xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > - xas_store(&xas, hpage);
> > - xas_unlock_irq(&xas);
> > + folio = page_folio(hpage);
> > + folio_mark_uptodate(folio);
> > + folio_ref_add(folio, HPAGE_PMD_NR - 1);
> >
> > - folio = page_folio(hpage);
> > - folio_mark_uptodate(folio);
> > - folio_ref_add(folio, HPAGE_PMD_NR - 1);
> > + if (is_shmem)
> > + folio_mark_dirty(folio);
> > + folio_add_lru(folio);
> >
> > - if (is_shmem)
> > - folio_mark_dirty(folio);
> > - folio_add_lru(folio);
> > + /*
> > + * Remove pte page tables, so we can re-fault the page as huge.
> > + */
> > + result = retract_page_tables(mapping, start, mm, addr, hpage,
> > + cc);
> > + unlock_page(hpage);
> > + goto out;
> > +
> > +rollback:
> > + /* Something went wrong: roll back page cache changes */
> > + xas_lock_irq(&xas);
> > + if (nr_none) {
> > + mapping->nrpages -= nr_none;
> > + shmem_uncharge(mapping->host, nr_none);
> > + }
> >
> > - /*
> > - * Remove pte page tables, so we can re-fault the page as huge.
> > - */
> > - result = retract_page_tables(mapping, start, mm, addr, hpage,
> > - cc);
> > - unlock_page(hpage);
> > - hpage = NULL;
> > - } else {
> > - /* Something went wrong: roll back page cache changes */
> > - xas_lock_irq(&xas);
> > - if (nr_none) {
> > - mapping->nrpages -= nr_none;
> > - shmem_uncharge(mapping->host, nr_none);
> > + xas_set(&xas, start);
> > + xas_for_each(&xas, page, end - 1) {
> > + page = list_first_entry_or_null(&pagelist,
> > + struct page, lru);
> > + if (!page || xas.xa_index < page->index) {
> > + if (!nr_none)
> > + break;
> > + nr_none--;
> > + /* Put holes back where they were */
> > + xas_store(&xas, NULL);
> > + continue;
> > }
> >
> > - xas_set(&xas, start);
> > - xas_for_each(&xas, page, end - 1) {
> > - page = list_first_entry_or_null(&pagelist,
> > - struct page, lru);
> > - if (!page || xas.xa_index < page->index) {
> > - if (!nr_none)
> > - break;
> > - nr_none--;
> > - /* Put holes back where they were */
> > - xas_store(&xas, NULL);
> > - continue;
> > - }
> > + VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> >
> > - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > + /* Unfreeze the page. */
> > + list_del(&page->lru);
> > + page_ref_unfreeze(page, 2);
> > + xas_store(&xas, page);
> > + xas_pause(&xas);
> > + xas_unlock_irq(&xas);
> > + unlock_page(page);
> > + putback_lru_page(page);
> > + xas_lock_irq(&xas);
> > + }
> > + VM_BUG_ON(nr_none);
> > + /*
> > + * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> > + * This undo is not needed unless failure is due to SCAN_COPY_MC.
> > + */
> > + if (!is_shmem && result == SCAN_COPY_MC)
> > + filemap_nr_thps_dec(mapping);
> >
> > - /* Unfreeze the page. */
> > - list_del(&page->lru);
> > - page_ref_unfreeze(page, 2);
> > - xas_store(&xas, page);
> > - xas_pause(&xas);
> > - xas_unlock_irq(&xas);
> > - unlock_page(page);
> > - putback_lru_page(page);
> > - xas_lock_irq(&xas);
> > - }
> > - VM_BUG_ON(nr_none);
> > - /*
> > - * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> > - * This undo is not needed unless failure is due to SCAN_COPY_MC.
> > - */
> > - if (!is_shmem && result == SCAN_COPY_MC)
> > - filemap_nr_thps_dec(mapping);
> > + xas_unlock_irq(&xas);
> >
> > - xas_unlock_irq(&xas);
> > + hpage->mapping = NULL;
> >
> > - hpage->mapping = NULL;
> > - }
> > + unlock_page(hpage);
> > + mem_cgroup_uncharge(page_folio(hpage));
> > + put_page(hpage);
> >
> > - if (hpage)
> > - unlock_page(hpage);
> > out:
> > VM_BUG_ON(!list_empty(&pagelist));
> > - if (hpage) {
> > - mem_cgroup_uncharge(page_folio(hpage));
> > - put_page(hpage);
> > - }
>
> Moving this chunk seems wrong, as it can leak the huge page if
> alloc_charge_hpage() failed on mem charging, iiuc.
>
> And keeping it seems wrong too, because if the mem charge failed we'll
> uncharge even though it was never charged. But that's a pre-existing
> issue, not something introduced by this patch.
>
> Posted a patch for this:
>
> https://lore.kernel.org/all/[email protected]/
>
> I _think_ this patch will make sense after rebasing to that fix if that's
> correct, but please review it and double check.
>

Ah, good catch. I didn't notice that alloc_charge_hpage could allocate
*hpage even on failure. The failure path should work properly with
your fix.

-David

2023-02-22 16:25:55

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] mm/khugepaged: refactor collapse_file control flow

On Wed, Feb 22, 2023 at 01:08:19PM +0900, David Stevens wrote:
> > > out:
> > > VM_BUG_ON(!list_empty(&pagelist));
> > > - if (hpage) {
> > > - mem_cgroup_uncharge(page_folio(hpage));
> > > - put_page(hpage);
> > > - }
> >
> > Moving this chunk seems wrong, as it can leak the huge page if
> > alloc_charge_hpage() failed on mem charging, iiuc.
> >
> > And I found that keeping it seems wrong either, because if mem charge
> > failed we'll uncharge even without charging it before. But that's nothing
> > about this patch alone.
> >
> > Posted a patch for this:
> >
> > https://lore.kernel.org/all/[email protected]/

[1]

> >
> > I _think_ this patch will make sense after rebasing to that fix if that's
> > correct, but please review it and double check.
> >
>
> Ah, good catch. I didn't notice that alloc_charge_hpage could allocate
> *hpage even on failure. The failure path should work properly with
> your fix.

Thanks for confirming.

If we can merge above [1] before this patch, then I think this patch is
correct. Only if so:

Acked-by: Peter Xu <[email protected]>

--
Peter Xu


2023-03-03 15:36:52

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

On Fri, Feb 17, 2023 at 05:54:36PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Fix two races in khugepaged+shmem that cause issues with userfaultfd and
> lseek, respectively.
>
> v3 -> v4:
> - Base changes on mm-everything (fba720cb4dc0)
> - Add patch to refactor error handling control flow in collapse_file
> - Rebase userfaultfd patch with no significant logic changes
> - Different approach for fixing lseek race

I just noticed this series hasn't landed in mm-unstable yet, and I just
posted a trivial cleanup that conflicts with it, so it won't apply cleanly:

https://lore.kernel.org/r/[email protected]

The resolution will be fairly straightforward, though, and I'm happy to
rebase that cleanup on top of this series, since this one targets a real
bug and should have higher priority.

I guess it's possible Andrew thought the series still had unsettled
comments and skipped the whole set when picking patches up. A repost
could clarify that.

Again, it'd always be great to get another pair of eyes on this slightly
involved series. Matthew / Yang were already on the list; also copying
Zach for his recent work on khugepaged in case he spots anything wrong.

Thanks,

--
Peter Xu


2023-03-03 15:46:07

by Zach O'Keefe

Subject: Re: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

On Mar 03 10:35, Peter Xu wrote:
> On Fri, Feb 17, 2023 at 05:54:36PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Fix two races in khugepaged+shmem that cause issues with userfaultfd and
> > lseek, respectively.
> >
> > v3 -> v4:
> > - Base changes on mm-everything (fba720cb4dc0)
> > - Add patch to refactor error handling control flow in collapse_file
> > - Rebase userfaultfd patch with no significant logic changes
> > - Different approach for fixing lseek race
>
> I just noticed this one hasn't landed in mm-unstable yet, and I just posted
> a trivial cleanup that can conflict with this, so it won't apply cleanly...
>
> https://lore.kernel.org/r/[email protected]
>
> The resolution will be fairly straightforward though, and I'm happy to
> rebase that one onto this, since this targets a real bug and so should have
> higher priority.
>
> I guess it's possible Andrew thought the series had unsettled comments, so
> he may simply have skipped the whole set when picking patches up. A repost
> could clarify that.
>
> Again, it's always great to get another eye on this slightly involved
> series. Matthew / Yang were already on the list; also copying Zach, for his
> recent work on khugepaged, in case he spots anything wrong.

Thanks Peter. My email filters need some tinkering, since I missed this until
yesterday, when I came across it by chance. Thanks for Cc'ing me. I'm taking a
look now, and will likewise take a look if it's reposted.

Best,
Zach

>
> Thanks,
>
> --
> Peter Xu
>

2023-03-03 18:55:31

by Yang Shi

Subject: Re: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

On Fri, Mar 3, 2023 at 7:35 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Feb 17, 2023 at 05:54:36PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Fix two races in khugepaged+shmem that cause issues with userfaultfd and
> > lseek, respectively.
> >
> > v3 -> v4:
> > - Base changes on mm-everything (fba720cb4dc0)
> > - Add patch to refactor error handling control flow in collapse_file
> > - Rebase userfaultfd patch with no significant logic changes
> > - Different approach for fixing lseek race
>
> I just noticed this one hasn't landed in mm-unstable yet, and I just posted
> a trivial cleanup that can conflict with this, so it won't apply cleanly...
>
> https://lore.kernel.org/r/[email protected]
>
> The resolution will be fairly straightforward though, and I'm happy to
> rebase that one onto this, since this targets a real bug and so should have
> higher priority.
>
> I guess it's possible Andrew thought the series had unsettled comments, so
> he may simply have skipped the whole set when picking patches up. A repost
> could clarify that.

IIUC this series would need to be rebased on top of your cleanup patch?

>
> Again, it's always great to get another eye on this slightly involved
> series. Matthew / Yang were already on the list; also copying Zach, for his
> recent work on khugepaged, in case he spots anything wrong.

I had a quick look at it before; I don't recall spotting anything wrong on
the khugepaged side, but I'm not familiar with userfaultfd.

>
> Thanks,
>
> --
> Peter Xu
>

2023-03-03 22:54:02

by Andrew Morton

Subject: Re: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

On Fri, 3 Mar 2023 10:35:53 -0500 Peter Xu <[email protected]> wrote:

> On Fri, Feb 17, 2023 at 05:54:36PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Fix two races in khugepaged+shmem that cause issues with userfaultfd and
> > lseek, respectively.
> >
> > v3 -> v4:
> > - Base changes on mm-everything (fba720cb4dc0)
> > - Add patch to refactor error handling control flow in collapse_file
> > - Rebase userfaultfd patch with no significant logic changes
> > - Different approach for fixing lseek race
>
> I just noticed this one hasn't landed in mm-unstable yet, and I just posted
> a trivial cleanup that can conflict with this, so it won't apply cleanly...
>
> https://lore.kernel.org/r/[email protected]
>
> The resolution will be fairly straightforward though, and I'm happy to
> rebase that one onto this, since this targets a real bug and so should have
> higher priority.

Even without the above patch ("mm/khugepaged: Cleanup memcg uncharge
for failure path") I'm seeing a big reject in khugepaged.c. Might be
easily fixed, didn't look.

> I guess it's possible Andrew thought the series had unsettled comments, so
> he may simply have skipped the whole set when picking patches up. A repost
> could clarify that.

Yes please. Let's gather the acks thus far, rebase on
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm's mm-unstable
branch, and resend?

> Again, it's always great to get another eye on this slightly involved
> series. Matthew / Yang were already on the list; also copying Zach, for his
> recent work on khugepaged, in case he spots anything wrong.


2023-03-06 02:46:03

by David Stevens

Subject: Re: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

On Sat, Mar 4, 2023 at 7:53 AM Andrew Morton <[email protected]> wrote:
>
> On Fri, 3 Mar 2023 10:35:53 -0500 Peter Xu <[email protected]> wrote:
>
> > On Fri, Feb 17, 2023 at 05:54:36PM +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Fix two races in khugepaged+shmem that cause issues with userfaultfd and
> > > lseek, respectively.
> > >
> > > v3 -> v4:
> > > - Base changes on mm-everything (fba720cb4dc0)
> > > - Add patch to refactor error handling control flow in collapse_file
> > > - Rebase userfaultfd patch with no significant logic changes
> > > - Different approach for fixing lseek race
> >
> > I just noticed this one hasn't landed in mm-unstable yet, and I just posted
> > a trivial cleanup that can conflict with this, so it won't apply cleanly...
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > The resolution will be fairly straightforward though, and I'm happy to
> > rebase that one onto this, since this targets a real bug and so should have
> > higher priority.
>
> Even without the above patch ("mm/khugepaged: Cleanup memcg uncharge
> for failure path") I'm seeing a big reject in khugepaged.c. Might be
> easily fixed, didn't look.
>
> > I guess it's possible Andrew thought the series had unsettled comments, so
> > he may simply have skipped the whole set when picking patches up. A repost
> > could clarify that.
>
> > Yes please. Let's gather the acks thus far, rebase on
> > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm's mm-unstable
> > branch, and resend?

This conflicts pretty heavily with the "Memory poison recovery in
khugepaged collapsing" series. This series was written on top of v9 of
that series, but it looks like v9 of that series was dropped and is
being replaced with v10. Which series should go in first? If we're
confident that v10 of that series won't also be dropped, then rebasing
this series onto v10 of that series should be pretty easy. Otherwise
we could try reworking things to minimize conflicts between the two
series (create a 0th refactoring series?). Andrew, what course would
you prefer?

-David

> > Again, it's always great to get another eye on this slightly involved
> > series. Matthew / Yang were already on the list; also copying Zach, for his
> > recent work on khugepaged, in case he spots anything wrong.
>

2023-03-06 21:25:28

by Andrew Morton

Subject: Re: [PATCH v4 0/3] mm/khugepaged: fix khugepaged+shmem races

On Mon, 6 Mar 2023 11:44:59 +0900 David Stevens <[email protected]> wrote:

> > Yes please. Let's gather the acks thus far, rebase on
> > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm's mm-unstable
> > branch, and resend?
>
> This conflicts pretty heavily with the "Memory poison recovery in
> khugepaged collapsing" series. This series was written on top of v9 of
> that series, but it looks like v9 of that series was dropped and is
> being replaced with v10. Which series should go in first? If we're
> confident that v10 of that series won't also be dropped, then rebasing
> this series onto v10 of that series should be pretty easy. Otherwise
> we could try reworking things to minimize conflicts between the two
> series (create a 0th refactoring series?). Andrew, what course would
> you prefer?

I've merged v10 of "Memory poison recovery in khugepaged collapsing" so
let's base on that please?

"Memory poison recovery in khugepaged collapsing" shows no sign of
review, which is worrisome for a series at version 10. Hopefully
people will step up soon. So there's presently a risk that "Memory
poison recovery in khugepaged collapsing" will go away again, so don't
lose the pre-rebase series :(