2023-04-04 12:11:48

by David Stevens

Subject: [PATCH v6 0/4] mm/khugepaged: fixes for khugepaged+shmem

From: David Stevens <[email protected]>

This series reworks collapse_file so that the intermediate state of the
collapse does not leak out of collapse_file. Although this makes
collapse_file a bit more complicated, it means that the rest of the
kernel doesn't have to deal with the unusual state. This directly fixes
races with both lseek and mincore.

This series also fixes the fact that khugepaged completely breaks
userfaultfd+shmem. The rework of collapse_file provides a convenient
place to check for registered userfaultfds without making the shmem
userfaultfd implementation care about khugepaged.

Finally, this series adds a lru_add_drain after swapping in shmem pages,
which makes the subsequent folio_isolate_lru significantly more likely
to succeed.

v5 -> v6:
- Stop freezing the old pages so that we don't deadlock with
mc_handle_file_pte and mincore.
- Add missing locking around shmem charge rollback.
- Rebase on mm-unstable (f01f73d64cb5). Beyond straightforward
conflicts, this involves adapting the fix for f520a742287e (i.e. an
unhandled ENOMEM).
- Fix bug with bounds used with vma_interval_tree_foreach.
- Add a patch doing lru_add_drain after swapping in the shmem case.
- Update/clarify some comments.
- Drop ack on final patch
v4 -> v5:
- Rebase on mm-unstable (9caa15b8a499)
- Gather acks
v3 -> v4:
- Base changes on mm-everything (fba720cb4dc0)
- Add patch to refactor error handling control flow in collapse_file
- Rebase userfaultfd patch with no significant logic changes
- Different approach for fixing lseek race
v2 -> v3:
- Use XA_RETRY_ENTRY to synchronize with reads from the page cache
under the RCU read lock in userfaultfd fix
- Add patch to fix lseek race
v1 -> v2:
- Different approach for userfaultfd fix


David Stevens (4):
mm/khugepaged: drain lru after swapping in shmem
mm/khugepaged: refactor collapse_file control flow
mm/khugepaged: skip shmem with userfaultfd
mm/khugepaged: maintain page cache uptodate flag

include/trace/events/huge_memory.h | 3 +-
mm/khugepaged.c | 312 ++++++++++++++++-------------
2 files changed, 171 insertions(+), 144 deletions(-)

--
2.40.0.348.gf938b09366-goog


2023-04-04 12:11:53

by David Stevens

Subject: [PATCH v6 1/4] mm/khugepaged: drain lru after swapping in shmem

From: David Stevens <[email protected]>

Call lru_add_drain after swapping in shmem pages so that
isolate_lru_page is more likely to succeed.

Signed-off-by: David Stevens <[email protected]>
---
mm/khugepaged.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 666d2c4e38dd..90577247cfaf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1963,6 +1963,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
result = SCAN_FAIL;
goto xa_unlocked;
}
+ /* drain pagevecs to help isolate_lru_page() */
+ lru_add_drain();
page = folio_file_page(folio, index);
} else if (trylock_page(page)) {
get_page(page);
--
2.40.0.348.gf938b09366-goog

2023-04-04 12:13:08

by David Stevens

Subject: [PATCH v6 3/4] mm/khugepaged: skip shmem with userfaultfd

From: David Stevens <[email protected]>

Make sure that collapse_file respects any userfaultfds registered with
MODE_MISSING. If userspace has any such userfaultfds registered, then
for any page which it knows to be missing, it may expect a
UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
collapsing a shmem range would result in replacing an empty page with a
THP, to avoid breaking userfaultfd.

Synchronization when checking for userfaultfds in collapse_file is
tricky because the mmap locks can't be used to prevent races with the
registration of new userfaultfds. Instead, we provide synchronization by
ensuring that userspace cannot observe the fact that pages are missing
before we check for userfaultfds. Although this allows registration of a
userfaultfd to race with collapse_file, it ensures that userspace cannot
observe any page transitioning from missing to present after such a race
occurs. This makes such a race indistinguishable from the collapse
occurring immediately before the userfaultfd registration.

The first step to provide this synchronization is to stop filling gaps
during the loop iterating over the target range, since the page cache
lock can be dropped during that loop. The second step is to fill the
gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
time, to avoid races with accesses to the page cache that only take the
RCU read lock.

The fact that we don't fill holes during the initial iteration means
that collapse_file now has to handle faults occurring during the
collapse. This is done by re-validating the number of missing pages
after acquiring the page cache lock for the final time.

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
return EBUSY if there are any missing pages (instead of succeeding on
shmem and returning EINVAL on anonymous memory). There is also now a
window during MADV_COLLAPSE where a fault on a missing page will cause
the syscall to fail with EAGAIN.

The fact that intermediate page cache state can no longer be observed
before the rollback of a failed collapse is also technically a
userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
is exceedingly unlikely that anything relies on being able to observe
that transient state.

Signed-off-by: David Stevens <[email protected]>
Acked-by: Peter Xu <[email protected]>
---
include/trace/events/huge_memory.h | 3 +-
mm/khugepaged.c | 109 +++++++++++++++++++++--------
2 files changed, 81 insertions(+), 31 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index eca4c6f3625e..877cbf9fd2ec 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -38,7 +38,8 @@
EM( SCAN_TRUNCATED, "truncated") \
EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \
EM( SCAN_STORE_FAILED, "store_failed") \
- EMe(SCAN_COPY_MC, "copy_poisoned_page")
+ EM( SCAN_COPY_MC, "copy_poisoned_page") \
+ EMe(SCAN_PAGE_FILLED, "page_filled") \

#undef EM
#undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 90828272a065..7679551e9540 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -57,6 +57,7 @@ enum scan_result {
SCAN_PAGE_HAS_PRIVATE,
SCAN_COPY_MC,
SCAN_STORE_FAILED,
+ SCAN_PAGE_FILLED,
};

#define CREATE_TRACE_POINTS
@@ -1856,8 +1857,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
* - allocate and lock a new huge page;
* - scan page cache replacing old pages with the new one
* + swap/gup in pages if necessary;
- * + fill in gaps;
* + keep old pages around in case rollback is required;
+ * - finalize updates to the page cache;
* - if replacing succeeds:
* + copy data over;
* + free old pages;
@@ -1935,22 +1936,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
result = SCAN_TRUNCATED;
goto xa_locked;
}
- xas_set(&xas, index);
+ xas_set(&xas, index + 1);
}
if (!shmem_charge(mapping->host, 1)) {
result = SCAN_FAIL;
goto xa_locked;
}
- xas_store(&xas, hpage);
- if (xas_error(&xas)) {
- /* revert shmem_charge performed
- * in the previous condition
- */
- mapping->nrpages--;
- shmem_uncharge(mapping->host, 1);
- result = SCAN_STORE_FAILED;
- goto xa_locked;
- }
nr_none++;
continue;
}
@@ -2161,22 +2152,66 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
index++;
}

- /*
- * Copying old pages to huge one has succeeded, now we
- * need to free the old pages.
- */
- list_for_each_entry_safe(page, tmp, &pagelist, lru) {
- list_del(&page->lru);
- page->mapping = NULL;
- page_ref_unfreeze(page, 1);
- ClearPageActive(page);
- ClearPageUnevictable(page);
- unlock_page(page);
- put_page(page);
+ if (nr_none) {
+ struct vm_area_struct *vma;
+ int nr_none_check = 0;
+
+ i_mmap_lock_read(mapping);
+ xas_lock_irq(&xas);
+
+ xas_set(&xas, start);
+ for (index = start; index < end; index++) {
+ if (!xas_next(&xas)) {
+ xas_store(&xas, XA_RETRY_ENTRY);
+ if (xas_error(&xas)) {
+ result = SCAN_STORE_FAILED;
+ goto immap_locked;
+ }
+ nr_none_check++;
+ }
+ }
+
+ if (nr_none != nr_none_check) {
+ result = SCAN_PAGE_FILLED;
+ goto immap_locked;
+ }
+
+ /*
+ * If userspace observed a missing page in a VMA with a MODE_MISSING
+ * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for that
+ * page. If so, we need to roll back to avoid suppressing such an
+ * event. Since wp/minor userfaultfds don't give userspace any
+ * guarantees that the kernel doesn't fill a missing page with a zero
+ * page, they don't matter here.
+ *
+ * Any userfaultfds registered after this point will not be able to
+ * observe any missing pages due to the previously inserted retry
+ * entries.
+ */
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, start, end) {
+ if (userfaultfd_missing(vma)) {
+ result = SCAN_EXCEED_NONE_PTE;
+ goto immap_locked;
+ }
+ }
+
+immap_locked:
+ i_mmap_unlock_read(mapping);
+ if (result != SCAN_SUCCEED) {
+ xas_set(&xas, start);
+ for (index = start; index < end; index++) {
+ if (xas_next(&xas) == XA_RETRY_ENTRY)
+ xas_store(&xas, NULL);
+ }
+
+ xas_unlock_irq(&xas);
+ goto rollback;
+ }
+ } else {
+ xas_lock_irq(&xas);
}

nr = thp_nr_pages(hpage);
- xas_lock_irq(&xas);
if (is_shmem)
__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
else
@@ -2206,6 +2241,20 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
result = retract_page_tables(mapping, start, mm, addr, hpage,
cc);
unlock_page(hpage);
+
+ /*
+ * The collapse has succeeded, so free the old pages.
+ */
+ list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+ list_del(&page->lru);
+ page->mapping = NULL;
+ page_ref_unfreeze(page, 1);
+ ClearPageActive(page);
+ ClearPageUnevictable(page);
+ unlock_page(page);
+ put_page(page);
+ }
+
goto out;

rollback:
@@ -2217,15 +2266,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
}

xas_set(&xas, start);
- xas_for_each(&xas, page, end - 1) {
+ end = index;
+ for (index = start; index < end; index++) {
+ xas_next(&xas);
page = list_first_entry_or_null(&pagelist,
struct page, lru);
if (!page || xas.xa_index < page->index) {
- if (!nr_none)
- break;
nr_none--;
- /* Put holes back where they were */
- xas_store(&xas, NULL);
continue;
}

@@ -2749,12 +2796,14 @@ static int madvise_collapse_errno(enum scan_result r)
case SCAN_ALLOC_HUGE_PAGE_FAIL:
return -ENOMEM;
case SCAN_CGROUP_CHARGE_FAIL:
+ case SCAN_EXCEED_NONE_PTE:
return -EBUSY;
/* Resource temporary unavailable - trying again might succeed */
case SCAN_PAGE_COUNT:
case SCAN_PAGE_LOCK:
case SCAN_PAGE_LRU:
case SCAN_DEL_PAGE_LRU:
+ case SCAN_PAGE_FILLED:
return -EAGAIN;
/*
* Other: Trying again likely not to succeed / error intrinsic to
--
2.40.0.348.gf938b09366-goog

2023-04-04 12:13:26

by David Stevens

Subject: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

From: David Stevens <[email protected]>

Make sure that collapse_file doesn't interfere with checking the
uptodate flag in the page cache by only inserting hpage into the page
cache after it has been updated and marked uptodate. This is achieved by
simply not replacing present pages with hpage when iterating over the
target range.

The present pages are already locked, so replacing them with the locked
hpage before the collapse is finalized is unnecessary. However, it is
necessary to stop freezing the present pages after validating them,
since leaving long-term frozen pages in the page cache can lead to
deadlocks. Simply checking the reference count is sufficient to ensure
that there are no long-term references hanging around that the
collapse would break. Similar to hpage, there is no reason that the
present pages actually need to be frozen in addition to being locked.

This fixes a race where folio_seek_hole_data would mistake hpage for
a fallocated but unwritten page. This race is visible to userspace via
data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
a similar race where pages could temporarily disappear from mincore.

Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: David Stevens <[email protected]>
---
mm/khugepaged.c | 79 ++++++++++++++++++-------------------------------
1 file changed, 29 insertions(+), 50 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7679551e9540..a19aa140fd52 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
*
* Basic scheme is simple, details are more complex:
* - allocate and lock a new huge page;
- * - scan page cache replacing old pages with the new one
+ * - scan page cache, locking old pages
* + swap/gup in pages if necessary;
- * + keep old pages around in case rollback is required;
+ * - copy data to new page
+ * - handle shmem holes
+ * + re-validate that holes weren't filled by someone else
+ * + check for userfaultfd
* - finalize updates to the page cache;
* - if replacing succeeds:
- * + copy data over;
- * + free old pages;
* + unlock huge page;
+ * + free old pages;
* - if replacing failed;
- * + put all pages back and unfreeze them;
- * + restore gaps in the page cache;
+ * + unlock old pages
* + unlock and free huge page;
*/
static int collapse_file(struct mm_struct *mm, unsigned long addr,
@@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
}
} while (1);

- /*
- * At this point the hpage is locked and not up-to-date.
- * It's safe to insert it into the page cache, because nobody would
- * be able to map it or use it in another way until we unlock it.
- */
-
xas_set(&xas, start);
for (index = start; index < end; index++) {
page = xas_next(&xas);
@@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
VM_BUG_ON_PAGE(page != xas_load(&xas), page);

/*
- * The page is expected to have page_count() == 3:
+ * We control three references to the page:
* - we hold a pin on it;
* - one reference from page cache;
* - one from isolate_lru_page;
+ * If those are the only references, then any new usage of the
+ * page will have to fetch it from the page cache. That requires
+ * locking the page to handle truncate, so any new usage will be
+ * blocked until we unlock page after collapse/during rollback.
*/
- if (!page_ref_freeze(page, 3)) {
+ if (page_count(page) != 3) {
result = SCAN_PAGE_COUNT;
xas_unlock_irq(&xas);
putback_lru_page(page);
@@ -2089,13 +2088,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
}

/*
- * Add the page to the list to be able to undo the collapse if
- * something go wrong.
+ * Accumulate the pages that are being collapsed.
*/
list_add_tail(&page->lru, &pagelist);
-
- /* Finally, replace with the new page. */
- xas_store(&xas, hpage);
continue;
out_unlock:
unlock_page(page);
@@ -2132,8 +2127,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
goto rollback;

/*
- * Replacing old pages with new one has succeeded, now we
- * attempt to copy the contents.
+ * The old pages are locked, so they won't change anymore.
*/
index = start;
list_for_each_entry(page, &pagelist, lru) {
@@ -2222,11 +2216,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
/* nr_none is always 0 for non-shmem. */
__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
}
- /* Join all the small entries into a single multi-index entry. */
- xas_set_order(&xas, start, HPAGE_PMD_ORDER);
- xas_store(&xas, hpage);
- xas_unlock_irq(&xas);

+ /*
+ * Mark hpage as uptodate before inserting it into the page cache so
+ * that it isn't mistaken for a fallocated but unwritten page.
+ */
folio = page_folio(hpage);
folio_mark_uptodate(folio);
folio_ref_add(folio, HPAGE_PMD_NR - 1);
@@ -2235,6 +2229,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
folio_mark_dirty(folio);
folio_add_lru(folio);

+ /* Join all the small entries into a single multi-index entry. */
+ xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+ xas_store(&xas, hpage);
+ xas_unlock_irq(&xas);
+
/*
* Remove pte page tables, so we can re-fault the page as huge.
*/
@@ -2248,47 +2247,29 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
list_for_each_entry_safe(page, tmp, &pagelist, lru) {
list_del(&page->lru);
page->mapping = NULL;
- page_ref_unfreeze(page, 1);
ClearPageActive(page);
ClearPageUnevictable(page);
unlock_page(page);
- put_page(page);
+ folio_put_refs(page_folio(page), 3);
}

goto out;

rollback:
/* Something went wrong: roll back page cache changes */
- xas_lock_irq(&xas);
if (nr_none) {
+ xas_lock_irq(&xas);
mapping->nrpages -= nr_none;
shmem_uncharge(mapping->host, nr_none);
+ xas_unlock_irq(&xas);
}

- xas_set(&xas, start);
- end = index;
- for (index = start; index < end; index++) {
- xas_next(&xas);
- page = list_first_entry_or_null(&pagelist,
- struct page, lru);
- if (!page || xas.xa_index < page->index) {
- nr_none--;
- continue;
- }
-
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
-
- /* Unfreeze the page. */
+ list_for_each_entry_safe(page, tmp, &pagelist, lru) {
list_del(&page->lru);
- page_ref_unfreeze(page, 2);
- xas_store(&xas, page);
- xas_pause(&xas);
- xas_unlock_irq(&xas);
unlock_page(page);
putback_lru_page(page);
- xas_lock_irq(&xas);
+ put_page(page);
}
- VM_BUG_ON(nr_none);
/*
* Undo the updates of filemap_nr_thps_inc for non-SHMEM
* file only. This undo is not needed unless failure is
@@ -2303,8 +2284,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
smp_mb();
}

- xas_unlock_irq(&xas);
-
hpage->mapping = NULL;

unlock_page(hpage);
--
2.40.0.348.gf938b09366-goog

2023-04-04 21:23:48

by Peter Xu

Subject: Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> a fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <[email protected]>
> ---
> mm/khugepaged.c | 79 ++++++++++++++++++-------------------------------
> 1 file changed, 29 insertions(+), 50 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7679551e9540..a19aa140fd52 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> *
> * Basic scheme is simple, details are more complex:
> * - allocate and lock a new huge page;
> - * - scan page cache replacing old pages with the new one
> + * - scan page cache, locking old pages
> * + swap/gup in pages if necessary;
> - * + keep old pages around in case rollback is required;
> + * - copy data to new page
> + * - handle shmem holes
> + * + re-validate that holes weren't filled by someone else
> + * + check for userfaultfd

PS: some of the changes may belong to previous patch here, but not
necessary to repost only for this, just in case there'll be a new one.

> * - finalize updates to the page cache;
> * - if replacing succeeds:
> - * + copy data over;
> - * + free old pages;
> * + unlock huge page;
> + * + free old pages;
> * - if replacing failed;
> - * + put all pages back and unfreeze them;
> - * + restore gaps in the page cache;
> + * + unlock old pages
> * + unlock and free huge page;
> */
> static int collapse_file(struct mm_struct *mm, unsigned long addr,
> @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> }
> } while (1);
>
> - /*
> - * At this point the hpage is locked and not up-to-date.
> - * It's safe to insert it into the page cache, because nobody would
> - * be able to map it or use it in another way until we unlock it.
> - */
> -
> xas_set(&xas, start);
> for (index = start; index < end; index++) {
> page = xas_next(&xas);
> @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> VM_BUG_ON_PAGE(page != xas_load(&xas), page);
>
> /*
> - * The page is expected to have page_count() == 3:
> + * We control three references to the page:
> * - we hold a pin on it;
> * - one reference from page cache;
> * - one from isolate_lru_page;
> + * If those are the only references, then any new usage of the
> + * page will have to fetch it from the page cache. That requires
> + * locking the page to handle truncate, so any new usage will be
> + * blocked until we unlock page after collapse/during rollback.
> */
> - if (!page_ref_freeze(page, 3)) {
> + if (page_count(page) != 3) {
> result = SCAN_PAGE_COUNT;
> xas_unlock_irq(&xas);
> putback_lru_page(page);

Personally I don't see anything wrong with this change to resolve the dead
lock. E.g. fast gup race right before unmapping the pgtables seems fine,
since we'll just bail out with >3 refcounts (or fast-gup bails out by
checking pte changes). Either way looks fine here.

So far it looks good to me, but that may not mean much per the history on
what I can overlook. It'll be always good to hear from Hugh and others.

--
Peter Xu

2023-04-19 04:50:43

by Hugh Dickins

Subject: Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

On Tue, 4 Apr 2023, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range.
> >
> > The present pages are already locked, so replacing them with the locked
> > hpage before the collapse is finalized is unnecessary. However, it is
> > necessary to stop freezing the present pages after validating them,
> > since leaving long-term frozen pages in the page cache can lead to
> > deadlocks. Simply checking the reference count is sufficient to ensure
> > that there are no long-term references hanging around that the
> > collapse would break. Similar to hpage, there is no reason that the
> > present pages actually need to be frozen in addition to being locked.
> >
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > a fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> > a similar race where pages could temporarily disappear from mincore.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens <[email protected]>
...
>
> Personally I don't see anything wrong with this change to resolve the dead
> lock. E.g. fast gup race right before unmapping the pgtables seems fine,
> since we'll just bail out with >3 refcounts (or fast-gup bails out by
> checking pte changes). Either way looks fine here.
>
> So far it looks good to me, but that may not mean much per the history on
> what I can overlook. It'll be always good to hear from Hugh and others.

I'm uneasy about it, and haven't let it sink in for long enough: but
haven't spotted anything wrong with it, nor experienced any trouble.

I would have much preferred David to stick with the current scheme, and
fix up seek_hole_data, and be less concerned with the mincore transients:
this patch makes a significant change that is difficult to be sure of.

I was dubious about the unfrozen "page_count(page) != 3" check (where
another task can grab a reference an instant later), but perhaps it
does serve a purpose, since we hold the page lock there: excludes
concurrent shmem reads which grab but drop page lock before copying
(though it's not clear that those do actually need excluding).

I had thought shmem was peculiar in relying on page lock while writing,
but turn out to be quite wrong about that: most filesystems rely on
page lock while writing, though I'm not sure whether that's true of
all (and it doesn't matter while collapse of non-shmem file is only
permitted on read-only).

We shall see.

Hugh

2023-06-20 21:40:21

by Andres Freund

Subject: Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

Hi,

On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> a fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <[email protected]>

I noticed that recently MADV_COLLAPSE stopped being able to collapse a
binary's executable code, always failing with EAGAIN. I bisected it down to
a2e17cc2efc7 - this commit.

Using perf trace -e 'huge_memory:*' -a I see

1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)

for every attempt at doing madvise(MADV_COLLAPSE).


I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
using huge pages for executable code that wasn't entirely completely gross.


I don't yet have a standalone repro, but can write one if that's helpful.

Greetings,

Andres Freund

2023-06-20 21:55:58

by Peter Xu

Subject: Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote:
> Hi,

Hi, Andres,

>
> On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range.
> >
> > The present pages are already locked, so replacing them with the locked
> > hpage before the collapse is finalized is unnecessary. However, it is
> > necessary to stop freezing the present pages after validating them,
> > since leaving long-term frozen pages in the page cache can lead to
> > deadlocks. Simply checking the reference count is sufficient to ensure
> > that there are no long-term references hanging around that the
> > collapse would break. Similar to hpage, there is no reason that the
> > present pages actually need to be frozen in addition to being locked.
> >
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > a fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> > a similar race where pages could temporarily disappear from mincore.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens <[email protected]>
>
> I noticed that recently MADV_COLLAPSE stopped being able to collapse a
> binary's executable code, always failing with EAGAIN. I bisected it down to
> a2e17cc2efc7 - this commit.
>
> Using perf trace -e 'huge_memory:*' -a I see
>
> 1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
> 1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> 1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
> 1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> 1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
> 1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> 1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
> 1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
>
> for every attempt at doing madvise(MADV_COLLAPSE).
>
>
> I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
> using huge pages for executable code that wasn't completely gross.
>
>
> I don't yet have a standalone repro, but can write one if that's helpful.

There's a fix:

https://lore.kernel.org/all/[email protected]/

Already in Andrew's pull for rc7 today:

https://lore.kernel.org/all/[email protected]/

--
Peter Xu


2023-06-20 22:19:33

by Andres Freund

Subject: Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

Hi,

On 2023-06-20 17:11:30 -0400, Peter Xu wrote:
> On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote:
> > On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Make sure that collapse_file doesn't interfere with checking the
> > > uptodate flag in the page cache by only inserting hpage into the page
> > > cache after it has been updated and marked uptodate. This is achieved by
> > > simply not replacing present pages with hpage when iterating over the
> > > target range.
> > >
> > > The present pages are already locked, so replacing them with the locked
> > > hpage before the collapse is finalized is unnecessary. However, it is
> > > necessary to stop freezing the present pages after validating them,
> > > since leaving long-term frozen pages in the page cache can lead to
> > > deadlocks. Simply checking the reference count is sufficient to ensure
> > > that there are no long-term references hanging around that the
> > > collapse would break. Similar to hpage, there is no reason that the
> > > present pages actually need to be frozen in addition to being locked.
> > >
> > > This fixes a race where folio_seek_hole_data would mistake hpage for
> > > a fallocated but unwritten page. This race is visible to userspace via
> > > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> > > a similar race where pages could temporarily disappear from mincore.
> > >
> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > Signed-off-by: David Stevens <[email protected]>
> >
> > I noticed that recently MADV_COLLAPSE stopped being able to collapse a
> > binary's executable code, always failing with EAGAIN. I bisected it down to
> > a2e17cc2efc7 - this commit.
> >
> > Using perf trace -e 'huge_memory:*' -a I see
> >
> > 1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
> > 1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> > 1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
> > 1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> > 1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
> > 1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> > 1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
> > 1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> >
> > for every attempt at doing madvise(MADV_COLLAPSE).
> >
> >
> > I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
> > using huge pages for executable code that wasn't completely gross.
> >
> >
> > I don't yet have a standalone repro, but can write one if that's helpful.
>
> There's a fix:
>
> https://lore.kernel.org/all/[email protected]/
>
> Already in Andrew's pull for rc7 today:
>
> https://lore.kernel.org/all/[email protected]/

Ah, great!

I can confirm that the fix unbreaks our use of MADV_COLLAPSE for executable
code...

Greetings,

Andres Freund