2024-06-10 12:02:29

by Lance Yang

Subject: [PATCH v7 0/4] Reclaim lazyfree THP without splitting

Hi all,

This series adds support for reclaiming PMD-mapped THP marked as lazyfree
without needing to first split the large folio via split_huge_pmd_address().

When the user no longer requires the pages, they can use madvise(MADV_FREE)
to mark the pages as lazy-free. Subsequently, they typically will not write
to that memory again.

During memory reclaim, if we detect that the large folio and its PMD are
both still clean and there are no unexpected references (such as GUP pins),
we can simply discard the memory, improving the efficiency of memory
reclamation in this case.
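
For reference, below is a minimal userspace sketch of that flow (not part of
this series; the 4 MiB size, the MADV_HUGEPAGE hint and the memset pattern
are illustrative assumptions, and whether a THP actually backs the range
depends on the system's THP settings):

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 4UL << 20; /* 4 MiB: room for at least one 2 MiB THP */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                madvise(p, len, MADV_HUGEPAGE); /* hint: back the range with THPs */
                memset(p, 0xaa, len);           /* fault the pages in and dirty them */

                /*
                 * Mark the range lazy-free. As long as it is not written to
                 * again, reclaim may discard it; with this series, a still-clean
                 * PMD-mapped THP in the range is discarded without being split.
                 */
                madvise(p, len, MADV_FREE);

                return 0;
        }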

Performance Testing
===================

On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
mem_cgroup_force_empty() results in the following runtimes in seconds
(shorter is better):

--------------------------------------------
|     Old     |     New     |    Change    |
--------------------------------------------
|  0.683426   |  0.049197   |   -92.80%    |
--------------------------------------------
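
The reclaim pass is driven through mem_cgroup_force_empty(), which the
cgroup v1 memory controller exposes via the memory.force_empty file. A
minimal sketch of triggering it from C is shown below (the memcg path is an
illustrative assumption; the full benchmark harness is not part of this
posting):

        #include <fcntl.h>
        #include <unistd.h>

        /* Ask the kernel to reclaim every page charged to the given memcg. */
        static int force_empty(const char *path)
        {
                /* e.g. "/sys/fs/cgroup/memory/thp-test/memory.force_empty" */
                int fd = open(path, O_WRONLY);

                if (fd < 0)
                        return -1;
                if (write(fd, "0", 1) != 1) {
                        close(fd);
                        return -1;
                }
                return close(fd);
        }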

---

Changes since v6 [6]
====================
- mm/rmap: remove duplicated exit code in pagewalk loop
  - Pick RB from David - thanks!
- mm/rmap: add helper to restart pgtable walk on changes
  - Add the page_vma_mapped_walk_restart() helper to handle scenarios
    where the page table walk needs to be restarted due to changes in
    the page table (suggested by David)
- mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
  - Pass 'pvmw.address' to split_huge_pmd_locked() (per David)
  - Drop the check for PMD-mapped THP that is missing the mlock (per David)
- mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
  - Rename the function __discard_trans_pmd_locked() to
    __discard_anon_folio_pmd_locked() (per David)

Changes since v5 [5]
====================
- mm/rmap: remove duplicated exit code in pagewalk loop
  - Pick RB from Baolin Wang - thanks!
- mm/mlock: check for THP missing the mlock in try_to_unmap_one()
  - Merge this patch into patch 2 (per Baolin Wang)
- mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
  - Mark a folio as being backed by swap space if the folio or its PMD
    was redirtied (per Baolin Wang)
  - Use pmdp_huge_clear_flush() to get and flush a PMD entry
    (per Baolin Wang)

Changes since v4 [4]
====================
- mm/rmap: remove duplicated exit code in pagewalk loop
  - Pick RB from Zi Yan - thanks!
- mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
  - Remove the redundant alignment (per Baolin Wang)
  - Set pvmw.ptl to NULL after unlocking the PTL (per Baolin Wang)
- mm/mlock: check for THP missing the mlock in try_to_unmap_one()
  - Check whether the mlock of PMD-mapped THP was missed
    (suggested by Baolin Wang)
- mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
  - No need to check the TTU_SPLIT_HUGE_PMD flag for unmap_huge_pmd_locked()
    (per Zi Yan)
  - Drain the local mlock batch after folio_remove_rmap_pmd()
    (per Baolin Wang)

Changes since v3 [3]
====================
- mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
  - Resolve compilation errors by handling the case where
    CONFIG_PGTABLE_HAS_HUGE_LEAVES is undefined (thanks to SeongJae Park)
- mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
  - Remove the unnecessary conditional compilation directives
    (thanks to Barry Song)
  - Resolve compilation errors due to undefined references to
    unmap_huge_pmd_locked and split_huge_pmd_locked (thanks to Barry)

Changes since v2 [2]
====================
- Update the changelog (thanks to David Hildenbrand)
- Support try_to_unmap_one() to unmap PMD-mapped folios
(thanks a lot to David Hildenbrand and Zi Yan)

Changes since v1 [1]
====================
- Update the changelog
- Follow the exact same logic as in try_to_unmap_one() (per David Hildenbrand)
- Remove the extra code from rmap.c (per Matthew Wilcox)

[1] https://lore.kernel.org/linux-mm/[email protected]
[2] https://lore.kernel.org/linux-mm/[email protected]
[3] https://lore.kernel.org/linux-mm/[email protected]
[4] https://lore.kernel.org/linux-mm/[email protected]
[5] https://lore.kernel.org/linux-mm/[email protected]
[6] https://lore.kernel.org/linux-mm/[email protected]

Lance Yang (4):
mm/rmap: remove duplicated exit code in pagewalk loop
mm/rmap: add helper to restart pgtable walk on changes
mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
mm/vmscan: avoid split lazyfree THP during shrink_folio_list()

include/linux/huge_mm.h | 15 +++++
include/linux/rmap.h | 22 ++++++++
mm/huge_memory.c | 122 +++++++++++++++++++++++++++++++++-------
mm/rmap.c | 77 ++++++++++++++-----------
4 files changed, 184 insertions(+), 52 deletions(-)


base-commit: 6b77d01728f219a091f55ca3142bbc4e4057ca50
--
2.33.1



2024-06-10 12:08:43

by Lance Yang

Subject: [PATCH v7 4/4] mm/vmscan: avoid split lazyfree THP during shrink_folio_list()

When the user no longer requires the pages, they can use
madvise(MADV_FREE) to mark the pages as lazy-free. Subsequently, they
typically will not write to that memory again.

During memory reclaim, if we detect that the large folio and its PMD are
both still clean and there are no unexpected references (such as GUP
pins), we can simply discard the memory, improving the efficiency of
memory reclamation in this case.

On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
mem_cgroup_force_empty() results in the following runtimes in seconds
(shorter is better):

--------------------------------------------
|     Old     |     New     |    Change    |
--------------------------------------------
|  0.683426   |  0.049197   |   -92.80%    |
--------------------------------------------

Suggested-by: Zi Yan <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Lance Yang <[email protected]>
---
include/linux/huge_mm.h | 9 +++++
mm/huge_memory.c | 80 +++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 36 +++++++++++++------
3 files changed, 114 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4670c6ee118b..020e2344eb86 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -417,6 +417,8 @@ static inline bool thp_migration_supported(void)

void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, bool freeze, struct folio *folio);
+bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmdp, struct folio *folio);

#else /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -484,6 +486,13 @@ static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
bool freeze, struct folio *folio) {}

+static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp,
+ struct folio *folio)
+{
+ return false;
+}
+
#define split_huge_pud(__vma, __pmd, __address) \
do { } while (0)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d2697cc8f9d4..19592d3f1167 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2687,6 +2687,86 @@ static void unmap_folio(struct folio *folio)
try_to_unmap_flush();
}

+static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp,
+ struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ struct mm_struct *mm = vma->vm_mm;
+ int ref_count, map_count;
+ pmd_t orig_pmd = *pmdp;
+ struct page *page;
+
+ if (unlikely(!pmd_present(orig_pmd) || !pmd_trans_huge(orig_pmd)))
+ return false;
+
+ page = pmd_page(orig_pmd);
+ if (unlikely(page_folio(page) != folio))
+ return false;
+
+ if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
+ folio_set_swapbacked(folio);
+ return false;
+ }
+
+ orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
+
+ /*
+ * Syncing against concurrent GUP-fast:
+ * - clear PMD; barrier; read refcount
+ * - inc refcount; barrier; read PMD
+ */
+ smp_mb();
+
+ ref_count = folio_ref_count(folio);
+ map_count = folio_mapcount(folio);
+
+ /*
+ * Order reads for folio refcount and dirty flag
+ * (see comments in __remove_mapping()).
+ */
+ smp_rmb();
+
+ /*
+ * If the folio or its PMD is redirtied at this point, or if there
+ * are unexpected references, we will give up to discard this folio
+ * and remap it.
+ *
+ * The only folio refs must be one from isolation plus the rmap(s).
+ */
+ if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
+ folio_set_swapbacked(folio);
+
+ if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
+ set_pmd_at(mm, addr, pmdp, orig_pmd);
+ return false;
+ }
+
+ folio_remove_rmap_pmd(folio, page, vma);
+ zap_deposited_table(mm, pmdp);
+ add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_drain_local();
+ folio_put(folio);
+
+ return true;
+}
+
+bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmdp, struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
+
+ if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
+ return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
+
+ return false;
+}
+
static void remap_page(struct folio *folio, unsigned long nr)
{
int i = 0;
diff --git a/mm/rmap.c b/mm/rmap.c
index b77f88695588..8e901636ade9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1630,6 +1630,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ bool pmd_mapped = false;

/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1676,16 +1677,24 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_done_err;
}

- if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
- /*
- * We temporarily have to drop the PTL and start once
- * again from that now-PTE-mapped page table.
- */
- split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
- false, folio);
- flags &= ~TTU_SPLIT_HUGE_PMD;
- page_vma_mapped_walk_restart(&pvmw);
- continue;
+ if (!pvmw.pte) {
+ pmd_mapped = true;
+ if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
+ folio))
+ goto walk_done;
+
+ if (flags & TTU_SPLIT_HUGE_PMD) {
+ /*
+ * We temporarily have to drop the PTL and start
+ * once again from that now-PTE-mapped page
+ * table.
+ */
+ split_huge_pmd_locked(vma, pvmw.address,
+ pvmw.pmd, false, folio);
+ flags &= ~TTU_SPLIT_HUGE_PMD;
+ page_vma_mapped_walk_restart(&pvmw);
+ continue;
+ }
}

/* Unexpected PMD-mapped THP? */
@@ -1813,7 +1822,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
if (unlikely(folio_test_swapbacked(folio) !=
folio_test_swapcache(folio))) {
- WARN_ON_ONCE(1);
+ /*
+ * unmap_huge_pmd_locked() will unmark a
+ * PMD-mapped folio as lazyfree if the folio or
+ * its PMD was redirtied.
+ */
+ WARN_ON_ONCE(!pmd_mapped);
goto walk_done_err;
}

--
2.33.1


2024-06-10 12:25:41

by Lance Yang

Subject: [PATCH v7 1/4] mm/rmap: remove duplicated exit code in pagewalk loop

Introduce the labels walk_done and walk_done_err as exit points to
eliminate duplicated exit code in the pagewalk loop.

Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Lance Yang <[email protected]>
---
mm/rmap.c | 40 +++++++++++++++-------------------------
1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..ddffa30c79fb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1679,9 +1679,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* Restore the mlock which got missed */
if (!folio_test_large(folio))
mlock_vma_folio(folio, vma);
- page_vma_mapped_walk_done(&pvmw);
- ret = false;
- break;
+ goto walk_done_err;
}

pfn = pte_pfn(ptep_get(pvmw.pte));
@@ -1719,11 +1717,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
if (!anon) {
VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
- if (!hugetlb_vma_trylock_write(vma)) {
- page_vma_mapped_walk_done(&pvmw);
- ret = false;
- break;
- }
+ if (!hugetlb_vma_trylock_write(vma))
+ goto walk_done_err;
if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
hugetlb_vma_unlock_write(vma);
flush_tlb_range(vma,
@@ -1738,8 +1733,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* actual page and drop map count
* to zero.
*/
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done;
}
hugetlb_vma_unlock_write(vma);
}
@@ -1811,9 +1805,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (unlikely(folio_test_swapbacked(folio) !=
folio_test_swapcache(folio))) {
WARN_ON_ONCE(1);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}

/* MADV_FREE page check */
@@ -1852,23 +1844,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
set_pte_at(mm, address, pvmw.pte, pteval);
folio_set_swapbacked(folio);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}

if (swap_duplicate(entry) < 0) {
set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
swap_free(entry);
set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}

/* See folio_try_share_anon_rmap(): clear PTE first. */
@@ -1876,9 +1862,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
folio_try_share_anon_rmap_pte(folio, subpage)) {
swap_free(entry);
set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}
if (list_empty(&mm->mmlist)) {
spin_lock(&mmlist_lock);
@@ -1918,6 +1902,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
folio_put(folio);
+ continue;
+walk_done_err:
+ ret = false;
+walk_done:
+ page_vma_mapped_walk_done(&pvmw);
+ break;
}

mmu_notifier_invalidate_range_end(&range);
--
2.33.1


2024-06-10 12:26:00

by Lance Yang

Subject: [PATCH v7 2/4] mm/rmap: add helper to restart pgtable walk on changes

Introduce the page_vma_mapped_walk_restart() helper to handle scenarios
where the page table walk needs to be restarted due to changes in the page
table, such as when a PMD is split. It releases the PTL held during the
previous walk and resets the state, allowing a new walk to start at the
current address stored in pvmw->address.

Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Lance Yang <[email protected]>
---
include/linux/rmap.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7229b9baf20d..5f18509610cc 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -710,6 +710,28 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
spin_unlock(pvmw->ptl);
}

+/**
+ * page_vma_mapped_walk_restart - Restart the page table walk.
+ * @pvmw: Pointer to struct page_vma_mapped_walk.
+ *
+ * It restarts the page table walk when changes occur in the page
+ * table, such as splitting a PMD. Ensures that the PTL held during
+ * the previous walk is released and resets the state to allow for
+ * a new walk starting at the current address stored in pvmw->address.
+ */
+static inline void
+page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
+{
+ WARN_ON_ONCE(!pvmw->pmd);
+ WARN_ON_ONCE(!pvmw->ptl);
+
+ if (pvmw->ptl)
+ spin_unlock(pvmw->ptl);
+
+ pvmw->ptl = NULL;
+ pvmw->pmd = NULL;
+}
+
bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);

/*
--
2.33.1


2024-06-10 12:27:33

by Lance Yang

Subject: [PATCH v7 3/4] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
folios, start the pagewalk first, then call split_huge_pmd_address() to
split the folio.

Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Baolin Wang <[email protected]>
Signed-off-by: Lance Yang <[email protected]>
---
include/linux/huge_mm.h | 6 ++++++
mm/huge_memory.c | 42 +++++++++++++++++++++--------------------
mm/rmap.c | 21 +++++++++++++++------
3 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 088d66a54643..4670c6ee118b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -415,6 +415,9 @@ static inline bool thp_migration_supported(void)
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
}

+void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, bool freeze, struct folio *folio);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */

static inline bool folio_test_pmd_mappable(struct folio *folio)
@@ -477,6 +480,9 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, bool freeze, struct folio *folio) {}
static inline void split_huge_pmd_address(struct vm_area_struct *vma,
unsigned long address, bool freeze, struct folio *folio) {}
+static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ bool freeze, struct folio *folio) {}

#define split_huge_pud(__vma, __pmd, __address) \
do { } while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e6d26c2eb670..d2697cc8f9d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2581,6 +2581,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pmd_populate(mm, pmd, pgtable);
}

+void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, bool freeze, struct folio *folio)
+{
+ VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
+ VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
+ VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
+ VM_BUG_ON(freeze && !folio);
+
+ /*
+ * When the caller requests to set up a migration entry, we
+ * require a folio to check the PMD against. Otherwise, there
+ * is a risk of replacing the wrong folio.
+ */
+ if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
+ is_pmd_migration_entry(*pmd)) {
+ if (folio && folio != pmd_folio(*pmd))
+ return;
+ __split_huge_pmd_locked(vma, pmd, address, freeze);
+ }
+}
+
void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, bool freeze, struct folio *folio)
{
@@ -2592,26 +2613,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
ptl = pmd_lock(vma->vm_mm, pmd);
-
- /*
- * If caller asks to setup a migration entry, we need a folio to check
- * pmd against. Otherwise we can end up replacing wrong folio.
- */
- VM_BUG_ON(freeze && !folio);
- VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
-
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
- /*
- * It's safe to call pmd_page when folio is set because it's
- * guaranteed that pmd is present.
- */
- if (folio && folio != pmd_folio(*pmd))
- goto out;
- __split_huge_pmd_locked(vma, pmd, range.start, freeze);
- }
-
-out:
+ split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(&range);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index ddffa30c79fb..b77f88695588 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (flags & TTU_SYNC)
pvmw.flags = PVMW_SYNC;

- if (flags & TTU_SPLIT_HUGE_PMD)
- split_huge_pmd_address(vma, address, false, folio);
-
/*
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
@@ -1668,9 +1665,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);

while (page_vma_mapped_walk(&pvmw)) {
- /* Unexpected PMD-mapped THP? */
- VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-
/*
* If the folio is in an mlock()d vma, we must not swap it out.
*/
@@ -1682,6 +1676,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_done_err;
}

+ if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
+ /*
+ * We temporarily have to drop the PTL and start once
+ * again from that now-PTE-mapped page table.
+ */
+ split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
+ false, folio);
+ flags &= ~TTU_SPLIT_HUGE_PMD;
+ page_vma_mapped_walk_restart(&pvmw);
+ continue;
+ }
+
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_FOLIO(!pvmw.pte, folio);
+
pfn = pte_pfn(ptep_get(pvmw.pte));
subpage = folio_page(folio, pfn - folio_pfn(folio));
address = pvmw.address;
--
2.33.1


2024-06-13 06:56:02

by Lance Yang

Subject: Re: [PATCH v7 4/4] mm/vmscan: avoid split lazyfree THP during shrink_folio_list()


On Mon, Jun 10, 2024 at 8:08 PM Lance Yang <[email protected]> wrote:
>
[...]
> +static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> + unsigned long addr, pmd_t *pmdp,
> + struct folio *folio)
> +{
> + VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +
> + struct mm_struct *mm = vma->vm_mm;
> + int ref_count, map_count;
> + pmd_t orig_pmd = *pmdp;
> + struct page *page;
> +
> + if (unlikely(!pmd_present(orig_pmd) || !pmd_trans_huge(orig_pmd)))
> + return false;
> +
> + page = pmd_page(orig_pmd);
> + if (unlikely(page_folio(page) != folio))
> + return false;
> +
> + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
> + folio_set_swapbacked(folio);
> + return false;
> + }
> +
> + orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> +
> + /*
> + * Syncing against concurrent GUP-fast:
> + * - clear PMD; barrier; read refcount
> + * - inc refcount; barrier; read PMD
> + */
> + smp_mb();
> +
> + ref_count = folio_ref_count(folio);
> + map_count = folio_mapcount(folio);
> +
> + /*
> + * Order reads for folio refcount and dirty flag
> + * (see comments in __remove_mapping()).
> + */
> + smp_rmb();
> +
> + /*
> + * If the folio or its PMD is redirtied at this point, or if there
> + * are unexpected references, we will give up to discard this folio
> + * and remap it.
> + *
> + * The only folio refs must be one from isolation plus the rmap(s).
> + */
> + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> + folio_set_swapbacked(folio);
> +
> + if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
> + set_pmd_at(mm, addr, pmdp, orig_pmd);
> + return false;
> + }
> +
> + folio_remove_rmap_pmd(folio, page, vma);
> + zap_deposited_table(mm, pmdp);
> + add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_drain_local();
> + folio_put(folio);
> +
> + return true;
> +}
[...]
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b77f88695588..8e901636ade9 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1630,6 +1630,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> enum ttu_flags flags = (enum ttu_flags)(long)arg;
> unsigned long pfn;
> unsigned long hsz = 0;
> + bool pmd_mapped = false;
>
> /*
> * When racing against e.g. zap_pte_range() on another cpu,
> @@ -1676,16 +1677,24 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> goto walk_done_err;
> }
>
> - if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
> - /*
> - * We temporarily have to drop the PTL and start once
> - * again from that now-PTE-mapped page table.
> - */
> - split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> - false, folio);
> - flags &= ~TTU_SPLIT_HUGE_PMD;
> - page_vma_mapped_walk_restart(&pvmw);
> - continue;
> + if (!pvmw.pte) {
> + pmd_mapped = true;
> + if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> + folio))
> + goto walk_done;
> +
> + if (flags & TTU_SPLIT_HUGE_PMD) {
> + /*
> + * We temporarily have to drop the PTL and start
> + * once again from that now-PTE-mapped page
> + * table.
> + */
> + split_huge_pmd_locked(vma, pvmw.address,
> + pvmw.pmd, false, folio);
> + flags &= ~TTU_SPLIT_HUGE_PMD;
> + page_vma_mapped_walk_restart(&pvmw);
> + continue;
> + }
> }
>
> /* Unexpected PMD-mapped THP? */
> @@ -1813,7 +1822,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> if (unlikely(folio_test_swapbacked(folio) !=
> folio_test_swapcache(folio))) {
> - WARN_ON_ONCE(1);
> + /*
> + * unmap_huge_pmd_locked() will unmark a
> + * PMD-mapped folio as lazyfree if the folio or
> + * its PMD was redirtied.
> + */
> + WARN_ON_ONCE(!pmd_mapped);

Damn it. I forgot to remap the folio to the page table since we've cleared
the PTE here.

But it seems we need neither to do that nor to unmark a PMD-mapped folio
as lazyfree in unmap_huge_pmd_locked(); the next block[1] will catch the
folio and do the same thing.

Hi David and Baolin, what do you think?

> goto walk_done_err;
> }

[1] https://elixir.bootlin.com/linux/v6.10-rc3/source/mm/rmap.c#L1820

/* MADV_FREE page check */
if (!folio_test_swapbacked(folio)) {
int ref_count, map_count;

[...]
/*
* The only page refs must be one from isolation
* plus the rmap(s) (dropped by discard:).
*/
if (ref_count == 1 + map_count &&
!folio_test_dirty(folio)) {
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}

/*
* If the folio was redirtied, it cannot be
* discarded. Remap the page to page table.
*/
set_pte_at(mm, address, pvmw.pte, pteval);
folio_set_swapbacked(folio);
goto walk_done_err;
}


Thanks,
Lance

>
> --
> 2.33.1
>

2024-06-13 07:37:18

by Lance Yang

Subject: Re: [PATCH v7 4/4] mm/vmscan: avoid split lazyfree THP during shrink_folio_list()

Hi Andrew,

I'd like to fix the bug[1] I mentioned previously. Could you please fold the
following changes into this patch?

[1] https://lore.kernel.org/linux-mm/[email protected]/

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f409ea9fcc18..425374ae06ed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2707,10 +2707,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
if (unlikely(page_folio(page) != folio))
return false;

- if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
- folio_set_swapbacked(folio);
+ if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
return false;
- }

orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);

@@ -2737,10 +2735,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
*
* The only folio refs must be one from isolation plus the rmap(s).
*/
- if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
- folio_set_swapbacked(folio);
-
- if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
+ if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
+ ref_count != map_count + 1) {
set_pmd_at(mm, addr, pmdp, orig_pmd);
return false;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index b9e5943c8349..393e2c11c44c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1824,12 +1824,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
if (unlikely(folio_test_swapbacked(folio) !=
folio_test_swapcache(folio))) {
- /*
- * unmap_huge_pmd_locked() will unmark a
- * PMD-mapped folio as lazyfree if the folio or
- * its PMD was redirtied.
- */
- WARN_ON_ONCE(!pmd_mapped);
+ WARN_ON_ONCE(1);
goto walk_done_err;
}

--

Thanks,
Lance

2024-06-13 07:53:04

by Barry Song

Subject: Re: [PATCH v7 1/4] mm/rmap: remove duplicated exit code in pagewalk loop

On Tue, Jun 11, 2024 at 12:02 AM Lance Yang <[email protected]> wrote:
>
> Introduce the labels walk_done and walk_done_err as exit points to
> eliminate duplicated exit code in the pagewalk loop.
>
> Reviewed-by: Zi Yan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: Lance Yang <[email protected]>

I don't think "return false" necessarily indicates an error, so
"walk_done_err" doesn't seem like an appropriate name.
However, this is a minor issue.

Reviewed-by: Barry Song <[email protected]>

> ---
> mm/rmap.c | 40 +++++++++++++++-------------------------
> 1 file changed, 15 insertions(+), 25 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e8fc5ecb59b2..ddffa30c79fb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1679,9 +1679,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> /* Restore the mlock which got missed */
> if (!folio_test_large(folio))
> mlock_vma_folio(folio, vma);
> - page_vma_mapped_walk_done(&pvmw);
> - ret = false;
> - break;
> + goto walk_done_err;
> }
>
> pfn = pte_pfn(ptep_get(pvmw.pte));
> @@ -1719,11 +1717,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> if (!anon) {
> VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> - if (!hugetlb_vma_trylock_write(vma)) {
> - page_vma_mapped_walk_done(&pvmw);
> - ret = false;
> - break;
> - }
> + if (!hugetlb_vma_trylock_write(vma))
> + goto walk_done_err;
> if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> hugetlb_vma_unlock_write(vma);
> flush_tlb_range(vma,
> @@ -1738,8 +1733,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * actual page and drop map count
> * to zero.
> */
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + goto walk_done;
> }
> hugetlb_vma_unlock_write(vma);
> }
> @@ -1811,9 +1805,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> if (unlikely(folio_test_swapbacked(folio) !=
> folio_test_swapcache(folio))) {
> WARN_ON_ONCE(1);
> - ret = false;
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + goto walk_done_err;
> }
>
> /* MADV_FREE page check */
> @@ -1852,23 +1844,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> set_pte_at(mm, address, pvmw.pte, pteval);
> folio_set_swapbacked(folio);
> - ret = false;
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + goto walk_done_err;
> }
>
> if (swap_duplicate(entry) < 0) {
> set_pte_at(mm, address, pvmw.pte, pteval);
> - ret = false;
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + goto walk_done_err;
> }
> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> swap_free(entry);
> set_pte_at(mm, address, pvmw.pte, pteval);
> - ret = false;
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + goto walk_done_err;
> }
>
> /* See folio_try_share_anon_rmap(): clear PTE first. */
> @@ -1876,9 +1862,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> folio_try_share_anon_rmap_pte(folio, subpage)) {
> swap_free(entry);
> set_pte_at(mm, address, pvmw.pte, pteval);
> - ret = false;
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + goto walk_done_err;
> }
> if (list_empty(&mm->mmlist)) {
> spin_lock(&mmlist_lock);
> @@ -1918,6 +1902,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> if (vma->vm_flags & VM_LOCKED)
> mlock_drain_local();
> folio_put(folio);
> + continue;
> +walk_done_err:
> + ret = false;
> +walk_done:
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> }
>
> mmu_notifier_invalidate_range_end(&range);
> --
> 2.33.1
>

2024-06-13 08:20:52

by Baolin Wang

Subject: Re: [PATCH v7 4/4] mm/vmscan: avoid split lazyfree THP during shrink_folio_list()



On 2024/6/13 15:28, Lance Yang wrote:
> Hi Andrew,
>
> I'd like to fix the bug[1] I mentioned previously. Could you please fold the
> following changes into this patch?
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f409ea9fcc18..425374ae06ed 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2707,10 +2707,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> if (unlikely(page_folio(page) != folio))
> return false;
>
> - if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
> - folio_set_swapbacked(folio);
> + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> return false;
> - }
>
> orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
>
> @@ -2737,10 +2735,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> *
> * The only folio refs must be one from isolation plus the rmap(s).
> */
> - if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> - folio_set_swapbacked(folio);
> -
> - if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
> + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> + ref_count != map_count + 1) {
> set_pmd_at(mm, addr, pmdp, orig_pmd);
> return false;
> }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b9e5943c8349..393e2c11c44c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1824,12 +1824,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> if (unlikely(folio_test_swapbacked(folio) !=
> folio_test_swapcache(folio))) {
> - /*
> - * unmap_huge_pmd_locked() will unmark a
> - * PMD-mapped folio as lazyfree if the folio or
> - * its PMD was redirtied.
> - */
> - WARN_ON_ONCE(!pmd_mapped);
> + WARN_ON_ONCE(1);
> goto walk_done_err;
> }
>

You can also drop the unused 'pmd_mapped' variable now.

2024-06-13 08:30:28

by David Hildenbrand

Subject: Re: [PATCH v7 1/4] mm/rmap: remove duplicated exit code in pagewalk loop

On 13.06.24 09:52, Barry Song wrote:
> On Tue, Jun 11, 2024 at 12:02 AM Lance Yang <[email protected]> wrote:
>>
>> Introduce the labels walk_done and walk_done_err as exit points to
>> eliminate duplicated exit code in the pagewalk loop.
>>
>> Reviewed-by: Zi Yan <[email protected]>
>> Reviewed-by: Baolin Wang <[email protected]>
>> Reviewed-by: David Hildenbrand <[email protected]>
>> Signed-off-by: Lance Yang <[email protected]>
>
> I don't think "return false" necessarily indicates an error, so
> "walk_done_err" doesn't seem like an appropriate name.
> However, this is a minor issue.

Agreed. As we only have a single walk_done user, should we instead
remove "walk_done", keep the "page_vma_mapped_walk_done" for that single
user, and rename "walk_done_err" to "abort_walk" ?


--
Cheers,

David / dhildenb


2024-06-13 08:35:21

by David Hildenbrand

Subject: Re: [PATCH v7 2/4] mm/rmap: add helper to restart pgtable walk on changes

On 10.06.24 14:02, Lance Yang wrote:
> Introduce the page_vma_mapped_walk_restart() helper to handle scenarios
> where the page table walk needs to be restarted due to changes in the page
> table, such as when a PMD is split. It releases the PTL held during the
> previous walk and resets the state, allowing a new walk to start at the
> current address stored in pvmw->address.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Lance Yang <[email protected]>
> ---
> include/linux/rmap.h | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 7229b9baf20d..5f18509610cc 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -710,6 +710,28 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> spin_unlock(pvmw->ptl);
> }
>
> +/**
> + * page_vma_mapped_walk_restart - Restart the page table walk.
> + * @pvmw: Pointer to struct page_vma_mapped_walk.
> + *
> + * It restarts the page table walk when changes occur in the page
> + * table, such as splitting a PMD. Ensures that the PTL held during
> + * the previous walk is released and resets the state to allow for
> + * a new walk starting at the current address stored in pvmw->address.
> + */
> +static inline void
> +page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> +{
> + WARN_ON_ONCE(!pvmw->pmd);

Can we have this more general, like

WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);

And then setting both to NULL below?


> + WARN_ON_ONCE(!pvmw->ptl);

This is confusing: you check for ptl below. What would be clearer is

if (likely(pvmw->ptl))
spin_unlock(pvmw->ptl);
else
WARN_ON_ONCE(1);


> +
> + if (pvmw->ptl)
> + spin_unlock(pvmw->ptl);
> +
> + pvmw->ptl = NULL;
> + pvmw->pmd = NULL;
> +}
> +
> bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
>
> /*

I'd suggest squashing that into the next patch.

--
Cheers,

David / dhildenb


2024-06-13 08:36:49

by David Hildenbrand

Subject: Re: [PATCH v7 3/4] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

On 10.06.24 14:06, Lance Yang wrote:
> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> folios, start the pagewalk first, then call split_huge_pmd_address() to
> split the folio.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Suggested-by: Baolin Wang <[email protected]>
> Signed-off-by: Lance Yang <[email protected]>
> ---
> include/linux/huge_mm.h | 6 ++++++
> mm/huge_memory.c | 42 +++++++++++++++++++++--------------------
> mm/rmap.c | 21 +++++++++++++++------
> 3 files changed, 43 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 088d66a54643..4670c6ee118b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -415,6 +415,9 @@ static inline bool thp_migration_supported(void)
> return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
> }
>
> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmd, bool freeze, struct folio *folio);
> +
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> static inline bool folio_test_pmd_mappable(struct folio *folio)
> @@ -477,6 +480,9 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long address, bool freeze, struct folio *folio) {}
> static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> unsigned long address, bool freeze, struct folio *folio) {}
> +static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + bool freeze, struct folio *folio) {}
>
> #define split_huge_pud(__vma, __pmd, __address) \
> do { } while (0)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e6d26c2eb670..d2697cc8f9d4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2581,6 +2581,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> pmd_populate(mm, pmd, pgtable);
> }
>
> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmd, bool freeze, struct folio *folio)
> +{
> + VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
> + VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> + VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> + VM_BUG_ON(freeze && !folio);

Curious: could we actually end up here without a folio right now? That
would mean that try_to_unmap_one() would be called with folio==NULL.

> +
> + /*
> + * When the caller requests to set up a migration entry, we
> + * require a folio to check the PMD against. Otherwise, there
> + * is a risk of replacing the wrong folio.
> + */
> + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> + is_pmd_migration_entry(*pmd)) {
> + if (folio && folio != pmd_folio(*pmd))
> + return;
> + __split_huge_pmd_locked(vma, pmd, address, freeze);
> + }
> +}
> +
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long address, bool freeze, struct folio *folio)
> {
> @@ -2592,26 +2613,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
> ptl = pmd_lock(vma->vm_mm, pmd);
> -
> - /*
> - * If caller asks to setup a migration entry, we need a folio to check
> - * pmd against. Otherwise we can end up replacing wrong folio.
> - */
> - VM_BUG_ON(freeze && !folio);
> - VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> -
> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> - is_pmd_migration_entry(*pmd)) {
> - /*
> - * It's safe to call pmd_page when folio is set because it's
> - * guaranteed that pmd is present.
> - */
> - if (folio && folio != pmd_folio(*pmd))
> - goto out;
> - __split_huge_pmd_locked(vma, pmd, range.start, freeze);
> - }
> -
> -out:
> + split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
> spin_unlock(ptl);
> mmu_notifier_invalidate_range_end(&range);
> }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ddffa30c79fb..b77f88695588 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> if (flags & TTU_SYNC)
> pvmw.flags = PVMW_SYNC;
>
> - if (flags & TTU_SPLIT_HUGE_PMD)
> - split_huge_pmd_address(vma, address, false, folio);
> -
> /*
> * For THP, we have to assume the worse case ie pmd for invalidation.
> * For hugetlb, it could be much worse if we need to do pud
> @@ -1668,9 +1665,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> mmu_notifier_invalidate_range_start(&range);
>
> while (page_vma_mapped_walk(&pvmw)) {
> - /* Unexpected PMD-mapped THP? */
> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> -
> /*
> * If the folio is in an mlock()d vma, we must not swap it out.
> */
> @@ -1682,6 +1676,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> goto walk_done_err;
> }
>
> + if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
> + /*
> + * We temporarily have to drop the PTL and start once
> + * again from that now-PTE-mapped page table.
> + */
> + split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> + false, folio);
> + flags &= ~TTU_SPLIT_HUGE_PMD;
> + page_vma_mapped_walk_restart(&pvmw);

If, for some reason, split_huge_pmd_locked() would fail, we would keep
looping and never hit the VM_BUG_ON_FOLIO() below. Maybe we'd want to
let split_huge_pmd_locked() return whether splitting succeeded, and
handle that case differently?

> + continue;
> + }
> +
> + /* Unexpected PMD-mapped THP? */
> + VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> +
> pfn = pte_pfn(ptep_get(pvmw.pte));
> subpage = folio_page(folio, pfn - folio_pfn(folio));
> address = pvmw.address;

--
Cheers,

David / dhildenb


2024-06-13 08:48:20

by David Hildenbrand

Subject: Re: [PATCH v7 3/4] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

On 13.06.24 10:34, David Hildenbrand wrote:
> On 10.06.24 14:06, Lance Yang wrote:
>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>> split the folio.
>>
>> Suggested-by: David Hildenbrand <[email protected]>
>> Suggested-by: Baolin Wang <[email protected]>
>> Signed-off-by: Lance Yang <[email protected]>
>> ---
>> include/linux/huge_mm.h | 6 ++++++
>> mm/huge_memory.c | 42 +++++++++++++++++++++--------------------
>> mm/rmap.c | 21 +++++++++++++++------
>> 3 files changed, 43 insertions(+), 26 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 088d66a54643..4670c6ee118b 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -415,6 +415,9 @@ static inline bool thp_migration_supported(void)
>> return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
>> }
>>
>> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> + pmd_t *pmd, bool freeze, struct folio *folio);
>> +
>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> static inline bool folio_test_pmd_mappable(struct folio *folio)
>> @@ -477,6 +480,9 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long address, bool freeze, struct folio *folio) {}
>> static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>> unsigned long address, bool freeze, struct folio *folio) {}
>> +static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmd,
>> + bool freeze, struct folio *folio) {}
>>
>> #define split_huge_pud(__vma, __pmd, __address) \
>> do { } while (0)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e6d26c2eb670..d2697cc8f9d4 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2581,6 +2581,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> pmd_populate(mm, pmd, pgtable);
>> }
>>
>> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> + pmd_t *pmd, bool freeze, struct folio *folio)
>> +{
>> + VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
>> + VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>> + VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
>> + VM_BUG_ON(freeze && !folio);
>
> Curious: could we actually end up here without a folio right now? That
> would mean, that try_to_unmap_one() would be called with folio==NULL.
>
>> +
>> + /*
>> + * When the caller requests to set up a migration entry, we
>> + * require a folio to check the PMD against. Otherwise, there
>> + * is a risk of replacing the wrong folio.
>> + */
>> + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>> + is_pmd_migration_entry(*pmd)) {
>> + if (folio && folio != pmd_folio(*pmd))
>> + return;
>> + __split_huge_pmd_locked(vma, pmd, address, freeze);
>> + }
>> +}
>> +
>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long address, bool freeze, struct folio *folio)
>> {
>> @@ -2592,26 +2613,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>> mmu_notifier_invalidate_range_start(&range);
>> ptl = pmd_lock(vma->vm_mm, pmd);
>> -
>> - /*
>> - * If caller asks to setup a migration entry, we need a folio to check
>> - * pmd against. Otherwise we can end up replacing wrong folio.
>> - */
>> - VM_BUG_ON(freeze && !folio);
>> - VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
>> -
>> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>> - is_pmd_migration_entry(*pmd)) {
>> - /*
>> - * It's safe to call pmd_page when folio is set because it's
>> - * guaranteed that pmd is present.
>> - */
>> - if (folio && folio != pmd_folio(*pmd))
>> - goto out;
>> - __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>> - }
>> -
>> -out:
>> + split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
>> spin_unlock(ptl);
>> mmu_notifier_invalidate_range_end(&range);
>> }
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index ddffa30c79fb..b77f88695588 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> if (flags & TTU_SYNC)
>> pvmw.flags = PVMW_SYNC;
>>
>> - if (flags & TTU_SPLIT_HUGE_PMD)
>> - split_huge_pmd_address(vma, address, false, folio);
>> -
>> /*
>> * For THP, we have to assume the worse case ie pmd for invalidation.
>> * For hugetlb, it could be much worse if we need to do pud
>> @@ -1668,9 +1665,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> mmu_notifier_invalidate_range_start(&range);
>>
>> while (page_vma_mapped_walk(&pvmw)) {
>> - /* Unexpected PMD-mapped THP? */
>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>> -
>> /*
>> * If the folio is in an mlock()d vma, we must not swap it out.
>> */
>> @@ -1682,6 +1676,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> goto walk_done_err;
>> }
>>
>> + if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
>> + /*
>> + * We temporarily have to drop the PTL and start once
>> + * again from that now-PTE-mapped page table.
>> + */
>> + split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
>> + false, folio);
>> + flags &= ~TTU_SPLIT_HUGE_PMD;
>> + page_vma_mapped_walk_restart(&pvmw);
>
> If, for some reason, split_huge_pmd_locked() would fail, we would keep
> looping and never hit the VM_BUG_ON_FOLIO() below. Maybe we'd want to
> let split_huge_pmd_locked() return whether splitting succeeded, and
> handle that case differently?

I assume it could fail if we race with concurrent split? Or isn't that
possible?

--
Cheers,

David / dhildenb


2024-06-13 08:56:54

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v7 4/4] mm/vmscan: avoid split lazyfree THP during shrink_folio_list()

On Thu, Jun 13, 2024 at 4:20 PM Baolin Wang
<[email protected]> wrote:
>
>
>
> On 2024/6/13 15:28, Lance Yang wrote:
> > Hi Andrew,
> >
> > I'd like to fix the bug[1] I mentioned previously. Could you please fold the
> > following changes into this patch?
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index f409ea9fcc18..425374ae06ed 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2707,10 +2707,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > if (unlikely(page_folio(page) != folio))
> > return false;
> >
> > - if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
> > - folio_set_swapbacked(folio);
> > + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > return false;
> > - }
> >
> > orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> >
> > @@ -2737,10 +2735,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > *
> > * The only folio refs must be one from isolation plus the rmap(s).
> > */
> > - if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > - folio_set_swapbacked(folio);
> > -
> > - if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
> > + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > + ref_count != map_count + 1) {
> > set_pmd_at(mm, addr, pmdp, orig_pmd);
> > return false;
> > }
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index b9e5943c8349..393e2c11c44c 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1824,12 +1824,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > */
> > if (unlikely(folio_test_swapbacked(folio) !=
> > folio_test_swapcache(folio))) {
> > - /*
> > - * unmap_huge_pmd_locked() will unmark a
> > - * PMD-mapped folio as lazyfree if the folio or
> > - * its PMD was redirtied.
> > - */
> > - WARN_ON_ONCE(!pmd_mapped);
> > + WARN_ON_ONCE(1);
> > goto walk_done_err;
> > }
> >
>
> You can also drop the unused 'pmd_mapped' variable now.

Good catch!

Will drop it in the next version.

Thanks,
Lance

2024-06-13 09:15:49

by David Hildenbrand

Subject: Re: [PATCH v7 3/4] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

On 13.06.24 10:45, David Hildenbrand wrote:
> On 13.06.24 10:34, David Hildenbrand wrote:
>> On 10.06.24 14:06, Lance Yang wrote:
>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>> split the folio.
>>>
>>> Suggested-by: David Hildenbrand <[email protected]>
>>> Suggested-by: Baolin Wang <[email protected]>
>>> Signed-off-by: Lance Yang <[email protected]>
>>> ---
>>> include/linux/huge_mm.h | 6 ++++++
>>> mm/huge_memory.c | 42 +++++++++++++++++++++--------------------
>>> mm/rmap.c | 21 +++++++++++++++------
>>> 3 files changed, 43 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 088d66a54643..4670c6ee118b 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -415,6 +415,9 @@ static inline bool thp_migration_supported(void)
>>> return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
>>> }
>>>
>>> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>> + pmd_t *pmd, bool freeze, struct folio *folio);
>>> +
>>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>
>>> static inline bool folio_test_pmd_mappable(struct folio *folio)
>>> @@ -477,6 +480,9 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> unsigned long address, bool freeze, struct folio *folio) {}
>>> static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>>> unsigned long address, bool freeze, struct folio *folio) {}
>>> +static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>>> + unsigned long address, pmd_t *pmd,
>>> + bool freeze, struct folio *folio) {}
>>>
>>> #define split_huge_pud(__vma, __pmd, __address) \
>>> do { } while (0)
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index e6d26c2eb670..d2697cc8f9d4 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2581,6 +2581,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> pmd_populate(mm, pmd, pgtable);
>>> }
>>>
>>> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>> + pmd_t *pmd, bool freeze, struct folio *folio)
>>> +{
>>> + VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>> + VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
>>> + VM_BUG_ON(freeze && !folio);
>>
>> Curious: could we actually end up here without a folio right now? That
>> would mean, that try_to_unmap_one() would be called with folio==NULL.
>>
>>> +
>>> + /*
>>> + * When the caller requests to set up a migration entry, we
>>> + * require a folio to check the PMD against. Otherwise, there
>>> + * is a risk of replacing the wrong folio.
>>> + */
>>> + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>>> + is_pmd_migration_entry(*pmd)) {
>>> + if (folio && folio != pmd_folio(*pmd))
>>> + return;
>>> + __split_huge_pmd_locked(vma, pmd, address, freeze);
>>> + }
>>> +}
>>> +
>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> unsigned long address, bool freeze, struct folio *folio)
>>> {
>>> @@ -2592,26 +2613,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>>> mmu_notifier_invalidate_range_start(&range);
>>> ptl = pmd_lock(vma->vm_mm, pmd);
>>> -
>>> - /*
>>> - * If caller asks to setup a migration entry, we need a folio to check
>>> - * pmd against. Otherwise we can end up replacing wrong folio.
>>> - */
>>> - VM_BUG_ON(freeze && !folio);
>>> - VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
>>> -
>>> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>>> - is_pmd_migration_entry(*pmd)) {
>>> - /*
>>> - * It's safe to call pmd_page when folio is set because it's
>>> - * guaranteed that pmd is present.
>>> - */
>>> - if (folio && folio != pmd_folio(*pmd))
>>> - goto out;
>>> - __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>>> - }
>>> -
>>> -out:
>>> + split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
>>> spin_unlock(ptl);
>>> mmu_notifier_invalidate_range_end(&range);
>>> }
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index ddffa30c79fb..b77f88695588 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> if (flags & TTU_SYNC)
>>> pvmw.flags = PVMW_SYNC;
>>>
>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>> - split_huge_pmd_address(vma, address, false, folio);
>>> -
>>> /*
>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>> * For hugetlb, it could be much worse if we need to do pud
>>> @@ -1668,9 +1665,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> mmu_notifier_invalidate_range_start(&range);
>>>
>>> while (page_vma_mapped_walk(&pvmw)) {
>>> - /* Unexpected PMD-mapped THP? */
>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>> -
>>> /*
>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>> */
>>> @@ -1682,6 +1676,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> goto walk_done_err;
>>> }
>>>
>>> + if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
>>> + /*
>>> + * We temporarily have to drop the PTL and start once
>>> + * again from that now-PTE-mapped page table.
>>> + */
>>> + split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
>>> + false, folio);
>>> + flags &= ~TTU_SPLIT_HUGE_PMD;
>>> + page_vma_mapped_walk_restart(&pvmw);
>>
>> If, for some reason, split_huge_pmd_locked() would fail, we would keep
>> looping and never hit the VM_BUG_ON_FOLIO() below. Maybe we'd want to
>> let split_huge_pmd_locked() return whether splitting succeeded, and
>> handle that case differently?
>
> I assume it could fail if we race with concurrent split? Or isn't that
> possible?
>

(I assume we hold the PTL, so such races should not be possible)

--
Cheers,

David / dhildenb


2024-06-13 09:21:39

by Lance Yang

Subject: Re: [PATCH v7 3/4] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

On Thu, Jun 13, 2024 at 4:34 PM David Hildenbrand <[email protected]> wrote:
>
> On 10.06.24 14:06, Lance Yang wrote:
> > In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> > folios, start the pagewalk first, then call split_huge_pmd_address() to
> > split the folio.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Suggested-by: Baolin Wang <[email protected]>
> > Signed-off-by: Lance Yang <[email protected]>
> > ---
> > include/linux/huge_mm.h | 6 ++++++
> > mm/huge_memory.c | 42 +++++++++++++++++++++--------------------
> > mm/rmap.c | 21 +++++++++++++++------
> > 3 files changed, 43 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 088d66a54643..4670c6ee118b 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -415,6 +415,9 @@ static inline bool thp_migration_supported(void)
> > return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
> > }
> >
> > +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > + pmd_t *pmd, bool freeze, struct folio *folio);
> > +
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > static inline bool folio_test_pmd_mappable(struct folio *folio)
> > @@ -477,6 +480,9 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > unsigned long address, bool freeze, struct folio *folio) {}
> > static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> > unsigned long address, bool freeze, struct folio *folio) {}
> > +static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> > + unsigned long address, pmd_t *pmd,
> > + bool freeze, struct folio *folio) {}
> >
> > #define split_huge_pud(__vma, __pmd, __address) \
> > do { } while (0)
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e6d26c2eb670..d2697cc8f9d4 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2581,6 +2581,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > pmd_populate(mm, pmd, pgtable);
> > }
> >
> > +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > + pmd_t *pmd, bool freeze, struct folio *folio)
> > +{
> > + VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
> > + VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> > + VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> > + VM_BUG_ON(freeze && !folio);
>
> Curious: could we actually end up here without a folio right now? That
> would mean, that try_to_unmap_one() would be called with folio==NULL.

try_to_unmap_one() would not be called with folio==NULL, I guess.

I just moved 'VM_BUG_ON(freeze && !folio)' from __split_huge_pmd() to here,
and now __split_huge_pmd() will call split_huge_pmd_locked().
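
To make the two call paths concrete, a small sketch (the calls are taken
from the patch below; the comments are only illustration, not part of
the series):

        /*
         * Path 1: the rmap pagewalk in try_to_unmap_one(). Here folio
         * is never NULL and freeze is always false.
         */
        split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, false, folio);

        /*
         * Path 2: __split_huge_pmd(). On this path folio may be NULL,
         * which is why VM_BUG_ON(freeze && !folio) now lives in the
         * shared helper.
         */
        split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);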

>
> > +
> > + /*
> > + * When the caller requests to set up a migration entry, we
> > + * require a folio to check the PMD against. Otherwise, there
> > + * is a risk of replacing the wrong folio.
> > + */
> > + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> > + is_pmd_migration_entry(*pmd)) {
> > + if (folio && folio != pmd_folio(*pmd))
> > + return;
> > + __split_huge_pmd_locked(vma, pmd, address, freeze);
> > + }
> > +}
> > +
> > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > unsigned long address, bool freeze, struct folio *folio)
> > {
> > @@ -2592,26 +2613,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
> > mmu_notifier_invalidate_range_start(&range);
> > ptl = pmd_lock(vma->vm_mm, pmd);
> > -
> > - /*
> > - * If caller asks to setup a migration entry, we need a folio to check
> > - * pmd against. Otherwise we can end up replacing wrong folio.
> > - */
> > - VM_BUG_ON(freeze && !folio);
> > - VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> > -
> > - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> > - is_pmd_migration_entry(*pmd)) {
> > - /*
> > - * It's safe to call pmd_page when folio is set because it's
> > - * guaranteed that pmd is present.
> > - */
> > - if (folio && folio != pmd_folio(*pmd))
> > - goto out;
> > - __split_huge_pmd_locked(vma, pmd, range.start, freeze);
> > - }
> > -
> > -out:
> > + split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
> > spin_unlock(ptl);
> > mmu_notifier_invalidate_range_end(&range);
> > }
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index ddffa30c79fb..b77f88695588 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > if (flags & TTU_SYNC)
> > pvmw.flags = PVMW_SYNC;
> >
> > - if (flags & TTU_SPLIT_HUGE_PMD)
> > - split_huge_pmd_address(vma, address, false, folio);
> > -
> > /*
> > * For THP, we have to assume the worse case ie pmd for invalidation.
> > * For hugetlb, it could be much worse if we need to do pud
> > @@ -1668,9 +1665,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > mmu_notifier_invalidate_range_start(&range);
> >
> > while (page_vma_mapped_walk(&pvmw)) {
> > - /* Unexpected PMD-mapped THP? */
> > - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> > -
> > /*
> > * If the folio is in an mlock()d vma, we must not swap it out.
> > */
> > @@ -1682,6 +1676,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > goto walk_done_err;
> > }
> >
> > + if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
> > + /*
> > + * We temporarily have to drop the PTL and start once
> > + * again from that now-PTE-mapped page table.
> > + */
> > + split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > + false, folio);
> > + flags &= ~TTU_SPLIT_HUGE_PMD;
> > + page_vma_mapped_walk_restart(&pvmw);
>
> If, for some reason, split_huge_pmd_locked() would fail, we would keep
> looping and never hit the VM_BUG_ON_FOLIO() below. Maybe we'd want to
> let split_huge_pmd_locked() return whether splitting succeeded, and
> handle that case differently?

Hmm... after calling split_huge_pmd_locked(), we also do
"flags &= ~TTU_SPLIT_HUGE_PMD", which prevents re-entry into this
block; so if split_huge_pmd_locked() fails, the restarted walk still
sees pvmw.pte == NULL and hits the VM_BUG_ON_FOLIO() below instead of
looping forever, IIUC.
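
A quick sketch of that flow (identifiers from the hunk above; the
comments are mine, just for illustration):

        while (page_vma_mapped_walk(&pvmw)) {
                if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
                        /* One-shot attempt to split the PMD mapping. */
                        split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
                                              false, folio);
                        /* Never take this branch again for this walk. */
                        flags &= ~TTU_SPLIT_HUGE_PMD;
                        page_vma_mapped_walk_restart(&pvmw);
                        continue;
                }

                /*
                 * If the split did not happen, the restarted walk still
                 * has pvmw.pte == NULL, so this fires instead of looping.
                 */
                VM_BUG_ON_FOLIO(!pvmw.pte, folio);
                ...
        }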

What do you think?

Thanks,
Lance

>
> > + continue;
> > + }
> > +
> > + /* Unexpected PMD-mapped THP? */
> > + VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> > +
> > pfn = pte_pfn(ptep_get(pvmw.pte));
> > subpage = folio_page(folio, pfn - folio_pfn(folio));
> > address = pvmw.address;
>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-13 09:23:32

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] mm/rmap: remove duplicated exit code in pagewalk loop

On Thu, Jun 13, 2024 at 4:27 PM David Hildenbrand <[email protected]> wrote:
>
> On 13.06.24 09:52, Barry Song wrote:
> > On Tue, Jun 11, 2024 at 12:02 AM Lance Yang <[email protected]> wrote:
> >>
> >> Introduce the labels walk_done and walk_done_err as exit points to
> >> eliminate duplicated exit code in the pagewalk loop.
> >>
> >> Reviewed-by: Zi Yan <[email protected]>
> >> Reviewed-by: Baolin Wang <[email protected]>
> >> Reviewed-by: David Hildenbrand <[email protected]>
> >> Signed-off-by: Lance Yang <[email protected]>
> >

Hi Barry and David,

Thanks for taking time to review!

> > I don't think "return false" necessarily indicates an error, so
> > "walk_done_err" doesn't seem like an appropriate name.
> > However, this is a minor issue.
>
> Agreed. As we only have a single walk_done user, should we instead
> remove "walk_done", keep the "page_vma_mapped_walk_done" for that single
> user, and rename "walk_done_err" to "abort_walk" ?

Yeah, I agree that 'abort_walk' is better than 'walk_done_err', and let's
keep 'page_vma_mapped_walk_done' for that single user ;)

Thanks again for the suggestions!
Lance

>
>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-13 09:26:35

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v7 2/4] mm/rmap: add helper to restart pgtable walk on changes

On Thu, Jun 13, 2024 at 4:30 PM David Hildenbrand <[email protected]> wrote:
>
> On 10.06.24 14:02, Lance Yang wrote:
> > Introduce the page_vma_mapped_walk_restart() helper to handle scenarios
> > where the page table walk needs to be restarted due to changes in the page
> > table, such as when a PMD is split. It releases the PTL held during the
> > previous walk and resets the state, allowing a new walk to start at the
> > current address stored in pvmw->address.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Lance Yang <[email protected]>
> > ---
> > include/linux/rmap.h | 22 ++++++++++++++++++++++
> > 1 file changed, 22 insertions(+)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 7229b9baf20d..5f18509610cc 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -710,6 +710,28 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> > spin_unlock(pvmw->ptl);
> > }
> >
> > +/**
> > + * page_vma_mapped_walk_restart - Restart the page table walk.
> > + * @pvmw: Pointer to struct page_vma_mapped_walk.
> > + *
> > + * It restarts the page table walk when changes occur in the page
> > + * table, such as splitting a PMD. Ensures that the PTL held during
> > + * the previous walk is released and resets the state to allow for
> > + * a new walk starting at the current address stored in pvmw->address.
> > + */
> > +static inline void
> > +page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> > +{
> > + WARN_ON_ONCE(!pvmw->pmd);
>
> Can we have this more general, like
>
> WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);

Sure, let’s make it more general.

>
> And then setting both to NULL below?
>
>
> > + WARN_ON_ONCE(!pvmw->ptl);
>
> This is confusing: you check for ptl below. What would be clearer is
>
> if (likely(pvmw->ptl))
> spin_unlock(pvmw->ptl);
> else
> WARN_ON_ONCE(1);

Will adjust as you suggested, thanks!
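
Something like this, I suppose (just a sketch of the adjusted helper
folding in both suggestions; the final version in the next revision may
differ):

        static inline void
        page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
        {
                WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);

                /* Release the PTL taken during the previous iteration. */
                if (likely(pvmw->ptl))
                        spin_unlock(pvmw->ptl);
                else
                        WARN_ON_ONCE(1);

                /* Reset state; the walk restarts at pvmw->address. */
                pvmw->ptl = NULL;
                pvmw->pmd = NULL;
                pvmw->pte = NULL;
        }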

>
>
> > +
> > + if (pvmw->ptl)
> > + spin_unlock(pvmw->ptl);
> > +
> > + pvmw->ptl = NULL;
> > + pvmw->pmd = NULL;
> > +}
> > +
> > bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> >
> > /*
>
> I'd suggest squashing that into the next patch.

Got it.

Thanks,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-13 09:52:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 3/4] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

On 13.06.24 11:21, Lance Yang wrote:
> On Thu, Jun 13, 2024 at 4:34 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 10.06.24 14:06, Lance Yang wrote:
>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>> split the folio.
>>>
>>> Suggested-by: David Hildenbrand <[email protected]>
>>> Suggested-by: Baolin Wang <[email protected]>
>>> Signed-off-by: Lance Yang <[email protected]>
>>> ---
>>> include/linux/huge_mm.h | 6 ++++++
>>> mm/huge_memory.c | 42 +++++++++++++++++++++--------------------
>>> mm/rmap.c | 21 +++++++++++++++------
>>> 3 files changed, 43 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 088d66a54643..4670c6ee118b 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -415,6 +415,9 @@ static inline bool thp_migration_supported(void)
>>> return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
>>> }
>>>
>>> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>> + pmd_t *pmd, bool freeze, struct folio *folio);
>>> +
>>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>
>>> static inline bool folio_test_pmd_mappable(struct folio *folio)
>>> @@ -477,6 +480,9 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> unsigned long address, bool freeze, struct folio *folio) {}
>>> static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>>> unsigned long address, bool freeze, struct folio *folio) {}
>>> +static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>>> + unsigned long address, pmd_t *pmd,
>>> + bool freeze, struct folio *folio) {}
>>>
>>> #define split_huge_pud(__vma, __pmd, __address) \
>>> do { } while (0)
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index e6d26c2eb670..d2697cc8f9d4 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2581,6 +2581,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> pmd_populate(mm, pmd, pgtable);
>>> }
>>>
>>> +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>> + pmd_t *pmd, bool freeze, struct folio *folio)
>>> +{
>>> + VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>> + VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
>>> + VM_BUG_ON(freeze && !folio);
>>
>> Curious: could we actually end up here without a folio right now? That
>> would mean, that try_to_unmap_one() would be called with folio==NULL.
>
> try_to_unmap_one() would not be called with folio==NULL, I guess.
>
> I just moved 'VM_BUG_ON(freeze && !folio)' from __split_huge_pmd() to here,
> and now __split_huge_pmd() will call split_huge_pmd_locked().

Right, on that path we could likely get !folio (e.g., a plain
split_huge_pmd() passes folio == NULL down to __split_huge_pmd()).

>
>>
>>> +
>>> + /*
>>> + * When the caller requests to set up a migration entry, we
>>> + * require a folio to check the PMD against. Otherwise, there
>>> + * is a risk of replacing the wrong folio.
>>> + */
>>> + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>>> + is_pmd_migration_entry(*pmd)) {
>>> + if (folio && folio != pmd_folio(*pmd))
>>> + return;
>>> + __split_huge_pmd_locked(vma, pmd, address, freeze);
>>> + }
>>> +}
>>> +
>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> unsigned long address, bool freeze, struct folio *folio)
>>> {
>>> @@ -2592,26 +2613,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>>> mmu_notifier_invalidate_range_start(&range);
>>> ptl = pmd_lock(vma->vm_mm, pmd);
>>> -
>>> - /*
>>> - * If caller asks to setup a migration entry, we need a folio to check
>>> - * pmd against. Otherwise we can end up replacing wrong folio.
>>> - */
>>> - VM_BUG_ON(freeze && !folio);
>>> - VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
>>> -
>>> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>>> - is_pmd_migration_entry(*pmd)) {
>>> - /*
>>> - * It's safe to call pmd_page when folio is set because it's
>>> - * guaranteed that pmd is present.
>>> - */
>>> - if (folio && folio != pmd_folio(*pmd))
>>> - goto out;
>>> - __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>>> - }
>>> -
>>> -out:
>>> + split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
>>> spin_unlock(ptl);
>>> mmu_notifier_invalidate_range_end(&range);
>>> }
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index ddffa30c79fb..b77f88695588 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> if (flags & TTU_SYNC)
>>> pvmw.flags = PVMW_SYNC;
>>>
>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>> - split_huge_pmd_address(vma, address, false, folio);
>>> -
>>> /*
>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>> * For hugetlb, it could be much worse if we need to do pud
>>> @@ -1668,9 +1665,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> mmu_notifier_invalidate_range_start(&range);
>>>
>>> while (page_vma_mapped_walk(&pvmw)) {
>>> - /* Unexpected PMD-mapped THP? */
>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>> -
>>> /*
>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>> */
>>> @@ -1682,6 +1676,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> goto walk_done_err;
>>> }
>>>
>>> + if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
>>> + /*
>>> + * We temporarily have to drop the PTL and start once
>>> + * again from that now-PTE-mapped page table.
>>> + */
>>> + split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
>>> + false, folio);
>>> + flags &= ~TTU_SPLIT_HUGE_PMD;
>>> + page_vma_mapped_walk_restart(&pvmw);
>>
>> If, for some reason, split_huge_pmd_locked() would fail, we would keep
>> looping and never hit the VM_BUG_ON_FOLIO() below. Maybe we'd want to
>> let split_huge_pmd_locked() return whether splitting succeeded, and
>> handle that case differently?
>
> Hmm... after calling split_huge_pmd_locked(), we also do
> "flags &= ~TTU_SPLIT_HUGE_PMD", which prevents re-entry into this
> block; so if split_huge_pmd_locked() fails, the restarted walk still
> sees pvmw.pte == NULL and hits the VM_BUG_ON_FOLIO() below instead of
> looping forever, IIUC.
>
> What do you think?

Missed that, you are right.

--
Cheers,

David / dhildenb


2024-06-13 13:08:54

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] mm/rmap: remove duplicated exit code in pagewalk loop

On Thu, Jun 13, 2024 at 4:49 PM Lance Yang <[email protected]> wrote:
>
> On Thu, Jun 13, 2024 at 4:27 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 13.06.24 09:52, Barry Song wrote:
> > > On Tue, Jun 11, 2024 at 12:02 AM Lance Yang <[email protected]> wrote:
> > >>
> > >> Introduce the labels walk_done and walk_done_err as exit points to
> > >> eliminate duplicated exit code in the pagewalk loop.
> > >>
> > >> Reviewed-by: Zi Yan <[email protected]>
> > >> Reviewed-by: Baolin Wang <[email protected]>
> > >> Reviewed-by: David Hildenbrand <[email protected]>
> > >> Signed-off-by: Lance Yang <[email protected]>
> > >
>
> Hi Barry and David,
>
> Thanks for taking time to review!
>
> > > I don't think "return false" necessarily indicates an error, so
> > > "walk_done_err" doesn't seem like an appropriate name.
> > > However, this is a minor issue.
> >
> > Agreed. As we only have a single walk_done user, should we instead
> > remove "walk_done", keep the "page_vma_mapped_walk_done" for that single
> > user, and rename "walk_done_err" to "abort_walk" ?
>
> Yeah, I agree that 'abort_walk' is better than 'walk_done_err', and let's
> keep 'page_vma_mapped_walk_done' for that single user ;)

I just realized that there is another walk_done user, which is
unmap_huge_pmd_locked().

Could I keep "walk_done" but rename it to "done_walk"?

Thanks,
Lance

>
> Thanks again for the suggestions!
> Lance
>
> >
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

2024-06-13 13:37:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] mm/rmap: remove duplicated exit code in pagewalk loop

On 13.06.24 14:43, Lance Yang wrote:
> On Thu, Jun 13, 2024 at 4:49 PM Lance Yang <[email protected]> wrote:
>>
>> On Thu, Jun 13, 2024 at 4:27 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 13.06.24 09:52, Barry Song wrote:
>>>> On Tue, Jun 11, 2024 at 12:02 AM Lance Yang <[email protected]> wrote:
>>>>>
>>>>> Introduce the labels walk_done and walk_done_err as exit points to
>>>>> eliminate duplicated exit code in the pagewalk loop.
>>>>>
>>>>> Reviewed-by: Zi Yan <[email protected]>
>>>>> Reviewed-by: Baolin Wang <[email protected]>
>>>>> Reviewed-by: David Hildenbrand <[email protected]>
>>>>> Signed-off-by: Lance Yang <[email protected]>
>>>>
>>
>> Hi Barry and David,
>>
>> Thanks for taking time to review!
>>
>>>> I don't think "return false" necessarily indicates an error, so
>>>> "walk_done_err" doesn't seem like an appropriate name.
>>>> However, this is a minor issue.
>>>
>>> Agreed. As we only have a single walk_done user, should we instead
>>> remove "walk_done", keep the "page_vma_mapped_walk_done" for that single
>>> user, and rename "walk_done_err" to "abort_walk" ?
>>
>> Yeah, I agree that 'abort_walk' is better than 'walk_done_err', and let's
>> keep 'page_vma_mapped_walk_done' for that single user ;)
>
> I just realized that there is another walk_done user, which is
> unmap_huge_pmd_locked().
>
> Could I keep "walk_done" but rename it to "done_walk"?

Sure. "walk_done"/"walk_abort" might sound better.

--
Cheers,

David / dhildenb


2024-06-13 13:46:12

by Lance Yang

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] mm/rmap: remove duplicated exit code in pagewalk loop

On Thu, Jun 13, 2024 at 9:29 PM David Hildenbrand <[email protected]> wrote:
>
> On 13.06.24 14:43, Lance Yang wrote:
> > On Thu, Jun 13, 2024 at 4:49 PM Lance Yang <[email protected]> wrote:
> >>
> >> On Thu, Jun 13, 2024 at 4:27 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 13.06.24 09:52, Barry Song wrote:
> >>>> On Tue, Jun 11, 2024 at 12:02 AM Lance Yang <[email protected]> wrote:
> >>>>>
> >>>>> Introduce the labels walk_done and walk_done_err as exit points to
> >>>>> eliminate duplicated exit code in the pagewalk loop.
> >>>>>
> >>>>> Reviewed-by: Zi Yan <[email protected]>
> >>>>> Reviewed-by: Baolin Wang <[email protected]>
> >>>>> Reviewed-by: David Hildenbrand <[email protected]>
> >>>>> Signed-off-by: Lance Yang <[email protected]>
> >>>>
> >>
> >> Hi Barry and David,
> >>
> >> Thanks for taking time to review!
> >>
> >>>> I don't think "return false" necessarily indicates an error, so
> >>>> "walk_done_err" doesn't seem like an appropriate name.
> >>>> However, this is a minor issue.
> >>>
> >>> Agreed. As we only have a single walk_done user, should we instead
> >>> remove "walk_done", keep the "page_vma_mapped_walk_done" for that single
> >>> user, and rename "walk_done_err" to "abort_walk" ?
> >>
> >> Yeah, I agree that 'abort_walk' is better than 'walk_done_err', and let's
> >> keep 'page_vma_mapped_walk_done' for that single user ;)
> >
> > I just realized that there is another walk_done user, which is
> > unmap_huge_pmd_locked().
> >
> > Could I keep "walk_done" but rename it to "done_walk"?
>
> Sure. "walk_done"/"walk_abort" might sound better.

Nice. Thanks for the suggestion!

Lance

>
> --
> Cheers,
>
> David / dhildenb
>