Hi all,
This series adds support for reclaiming PMD-mapped THP marked as lazyfree
without needing to first split the large folio via split_huge_pmd_address().
When the user no longer requires the pages, they can use madvise(MADV_FREE)
to mark the pages as lazyfree. Subsequently, they typically would not write
to that memory again.

During memory reclaim, if we detect that the large folio and its PMD are both
still marked as clean and there are no unexpected references (such as GUP),
we can just discard the memory lazily, improving the efficiency of memory
reclamation in this case.
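
For reference, below is a minimal userspace sketch of the lazyfree pattern
described above. It is only illustrative and not part of this series; the
2MiB size and the MADV_HUGEPAGE hint are assumptions (one PMD-sized THP on
x86-64), not requirements.

	/*
	 * Sketch: reserve a PMD-aligned region, ask for a THP, touch it,
	 * then mark it lazyfree so reclaim may discard it if it stays clean.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		const size_t len = 2UL << 20;	/* assumed PMD size: 2MiB */
		void *buf;

		if (posix_memalign(&buf, len, len))
			return 1;

		/* Request a huge page mapping and fault it in. */
		madvise(buf, len, MADV_HUGEPAGE);
		memset(buf, 0x5a, len);

		/*
		 * Mark the range lazyfree: as long as it is not written to
		 * again, reclaim may simply discard it instead of swapping
		 * it out.
		 */
		if (madvise(buf, len, MADV_FREE))
			perror("madvise(MADV_FREE)");

		return 0;
	}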
Performance Testing
===================
On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
mem_cgroup_force_empty() results in the following runtimes in seconds
(shorter is better):
--------------------------------------------
| Old | New | Change |
--------------------------------------------
| 0.683426 | 0.049197 | -92.80% |
--------------------------------------------
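
mem_cgroup_force_empty() is reachable through the cgroup v1 memory
controller's memory.force_empty file. A rough sketch of such a timing
harness is below; the cgroup path is hypothetical, and it assumes the
process that faulted in the lazyfree THPs has already been moved into
that memcg. This is only an illustration of the approach, not necessarily
the exact harness used for the numbers above.

	/*
	 * Rough sketch (hypothetical cgroup path): writing to the cgroup v1
	 * memory.force_empty file invokes mem_cgroup_force_empty(), which
	 * reclaims the pages charged to that memcg.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path =
			"/sys/fs/cgroup/memory/thp-test/memory.force_empty";
		struct timespec t0, t1;
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (write(fd, "0", 1) < 0)	/* trigger reclaim of the memcg */
			perror("write");
		clock_gettime(CLOCK_MONOTONIC, &t1);
		close(fd);

		printf("force_empty: %.6f s\n",
		       (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return 0;
	}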
---
Changes since v5 [5]
====================
- mm/rmap: remove duplicated exit code in pagewalk loop
  - Pick RB from Baolin Wang - thanks!
- mm/mlock: check for THP missing the mlock in try_to_unmap_one()
  - Merge this patch into patch 2 (per Baolin Wang)
- mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
  - Mark a folio as being backed by swap space if the folio or its PMD
    was redirtied (per Baolin Wang)
  - Use pmdp_huge_clear_flush() to get and flush a PMD entry
    (per Baolin Wang)

Changes since v4 [4]
====================
- mm/rmap: remove duplicated exit code in pagewalk loop
  - Pick RB from Zi Yan - thanks!
- mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
  - Remove the redundant alignment (per Baolin Wang)
  - Set pvmw.ptl to NULL after unlocking the PTL (per Baolin Wang)
- mm/mlock: check for THP missing the mlock in try_to_unmap_one()
  - Check whether the mlock of PMD-mapped THP was missed
    (suggested by Baolin Wang)
- mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
  - No need to check the TTU_SPLIT_HUGE_PMD flag for unmap_huge_pmd_locked()
    (per Zi Yan)
  - Drain the local mlock batch after folio_remove_rmap_pmd()
    (per Baolin Wang)

Changes since v3 [3]
====================
- mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
  - Resolve compilation errors by handling the case where
    CONFIG_PGTABLE_HAS_HUGE_LEAVES is undefined (thanks to SeongJae Park)
- mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
  - Remove the unnecessary conditional compilation directives
    (thanks to Barry Song)
  - Resolve compilation errors due to undefined references to
    unmap_huge_pmd_locked and split_huge_pmd_locked (thanks to Barry)

Changes since v2 [2]
====================
- Update the changelog (thanks to David Hildenbrand)
- Support try_to_unmap_one() to unmap PMD-mapped folios
  (thanks a lot to David Hildenbrand and Zi Yan)

Changes since v1 [1]
====================
- Update the changelog
- Follow the exact same logic as in try_to_unmap_one() (per David Hildenbrand)
- Remove the extra code from rmap.c (per Matthew Wilcox)
[1] https://lore.kernel.org/linux-mm/[email protected]
[2] https://lore.kernel.org/linux-mm/[email protected]
[3] https://lore.kernel.org/linux-mm/[email protected]
[4] https://lore.kernel.org/linux-mm/[email protected]
[5] https://lore.kernel.org/linux-mm/[email protected]
Lance Yang (3):
mm/rmap: remove duplicated exit code in pagewalk loop
mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
mm/vmscan: avoid split lazyfree THP during shrink_folio_list()
include/linux/huge_mm.h | 15 +++++
mm/huge_memory.c | 122 +++++++++++++++++++++++++++++++++-------
mm/rmap.c | 83 ++++++++++++++++-----------
3 files changed, 167 insertions(+), 53 deletions(-)
--
2.33.1
Introduce the labels walk_done and walk_done_err as exit points to
eliminate duplicated exit code in the pagewalk loop.
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Signed-off-by: Lance Yang <[email protected]>
---
mm/rmap.c | 40 +++++++++++++++-------------------------
1 file changed, 15 insertions(+), 25 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..ddffa30c79fb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1679,9 +1679,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* Restore the mlock which got missed */
if (!folio_test_large(folio))
mlock_vma_folio(folio, vma);
- page_vma_mapped_walk_done(&pvmw);
- ret = false;
- break;
+ goto walk_done_err;
}
pfn = pte_pfn(ptep_get(pvmw.pte));
@@ -1719,11 +1717,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
if (!anon) {
VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
- if (!hugetlb_vma_trylock_write(vma)) {
- page_vma_mapped_walk_done(&pvmw);
- ret = false;
- break;
- }
+ if (!hugetlb_vma_trylock_write(vma))
+ goto walk_done_err;
if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
hugetlb_vma_unlock_write(vma);
flush_tlb_range(vma,
@@ -1738,8 +1733,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* actual page and drop map count
* to zero.
*/
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done;
}
hugetlb_vma_unlock_write(vma);
}
@@ -1811,9 +1805,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (unlikely(folio_test_swapbacked(folio) !=
folio_test_swapcache(folio))) {
WARN_ON_ONCE(1);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}
/* MADV_FREE page check */
@@ -1852,23 +1844,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
set_pte_at(mm, address, pvmw.pte, pteval);
folio_set_swapbacked(folio);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}
if (swap_duplicate(entry) < 0) {
set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
swap_free(entry);
set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}
/* See folio_try_share_anon_rmap(): clear PTE first. */
@@ -1876,9 +1862,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
folio_try_share_anon_rmap_pte(folio, subpage)) {
swap_free(entry);
set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
+ goto walk_done_err;
}
if (list_empty(&mm->mmlist)) {
spin_lock(&mmlist_lock);
@@ -1918,6 +1902,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
folio_put(folio);
+ continue;
+walk_done_err:
+ ret = false;
+walk_done:
+ page_vma_mapped_walk_done(&pvmw);
+ break;
}
mmu_notifier_invalidate_range_end(&range);
--
2.33.1
In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
folios, start the pagewalk first, then call split_huge_pmd_address() to
split the folio.
Since the TTU_SPLIT_HUGE_PMD split will no longer be performed immediately,
we might encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range
during the page walk. It’s probably necessary to mlock this THP to prevent it
from being picked up during page reclaim.
Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Baolin Wang <[email protected]>
Signed-off-by: Lance Yang <[email protected]>
---
include/linux/huge_mm.h | 6 ++++++
mm/huge_memory.c | 42 +++++++++++++++++++++--------------------
mm/rmap.c | 26 ++++++++++++++++++-------
3 files changed, 47 insertions(+), 27 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c8d3ec116e29..9fcb0b0b6ed1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -409,6 +409,9 @@ static inline bool thp_migration_supported(void)
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
}
+void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, bool freeze, struct folio *folio);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline bool folio_test_pmd_mappable(struct folio *folio)
@@ -471,6 +474,9 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, bool freeze, struct folio *folio) {}
static inline void split_huge_pmd_address(struct vm_area_struct *vma,
unsigned long address, bool freeze, struct folio *folio) {}
+static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ bool freeze, struct folio *folio) {}
#define split_huge_pud(__vma, __pmd, __address) \
do { } while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 317de2afd371..425272c6c50b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2581,6 +2581,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pmd_populate(mm, pmd, pgtable);
}
+void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, bool freeze, struct folio *folio)
+{
+ VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
+ VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
+ VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
+ VM_BUG_ON(freeze && !folio);
+
+ /*
+ * When the caller requests to set up a migration entry, we
+ * require a folio to check the PMD against. Otherwise, there
+ * is a risk of replacing the wrong folio.
+ */
+ if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
+ is_pmd_migration_entry(*pmd)) {
+ if (folio && folio != pmd_folio(*pmd))
+ return;
+ __split_huge_pmd_locked(vma, pmd, address, freeze);
+ }
+}
+
void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, bool freeze, struct folio *folio)
{
@@ -2592,26 +2613,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
ptl = pmd_lock(vma->vm_mm, pmd);
-
- /*
- * If caller asks to setup a migration entry, we need a folio to check
- * pmd against. Otherwise we can end up replacing wrong folio.
- */
- VM_BUG_ON(freeze && !folio);
- VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
-
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
- /*
- * It's safe to call pmd_page when folio is set because it's
- * guaranteed that pmd is present.
- */
- if (folio && folio != pmd_folio(*pmd))
- goto out;
- __split_huge_pmd_locked(vma, pmd, range.start, freeze);
- }
-
-out:
+ split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(&range);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index ddffa30c79fb..08a93347f283 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (flags & TTU_SYNC)
pvmw.flags = PVMW_SYNC;
- if (flags & TTU_SPLIT_HUGE_PMD)
- split_huge_pmd_address(vma, address, false, folio);
-
/*
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
@@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);
while (page_vma_mapped_walk(&pvmw)) {
- /* Unexpected PMD-mapped THP? */
- VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-
/*
* If the folio is in an mlock()d vma, we must not swap it out.
*/
if (!(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED)) {
/* Restore the mlock which got missed */
- if (!folio_test_large(folio))
+ if (!folio_test_large(folio) ||
+ (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
mlock_vma_folio(folio, vma);
goto walk_done_err;
}
+ if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
+ /*
+ * We temporarily have to drop the PTL and start once
+ * again from that now-PTE-mapped page table.
+ */
+ split_huge_pmd_locked(vma, range.start, pvmw.pmd, false,
+ folio);
+ pvmw.pmd = NULL;
+ spin_unlock(pvmw.ptl);
+ pvmw.ptl = NULL;
+ flags &= ~TTU_SPLIT_HUGE_PMD;
+ continue;
+ }
+
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_FOLIO(!pvmw.pte, folio);
+
pfn = pte_pfn(ptep_get(pvmw.pte));
subpage = folio_page(folio, pfn - folio_pfn(folio));
address = pvmw.address;
--
2.33.1
When the user no longer requires the pages, they can use
madvise(MADV_FREE) to mark the pages as lazyfree. Subsequently, they
typically would not write to that memory again.

During memory reclaim, if we detect that the large folio and its PMD are
both still marked as clean and there are no unexpected references
(such as GUP), we can just discard the memory lazily, improving the
efficiency of memory reclamation in this case.
On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
mem_cgroup_force_empty() results in the following runtimes in seconds
(shorter is better):
--------------------------------------------
| Old | New | Change |
--------------------------------------------
| 0.683426 | 0.049197 | -92.80% |
--------------------------------------------
Suggested-by: Zi Yan <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Lance Yang <[email protected]>
---
include/linux/huge_mm.h | 9 +++++
mm/huge_memory.c | 80 +++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 41 ++++++++++++++-------
3 files changed, 117 insertions(+), 13 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9fcb0b0b6ed1..cfd7ec2b6d0a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -411,6 +411,8 @@ static inline bool thp_migration_supported(void)
void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, bool freeze, struct folio *folio);
+bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmdp, struct folio *folio);
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -478,6 +480,13 @@ static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
bool freeze, struct folio *folio) {}
+static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp,
+ struct folio *folio)
+{
+ return false;
+}
+
#define split_huge_pud(__vma, __pmd, __address) \
do { } while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 425272c6c50b..4793ffa912ca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2687,6 +2687,86 @@ static void unmap_folio(struct folio *folio)
try_to_unmap_flush();
}
+static bool __discard_trans_pmd_locked(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp,
+ struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ struct mm_struct *mm = vma->vm_mm;
+ int ref_count, map_count;
+ pmd_t orig_pmd = *pmdp;
+ struct page *page;
+
+ if (unlikely(!pmd_present(orig_pmd) || !pmd_trans_huge(orig_pmd)))
+ return false;
+
+ page = pmd_page(orig_pmd);
+ if (unlikely(page_folio(page) != folio))
+ return false;
+
+ if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
+ folio_set_swapbacked(folio);
+ return false;
+ }
+
+ orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
+
+ /*
+ * Syncing against concurrent GUP-fast:
+ * - clear PMD; barrier; read refcount
+ * - inc refcount; barrier; read PMD
+ */
+ smp_mb();
+
+ ref_count = folio_ref_count(folio);
+ map_count = folio_mapcount(folio);
+
+ /*
+ * Order reads for folio refcount and dirty flag
+ * (see comments in __remove_mapping()).
+ */
+ smp_rmb();
+
+ /*
+ * If the folio or its PMD is redirtied at this point, or if there
+ * are unexpected references, we will give up to discard this folio
+ * and remap it.
+ *
+ * The only folio refs must be one from isolation plus the rmap(s).
+ */
+ if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
+ folio_set_swapbacked(folio);
+
+ if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
+ set_pmd_at(mm, addr, pmdp, orig_pmd);
+ return false;
+ }
+
+ folio_remove_rmap_pmd(folio, page, vma);
+ zap_deposited_table(mm, pmdp);
+ add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_drain_local();
+ folio_put(folio);
+
+ return true;
+}
+
+bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmdp, struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
+
+ if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
+ return __discard_trans_pmd_locked(vma, addr, pmdp, folio);
+
+ return false;
+}
+
static void remap_page(struct folio *folio, unsigned long nr)
{
int i = 0;
diff --git a/mm/rmap.c b/mm/rmap.c
index 08a93347f283..249d6e305bec 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1630,6 +1630,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ bool pmd_mapped = false;
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1677,18 +1678,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_done_err;
}
- if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
- /*
- * We temporarily have to drop the PTL and start once
- * again from that now-PTE-mapped page table.
- */
- split_huge_pmd_locked(vma, range.start, pvmw.pmd, false,
- folio);
- pvmw.pmd = NULL;
- spin_unlock(pvmw.ptl);
- pvmw.ptl = NULL;
- flags &= ~TTU_SPLIT_HUGE_PMD;
- continue;
+ if (!pvmw.pte) {
+ pmd_mapped = true;
+ if (unmap_huge_pmd_locked(vma, range.start, pvmw.pmd,
+ folio))
+ goto walk_done;
+
+ if (flags & TTU_SPLIT_HUGE_PMD) {
+ /*
+ * We temporarily have to drop the PTL and start
+ * once again from that now-PTE-mapped page
+ * table.
+ */
+ split_huge_pmd_locked(vma, range.start,
+ pvmw.pmd, false, folio);
+ pvmw.pmd = NULL;
+ spin_unlock(pvmw.ptl);
+ pvmw.ptl = NULL;
+ flags &= ~TTU_SPLIT_HUGE_PMD;
+ continue;
+ }
}
/* Unexpected PMD-mapped THP? */
@@ -1816,7 +1825,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
if (unlikely(folio_test_swapbacked(folio) !=
folio_test_swapcache(folio))) {
- WARN_ON_ONCE(1);
+ /*
+ * unmap_huge_pmd_locked() will unmark a
+ * PMD-mapped folio as lazyfree if the folio or
+ * its PMD was redirtied.
+ */
+ if (!pmd_mapped)
+ WARN_ON_ONCE(1);
goto walk_done_err;
}
--
2.33.1
On 21.05.24 06:02, Lance Yang wrote:
> Introduce the labels walk_done and walk_done_err as exit points to
> eliminate duplicated exit code in the pagewalk loop.
>
> Reviewed-by: Zi Yan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> Signed-off-by: Lance Yang <[email protected]>
> ---
Sorry for the late review
Reviewed-by: David Hildenbrand <[email protected]>
--
Cheers,
David / dhildenb
On 21.05.24 06:02, Lance Yang wrote:
> When the user no longer requires the pages, they would use
> madvise(MADV_FREE) to mark the pages as lazy free. Subsequently, they
> typically would not re-write to that memory again.
>
> During memory reclaim, if we detect that the large folio and its PMD are
> both still marked as clean and there are no unexpected references
> (such as GUP), so we can just discard the memory lazily, improving the
> efficiency of memory reclamation in this case.
>
> On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
> mem_cgroup_force_empty() results in the following runtimes in seconds
> (shorter is better):
>
> --------------------------------------------
> | Old | New | Change |
> --------------------------------------------
> | 0.683426 | 0.049197 | -92.80% |
> --------------------------------------------
>
> Suggested-by: Zi Yan <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Lance Yang <[email protected]>
> ---
> include/linux/huge_mm.h | 9 +++++
> mm/huge_memory.c | 80 +++++++++++++++++++++++++++++++++++++++++
> mm/rmap.c | 41 ++++++++++++++-------
> 3 files changed, 117 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9fcb0b0b6ed1..cfd7ec2b6d0a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -411,6 +411,8 @@ static inline bool thp_migration_supported(void)
>
> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmd, bool freeze, struct folio *folio);
> +bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmdp, struct folio *folio);
>
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> @@ -478,6 +480,13 @@ static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> unsigned long address, pmd_t *pmd,
> bool freeze, struct folio *folio) {}
>
> +static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
> + unsigned long addr, pmd_t *pmdp,
> + struct folio *folio)
> +{
> + return false;
> +}
> +
> #define split_huge_pud(__vma, __pmd, __address) \
> do { } while (0)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 425272c6c50b..4793ffa912ca 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2687,6 +2687,86 @@ static void unmap_folio(struct folio *folio)
> try_to_unmap_flush();
> }
>
> +static bool __discard_trans_pmd_locked(struct vm_area_struct *vma,
Can we move towards folio terminology?
__discard_anon_folio_pmd_locked() or sth like that?
> + unsigned long addr, pmd_t *pmdp,
> + struct folio *folio)
> +{
> + VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +
> + struct mm_struct *mm = vma->vm_mm;
> + int ref_count, map_count;
> + pmd_t orig_pmd = *pmdp;
> + struct page *page;
> +
> + if (unlikely(!pmd_present(orig_pmd) || !pmd_trans_huge(orig_pmd)))
> + return false;
> +
> + page = pmd_page(orig_pmd);
> + if (unlikely(page_folio(page) != folio))
> + return false;
> +
> + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
> + folio_set_swapbacked(folio);
> + return false;
> + }
> +
> + orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> +
> + /*
> + * Syncing against concurrent GUP-fast:
> + * - clear PMD; barrier; read refcount
> + * - inc refcount; barrier; read PMD
> + */
> + smp_mb();
> +
> + ref_count = folio_ref_count(folio);
> + map_count = folio_mapcount(folio);
> +
> + /*
> + * Order reads for folio refcount and dirty flag
> + * (see comments in __remove_mapping()).
> + */
> + smp_rmb();
> +
> + /*
> + * If the folio or its PMD is redirtied at this point, or if there
> + * are unexpected references, we will give up to discard this folio
> + * and remap it.
> + *
> + * The only folio refs must be one from isolation plus the rmap(s).
> + */
> + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> + folio_set_swapbacked(folio);
> +
> + if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
> + set_pmd_at(mm, addr, pmdp, orig_pmd);
> + return false;
> + }
> +
> + folio_remove_rmap_pmd(folio, page, vma);
> + zap_deposited_table(mm, pmdp);
> + add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_drain_local();
> + folio_put(folio);
> +
> + return true;
> +}
> +
> +bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmdp, struct folio *folio)
> +{
> + VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> +
> + if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> + return __discard_trans_pmd_locked(vma, addr, pmdp, folio);
> +
> + return false;
> +}
> +
> static void remap_page(struct folio *folio, unsigned long nr)
> {
> int i = 0;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 08a93347f283..249d6e305bec 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1630,6 +1630,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> enum ttu_flags flags = (enum ttu_flags)(long)arg;
> unsigned long pfn;
> unsigned long hsz = 0;
> + bool pmd_mapped = false;
>
> /*
> * When racing against e.g. zap_pte_range() on another cpu,
> @@ -1677,18 +1678,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> goto walk_done_err;
> }
>
> - if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
> - /*
> - * We temporarily have to drop the PTL and start once
> - * again from that now-PTE-mapped page table.
> - */
> - split_huge_pmd_locked(vma, range.start, pvmw.pmd, false,
> - folio);
> - pvmw.pmd = NULL;
> - spin_unlock(pvmw.ptl);
> - pvmw.ptl = NULL;
> - flags &= ~TTU_SPLIT_HUGE_PMD;
> - continue;
> + if (!pvmw.pte) {
> + pmd_mapped = true;
> + if (unmap_huge_pmd_locked(vma, range.start, pvmw.pmd,
> + folio))
> + goto walk_done;
> +
> + if (flags & TTU_SPLIT_HUGE_PMD) {
> + /*
> + * We temporarily have to drop the PTL and start
> + * once again from that now-PTE-mapped page
> + * table.
> + */
> + split_huge_pmd_locked(vma, range.start,
> + pvmw.pmd, false, folio);
> + pvmw.pmd = NULL;
> + spin_unlock(pvmw.ptl);
> + pvmw.ptl = NULL;
> + flags &= ~TTU_SPLIT_HUGE_PMD;
> + continue;
> + }
> }
>
> /* Unexpected PMD-mapped THP? */
> @@ -1816,7 +1825,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> if (unlikely(folio_test_swapbacked(folio) !=
> folio_test_swapcache(folio))) {
> - WARN_ON_ONCE(1);
> + /*
> + * unmap_huge_pmd_locked() will unmark a
> + * PMD-mapped folio as lazyfree if the folio or
> + * its PMD was redirtied.
> + */
> + if (!pmd_mapped)
Isn't that simply "pvmw.pte" ?
Also,
WARN_ON_ONCE(!pmd_mapped);
> + WARN_ON_ONCE(1);
> goto walk_done_err;
> }
>
--
Cheers,
David / dhildenb
On Wed, Jun 5, 2024 at 8:34 PM David Hildenbrand <[email protected]> wrote:
>
> On 21.05.24 06:02, Lance Yang wrote:
> > Introduce the labels walk_done and walk_done_err as exit points to
> > eliminate duplicated exit code in the pagewalk loop.
> >
> > Reviewed-by: Zi Yan <[email protected]>
> > Reviewed-by: Baolin Wang <[email protected]>
> > Signed-off-by: Lance Yang <[email protected]>
> > ---
>
> Sorry for the late review
>
> Reviewed-by: David Hildenbrand <[email protected]>
No worries at all, I appreciate you taking time to review!
Thanks,
Lance
>
> --
> Cheers,
>
> David / dhildenb
>
On 21.05.24 06:02, Lance Yang wrote:
> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> folios, start the pagewalk first, then call split_huge_pmd_address() to
> split the folio.
>
> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> the page walk. It’s probably necessary to mlock this THP to prevent it from
> being picked up during page reclaim.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Suggested-by: Baolin Wang <[email protected]>
> Signed-off-by: Lance Yang <[email protected]>
> ---
[...] again, sorry for the late review.
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ddffa30c79fb..08a93347f283 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> if (flags & TTU_SYNC)
> pvmw.flags = PVMW_SYNC;
>
> - if (flags & TTU_SPLIT_HUGE_PMD)
> - split_huge_pmd_address(vma, address, false, folio);
> -
> /*
> * For THP, we have to assume the worse case ie pmd for invalidation.
> * For hugetlb, it could be much worse if we need to do pud
> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> mmu_notifier_invalidate_range_start(&range);
>
> while (page_vma_mapped_walk(&pvmw)) {
> - /* Unexpected PMD-mapped THP? */
> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> -
> /*
> * If the folio is in an mlock()d vma, we must not swap it out.
> */
> if (!(flags & TTU_IGNORE_MLOCK) &&
> (vma->vm_flags & VM_LOCKED)) {
> /* Restore the mlock which got missed */
> - if (!folio_test_large(folio))
> + if (!folio_test_large(folio) ||
> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> mlock_vma_folio(folio, vma);
Can you elaborate why you think this would be required? If we would have
performed the split_huge_pmd_address() beforehand, we would still be
left with a large folio, no?
> goto walk_done_err;
> }
>
> + if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
> + /*
> + * We temporarily have to drop the PTL and start once
> + * again from that now-PTE-mapped page table.
> + */
> + split_huge_pmd_locked(vma, range.start, pvmw.pmd, false,
> + folio);
Using range.start here is a bit weird. Wouldn't this be pvmw.address?
[did not check]
> + pvmw.pmd = NULL;
> + spin_unlock(pvmw.ptl);
> + pvmw.ptl = NULL;
Would we want a
page_vma_mapped_walk_restart() that is exactly for that purpose?
> + flags &= ~TTU_SPLIT_HUGE_PMD;
> + continue;
> + }
> +
> + /* Unexpected PMD-mapped THP? */
> + VM_BUG_ON_FOLIO(!pvmw.pte, folio);
--
Cheers,
David / dhildenb
On 05.06.24 16:20, Lance Yang wrote:
> Hi David,
>
> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 21.05.24 06:02, Lance Yang wrote:
>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>> split the folio.
>>>
>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
>>> being picked up during page reclaim.
>>>
>>> Suggested-by: David Hildenbrand <[email protected]>
>>> Suggested-by: Baolin Wang <[email protected]>
>>> Signed-off-by: Lance Yang <[email protected]>
>>> ---
>>
>> [...] again, sorry for the late review.
>
> No worries at all, thanks for taking time to review!
>
>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index ddffa30c79fb..08a93347f283 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> if (flags & TTU_SYNC)
>>> pvmw.flags = PVMW_SYNC;
>>>
>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>> - split_huge_pmd_address(vma, address, false, folio);
>>> -
>>> /*
>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>> * For hugetlb, it could be much worse if we need to do pud
>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> mmu_notifier_invalidate_range_start(&range);
>>>
>>> while (page_vma_mapped_walk(&pvmw)) {
>>> - /* Unexpected PMD-mapped THP? */
>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>> -
>>> /*
>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>> */
>>> if (!(flags & TTU_IGNORE_MLOCK) &&
>>> (vma->vm_flags & VM_LOCKED)) {
>>> /* Restore the mlock which got missed */
>>> - if (!folio_test_large(folio))
>>> + if (!folio_test_large(folio) ||
>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>> mlock_vma_folio(folio, vma);
>>
>> Can you elaborate why you think this would be required? If we would have
>> performed the split_huge_pmd_address() beforehand, we would still be
>> left with a large folio, no?
>
> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
>
> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> folio, but there are a few scenarios where we don't mlock a large folio, such
> as when it crosses a VM_LOCKed VMA boundary.
>
> - if (!folio_test_large(folio))
> + if (!folio_test_large(folio) ||
> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>
> And this check is just future-proofing and likely unnecessary. If encountering a
> PMD-mapped THP missing the mlock for some reason, we can mlock this
> THP to prevent it from being picked up during page reclaim, since it is fully
> mapped and doesn't cross the VMA boundary, IIUC.
>
> What do you think?
> I would appreciate any suggestions regarding this check ;)
Reading this patch only, I wonder if this change makes sense in the
context here.
Before this patch, we would have PTE-mapped the PMD-mapped THP before
reaching this call and skipped it due to "!folio_test_large(folio)".
After this patch, we either
a) PTE-remap the THP after this check, but retry and end-up here again,
whereby we would skip it due to "!folio_test_large(folio)".
b) Discard the PMD-mapped THP due to lazyfree directly. Can that
co-exist with mlock and what would be the problem here with mlock?
So if the check is required in this patch, we really have to understand
why. If not, we should better drop it from this patch.
At least my opinion, still struggling to understand why it would be
required (I have 0 knowledge about mlock interaction with large folios :) ).
--
Cheers,
David / dhildenb
Hi David,
On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>
> On 21.05.24 06:02, Lance Yang wrote:
> > In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> > folios, start the pagewalk first, then call split_huge_pmd_address() to
> > split the folio.
> >
> > Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> > encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> > the page walk. It’s probably necessary to mlock this THP to prevent it from
> > being picked up during page reclaim.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Suggested-by: Baolin Wang <[email protected]>
> > Signed-off-by: Lance Yang <[email protected]>
> > ---
>
> [...] again, sorry for the late review.
No worries at all, thanks for taking time to review!
>
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index ddffa30c79fb..08a93347f283 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > if (flags & TTU_SYNC)
> > pvmw.flags = PVMW_SYNC;
> >
> > - if (flags & TTU_SPLIT_HUGE_PMD)
> > - split_huge_pmd_address(vma, address, false, folio);
> > -
> > /*
> > * For THP, we have to assume the worse case ie pmd for invalidation.
> > * For hugetlb, it could be much worse if we need to do pud
> > @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > mmu_notifier_invalidate_range_start(&range);
> >
> > while (page_vma_mapped_walk(&pvmw)) {
> > - /* Unexpected PMD-mapped THP? */
> > - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> > -
> > /*
> > * If the folio is in an mlock()d vma, we must not swap it out.
> > */
> > if (!(flags & TTU_IGNORE_MLOCK) &&
> > (vma->vm_flags & VM_LOCKED)) {
> > /* Restore the mlock which got missed */
> > - if (!folio_test_large(folio))
> > + if (!folio_test_large(folio) ||
> > + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> > mlock_vma_folio(folio, vma);
>
> Can you elaborate why you think this would be required? If we would have
> performed the split_huge_pmd_address() beforehand, we would still be
> left with a large folio, no?
Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
folio, but there are a few scenarios where we don't mlock a large folio, such
as when it crosses a VM_LOCKed VMA boundary.
- if (!folio_test_large(folio))
+ if (!folio_test_large(folio) ||
+ (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
And this check is just future-proofing and likely unnecessary. If encountering a
PMD-mapped THP missing the mlock for some reason, we can mlock this
THP to prevent it from being picked up during page reclaim, since it is fully
mapped and doesn't cross the VMA boundary, IIUC.
What do you think?
I would appreciate any suggestions regarding this check ;)
[1] https://lore.kernel.org/all/[email protected]/T/#mdab40248cf3705581d8bfb64e1ebf2d9cd81c095
>
> > goto walk_done_err;
> > }
> >
> > + if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
> > + /*
> > + * We temporarily have to drop the PTL and start once
> > + * again from that now-PTE-mapped page table.
> > + */
> > + split_huge_pmd_locked(vma, range.start, pvmw.pmd, false,
> > + folio);
>
> Using range.start here is a bit weird. Wouldn't this be pvmw.address?
> [did not check]
Hmm... we may adjust range.start before the page walk, but pvmw.address
is not adjusted.
At least for now, pvmw.address seems better. Will adjust as you suggested.
>
> > + pvmw.pmd = NULL;
> > + spin_unlock(pvmw.ptl);
> > + pvmw.ptl = NULL;
>
>
> Would we want a
>
> page_vma_mapped_walk_restart() that is exactly for that purpose?
Nice, let's add page_vma_mapped_walk_restart() for that purpose :)
Thanks again for your time!
Lance
>
> > + flags &= ~TTU_SPLIT_HUGE_PMD;
> > + continue;
> > + }
> > +
> > + /* Unexpected PMD-mapped THP? */
> > + VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>
> --
> Cheers,
>
> David / dhildenb
>
On 05.06.24 16:28, David Hildenbrand wrote:
> On 05.06.24 16:20, Lance Yang wrote:
>> Hi David,
>>
>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 21.05.24 06:02, Lance Yang wrote:
>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>>> split the folio.
>>>>
>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
>>>> being picked up during page reclaim.
>>>>
>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>> Suggested-by: Baolin Wang <[email protected]>
>>>> Signed-off-by: Lance Yang <[email protected]>
>>>> ---
>>>
>>> [...] again, sorry for the late review.
>>
>> No worries at all, thanks for taking time to review!
>>
>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index ddffa30c79fb..08a93347f283 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>> if (flags & TTU_SYNC)
>>>> pvmw.flags = PVMW_SYNC;
>>>>
>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>>> - split_huge_pmd_address(vma, address, false, folio);
>>>> -
>>>> /*
>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>>> * For hugetlb, it could be much worse if we need to do pud
>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>> mmu_notifier_invalidate_range_start(&range);
>>>>
>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>> - /* Unexpected PMD-mapped THP? */
>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>>> -
>>>> /*
>>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>>> */
>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
>>>> (vma->vm_flags & VM_LOCKED)) {
>>>> /* Restore the mlock which got missed */
>>>> - if (!folio_test_large(folio))
>>>> + if (!folio_test_large(folio) ||
>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>> mlock_vma_folio(folio, vma);
>>>
>>> Can you elaborate why you think this would be required? If we would have
>>> performed the split_huge_pmd_address() beforehand, we would still be
>>> left with a large folio, no?
>>
>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
>>
>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
>> folio, but there are a few scenarios where we don't mlock a large folio, such
>> as when it crosses a VM_LOCKed VMA boundary.
>>
>> - if (!folio_test_large(folio))
>> + if (!folio_test_large(folio) ||
>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>
>> And this check is just future-proofing and likely unnecessary. If encountering a
>> PMD-mapped THP missing the mlock for some reason, we can mlock this
>> THP to prevent it from being picked up during page reclaim, since it is fully
>> mapped and doesn't cross the VMA boundary, IIUC.
>>
>> What do you think?
>> I would appreciate any suggestions regarding this check ;)
>
> Reading this patch only, I wonder if this change makes sense in the
> context here.
>
> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> reaching this call and skipped it due to "!folio_test_large(folio)".
>
> After this patch, we either
>
> a) PTE-remap the THP after this check, but retry and end-up here again,
> whereby we would skip it due to "!folio_test_large(folio)".
>
> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> co-exist with mlock and what would be the problem here with mlock?
>
>
> So if the check is required in this patch, we really have to understand
> why. If not, we should better drop it from this patch.
>
> At least my opinion, still struggling to understand why it would be
> required (I have 0 knowledge about mlock interaction with large folios :) ).
>
Looking at that series, in folio_references_one(), we do
if (!folio_test_large(folio) || !pvmw.pte) {
/* Restore the mlock which got missed */
mlock_vma_folio(folio, vma);
page_vma_mapped_walk_done(&pvmw);
pra->vm_flags |= VM_LOCKED;
return false; /* To break the loop */
}
I wonder if we want that here as well now: in case of lazyfree we
would not back off, right?
But I'm not sure if lazyfree in mlocked areas are even possible.
Adding the "!pvmw.pte" would be much clearer to me than the flag check.
--
Cheers,
David / dhildenb
Hi David,
Thanks for taking time to review!
On Wed, Jun 5, 2024 at 8:50 PM David Hildenbrand <[email protected]> wrote:
>
> On 21.05.24 06:02, Lance Yang wrote:
> > When the user no longer requires the pages, they would use
> > madvise(MADV_FREE) to mark the pages as lazy free. Subsequently, they
> > typically would not re-write to that memory again.
> >
> > During memory reclaim, if we detect that the large folio and its PMD are
> > both still marked as clean and there are no unexpected references
> > (such as GUP), so we can just discard the memory lazily, improving the
> > efficiency of memory reclamation in this case.
> >
> > On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
> > mem_cgroup_force_empty() results in the following runtimes in seconds
> > (shorter is better):
> >
> > --------------------------------------------
> > | Old | New | Change |
> > --------------------------------------------
> > | 0.683426 | 0.049197 | -92.80% |
> > --------------------------------------------
> >
> > Suggested-by: Zi Yan <[email protected]>
> > Suggested-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Lance Yang <[email protected]>
> > ---
> > include/linux/huge_mm.h | 9 +++++
> > mm/huge_memory.c | 80 +++++++++++++++++++++++++++++++++++++++++
> > mm/rmap.c | 41 ++++++++++++++-------
> > 3 files changed, 117 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 9fcb0b0b6ed1..cfd7ec2b6d0a 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -411,6 +411,8 @@ static inline bool thp_migration_supported(void)
> >
> > void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > pmd_t *pmd, bool freeze, struct folio *folio);
> > +bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> > + pmd_t *pmdp, struct folio *folio);
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > @@ -478,6 +480,13 @@ static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> > unsigned long address, pmd_t *pmd,
> > bool freeze, struct folio *folio) {}
> >
> > +static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
> > + unsigned long addr, pmd_t *pmdp,
> > + struct folio *folio)
> > +{
> > + return false;
> > +}
> > +
> > #define split_huge_pud(__vma, __pmd, __address) \
> > do { } while (0)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 425272c6c50b..4793ffa912ca 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2687,6 +2687,86 @@ static void unmap_folio(struct folio *folio)
> > try_to_unmap_flush();
> > }
> >
> > +static bool __discard_trans_pmd_locked(struct vm_area_struct *vma,
>
> Can we move towards folio terminology?
>
> __discard_anon_folio_pmd_locked() or sth like that?
Nice, it's much clearer!
>
> > + unsigned long addr, pmd_t *pmdp,
> > + struct folio *folio)
> > +{
> > + VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> > + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> > +
> > + struct mm_struct *mm = vma->vm_mm;
> > + int ref_count, map_count;
> > + pmd_t orig_pmd = *pmdp;
> > + struct page *page;
> > +
> > + if (unlikely(!pmd_present(orig_pmd) || !pmd_trans_huge(orig_pmd)))
> > + return false;
> > +
> > + page = pmd_page(orig_pmd);
> > + if (unlikely(page_folio(page) != folio))
> > + return false;
> > +
> > + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) {
> > + folio_set_swapbacked(folio);
> > + return false;
> > + }
> > +
> > + orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> > +
> > + /*
> > + * Syncing against concurrent GUP-fast:
> > + * - clear PMD; barrier; read refcount
> > + * - inc refcount; barrier; read PMD
> > + */
> > + smp_mb();
> > +
> > + ref_count = folio_ref_count(folio);
> > + map_count = folio_mapcount(folio);
> > +
> > + /*
> > + * Order reads for folio refcount and dirty flag
> > + * (see comments in __remove_mapping()).
> > + */
> > + smp_rmb();
> > +
> > + /*
> > + * If the folio or its PMD is redirtied at this point, or if there
> > + * are unexpected references, we will give up to discard this folio
> > + * and remap it.
> > + *
> > + * The only folio refs must be one from isolation plus the rmap(s).
> > + */
> > + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > + folio_set_swapbacked(folio);
> > +
> > + if (folio_test_swapbacked(folio) || ref_count != map_count + 1) {
> > + set_pmd_at(mm, addr, pmdp, orig_pmd);
> > + return false;
> > + }
> > +
> > + folio_remove_rmap_pmd(folio, page, vma);
> > + zap_deposited_table(mm, pmdp);
> > + add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_drain_local();
> > + folio_put(folio);
> > +
> > + return true;
> > +}
> > +
> > +bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> > + pmd_t *pmdp, struct folio *folio)
> > +{
> > + VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > + VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> > +
> > + if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> > + return __discard_trans_pmd_locked(vma, addr, pmdp, folio);
> > +
> > + return false;
> > +}
> > +
> > static void remap_page(struct folio *folio, unsigned long nr)
> > {
> > int i = 0;
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 08a93347f283..249d6e305bec 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1630,6 +1630,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > unsigned long pfn;
> > unsigned long hsz = 0;
> > + bool pmd_mapped = false;
> >
> > /*
> > * When racing against e.g. zap_pte_range() on another cpu,
> > @@ -1677,18 +1678,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > goto walk_done_err;
> > }
> >
> > - if (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)) {
> > - /*
> > - * We temporarily have to drop the PTL and start once
> > - * again from that now-PTE-mapped page table.
> > - */
> > - split_huge_pmd_locked(vma, range.start, pvmw.pmd, false,
> > - folio);
> > - pvmw.pmd = NULL;
> > - spin_unlock(pvmw.ptl);
> > - pvmw.ptl = NULL;
> > - flags &= ~TTU_SPLIT_HUGE_PMD;
> > - continue;
> > + if (!pvmw.pte) {
> > + pmd_mapped = true;
> > + if (unmap_huge_pmd_locked(vma, range.start, pvmw.pmd,
> > + folio))
> > + goto walk_done;
> > +
> > + if (flags & TTU_SPLIT_HUGE_PMD) {
> > + /*
> > + * We temporarily have to drop the PTL and start
> > + * once again from that now-PTE-mapped page
> > + * table.
> > + */
> > + split_huge_pmd_locked(vma, range.start,
> > + pvmw.pmd, false, folio);
> > + pvmw.pmd = NULL;
> > + spin_unlock(pvmw.ptl);
> > + pvmw.ptl = NULL;
> > + flags &= ~TTU_SPLIT_HUGE_PMD;
> > + continue;
> > + }
> > }
> >
> > /* Unexpected PMD-mapped THP? */
> > @@ -1816,7 +1825,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > */
> > if (unlikely(folio_test_swapbacked(folio) !=
> > folio_test_swapcache(folio))) {
> > - WARN_ON_ONCE(1);
> > + /*
> > + * unmap_huge_pmd_locked() will unmark a
> > + * PMD-mapped folio as lazyfree if the folio or
> > + * its PMD was redirtied.
> > + */
> > + if (!pmd_mapped)
>
> Isn't that simply "pvmw.pte" ?
>
> Also,
>
> WARN_ON_ONCE(!pmd_mapped);
Good catch! I'll adjust as you suggested.
Thanks again for the review!
Lance
>
> > + WARN_ON_ONCE(1);
> > goto walk_done_err;
> > }
> >
>
> --
> Cheers,
>
> David / dhildenb
>
On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <[email protected]> wrote:
>
> On 05.06.24 16:28, David Hildenbrand wrote:
> > On 05.06.24 16:20, Lance Yang wrote:
> >> Hi David,
> >>
> >> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 21.05.24 06:02, Lance Yang wrote:
> >>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>> split the folio.
> >>>>
> >>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>> being picked up during page reclaim.
> >>>>
> >>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>> Suggested-by: Baolin Wang <[email protected]>
> >>>> Signed-off-by: Lance Yang <[email protected]>
> >>>> ---
> >>>
> >>> [...] again, sorry for the late review.
> >>
> >> No worries at all, thanks for taking time to review!
> >>
> >>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index ddffa30c79fb..08a93347f283 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>> if (flags & TTU_SYNC)
> >>>> pvmw.flags = PVMW_SYNC;
> >>>>
> >>>> - if (flags & TTU_SPLIT_HUGE_PMD)
> >>>> - split_huge_pmd_address(vma, address, false, folio);
> >>>> -
> >>>> /*
> >>>> * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>> * For hugetlb, it could be much worse if we need to do pud
> >>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>> mmu_notifier_invalidate_range_start(&range);
> >>>>
> >>>> while (page_vma_mapped_walk(&pvmw)) {
> >>>> - /* Unexpected PMD-mapped THP? */
> >>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>> -
> >>>> /*
> >>>> * If the folio is in an mlock()d vma, we must not swap it out.
> >>>> */
> >>>> if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>> (vma->vm_flags & VM_LOCKED)) {
> >>>> /* Restore the mlock which got missed */
> >>>> - if (!folio_test_large(folio))
> >>>> + if (!folio_test_large(folio) ||
> >>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>> mlock_vma_folio(folio, vma);
> >>>
> >>> Can you elaborate why you think this would be required? If we would have
> >>> performed the split_huge_pmd_address() beforehand, we would still be
> >>> left with a large folio, no?
> >>
> >> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >>
> >> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> >> folio, but there are a few scenarios where we don't mlock a large folio, such
> >> as when it crosses a VM_LOCKed VMA boundary.
> >>
> >> - if (!folio_test_large(folio))
> >> + if (!folio_test_large(folio) ||
> >> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>
> >> And this check is just future-proofing and likely unnecessary. If encountering a
> >> PMD-mapped THP missing the mlock for some reason, we can mlock this
> >> THP to prevent it from being picked up during page reclaim, since it is fully
> >> mapped and doesn't cross the VMA boundary, IIUC.
> >>
> >> What do you think?
> >> I would appreciate any suggestions regarding this check ;)
> >
> > Reading this patch only, I wonder if this change makes sense in the
> > context here.
> >
> > Before this patch, we would have PTE-mapped the PMD-mapped THP before
> > reaching this call and skipped it due to "!folio_test_large(folio)".
> >
> > After this patch, we either
> >
> > a) PTE-remap the THP after this check, but retry and end-up here again,
> > whereby we would skip it due to "!folio_test_large(folio)".
> >
> > b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> > co-exist with mlock and what would be the problem here with mlock?
> >
> >
Thanks a lot for clarifying!
> > So if the check is required in this patch, we really have to understand
> > why. If not, we should better drop it from this patch.
> >
> > At least my opinion, still struggling to understand why it would be
> > required (I have 0 knowledge about mlock interaction with large folios :) ).
> >
>
> Looking at that series, in folio_references_one(), we do
>
> if (!folio_test_large(folio) || !pvmw.pte) {
> /* Restore the mlock which got missed */
> mlock_vma_folio(folio, vma);
> page_vma_mapped_walk_done(&pvmw);
> pra->vm_flags |= VM_LOCKED;
> return false; /* To break the loop */
> }
>
> I wonder if we want that here as well now: in case of lazyfree we
> would not back off, right?
>
> But I'm not sure if lazyfree in mlocked areas are even possible.
>
> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
Hmm... How about we drop it from this patch for now, and add it back if needed
in the future?
Thanks,
Lance
>
> --
> Cheers,
>
> David / dhildenb
>
On 05.06.24 16:57, Lance Yang wrote:
> On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 05.06.24 16:28, David Hildenbrand wrote:
>>> On 05.06.24 16:20, Lance Yang wrote:
>>>> Hi David,
>>>>
>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 21.05.24 06:02, Lance Yang wrote:
>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>>>>> split the folio.
>>>>>>
>>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
>>>>>> being picked up during page reclaim.
>>>>>>
>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>> Suggested-by: Baolin Wang <[email protected]>
>>>>>> Signed-off-by: Lance Yang <[email protected]>
>>>>>> ---
>>>>>
>>>>> [...] again, sorry for the late review.
>>>>
>>>> No worries at all, thanks for taking time to review!
>>>>
>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index ddffa30c79fb..08a93347f283 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>> if (flags & TTU_SYNC)
>>>>>> pvmw.flags = PVMW_SYNC;
>>>>>>
>>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>>>>> - split_huge_pmd_address(vma, address, false, folio);
>>>>>> -
>>>>>> /*
>>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>>>>> * For hugetlb, it could be much worse if we need to do pud
>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>> mmu_notifier_invalidate_range_start(&range);
>>>>>>
>>>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>>>> - /* Unexpected PMD-mapped THP? */
>>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>>>>> -
>>>>>> /*
>>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>>>>> */
>>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
>>>>>> (vma->vm_flags & VM_LOCKED)) {
>>>>>> /* Restore the mlock which got missed */
>>>>>> - if (!folio_test_large(folio))
>>>>>> + if (!folio_test_large(folio) ||
>>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>>> mlock_vma_folio(folio, vma);
>>>>>
>>>>> Can you elaborate why you think this would be required? If we would have
>>>>> performed the split_huge_pmd_address() beforehand, we would still be
>>>>> left with a large folio, no?
>>>>
>>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
>>>>
>>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
>>>> folio, but there are a few scenarios where we don't mlock a large folio, such
>>>> as when it crosses a VM_LOCKed VMA boundary.
>>>>
>>>> - if (!folio_test_large(folio))
>>>> + if (!folio_test_large(folio) ||
>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>
>>>> And this check is just future-proofing and likely unnecessary. If encountering a
>>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
>>>> THP to prevent it from being picked up during page reclaim, since it is fully
>>>> mapped and doesn't cross the VMA boundary, IIUC.
>>>>
>>>> What do you think?
>>>> I would appreciate any suggestions regarding this check ;)
>>>
>>> Reading this patch only, I wonder if this change makes sense in the
>>> context here.
>>>
>>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
>>> reaching this call and skipped it due to "!folio_test_large(folio)".
>>>
>>> After this patch, we either
>>>
>>> a) PTE-remap the THP after this check, but retry and end-up here again,
>>> whereby we would skip it due to "!folio_test_large(folio)".
>>>
>>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
>>> co-exist with mlock and what would be the problem here with mlock?
>>>
>>>
>
> Thanks a lot for clarifying!
>
>>> So if the check is required in this patch, we really have to understand
>>> why. If not, we should better drop it from this patch.
>>>
>>> At least my opinion, still struggling to understand why it would be
>>> required (I have 0 knowledge about mlock interaction with large folios :) ).
>>>
>>
>> Looking at that series, in folio_referenced_one(), we do
>>
>> if (!folio_test_large(folio) || !pvmw.pte) {
>> /* Restore the mlock which got missed */
>> mlock_vma_folio(folio, vma);
>> page_vma_mapped_walk_done(&pvmw);
>> pra->vm_flags |= VM_LOCKED;
>> return false; /* To break the loop */
>> }
>>
>> I wonder if we want that here as well now: in case of lazyfree we
>> would not back off, right?
>>
>> But I'm not sure if lazyfree in mlocked areas are even possible.
>>
>> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
>
> Hmm... How about we drop it from this patch for now, and add it back if needed
> in the future?
If we can rule out that MADV_FREE + mlock() stops working as expected in
the PMD-mapped case, we're good.
Can we rule that out? (especially for MADV_FREE followed by mlock())
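Just to spell out the sequence I have in mind, a minimal userspace sketch
(illustrative only: no error handling, a 2 MiB PMD size and THP placement
are simply assumed):
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	int main(void)
	{
		size_t pmd_size = 2UL << 20;		/* assume a 2 MiB PMD size */
		char *p = aligned_alloc(pmd_size, pmd_size);
		madvise(p, pmd_size, MADV_HUGEPAGE);	/* ask for a THP */
		memset(p, 1, pmd_size);			/* fault it in, ideally PMD-mapped */
		madvise(p, pmd_size, MADV_FREE);	/* mark it lazyfree (clean) */
		mlock(p, pmd_size);			/* then lock it: must it survive reclaim? */
		return 0;
	}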
--
Cheers,
David / dhildenb
On Wed, Jun 5, 2024 at 11:03 PM David Hildenbrand <[email protected]> wrote:
>
> On 05.06.24 16:57, Lance Yang wrote:
> > On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 05.06.24 16:28, David Hildenbrand wrote:
> >>> On 05.06.24 16:20, Lance Yang wrote:
> >>>> Hi David,
> >>>>
> >>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 21.05.24 06:02, Lance Yang wrote:
> >>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>>>> split the folio.
> >>>>>>
> >>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>>>> being picked up during page reclaim.
> >>>>>>
> >>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>> Suggested-by: Baolin Wang <[email protected]>
> >>>>>> Signed-off-by: Lance Yang <[email protected]>
> >>>>>> ---
> >>>>>
> >>>>> [...] again, sorry for the late review.
> >>>>
> >>>> No worries at all, thanks for taking time to review!
> >>>>
> >>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index ddffa30c79fb..08a93347f283 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>> if (flags & TTU_SYNC)
> >>>>>> pvmw.flags = PVMW_SYNC;
> >>>>>>
> >>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
> >>>>>> - split_huge_pmd_address(vma, address, false, folio);
> >>>>>> -
> >>>>>> /*
> >>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>>>> * For hugetlb, it could be much worse if we need to do pud
> >>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>> mmu_notifier_invalidate_range_start(&range);
> >>>>>>
> >>>>>> while (page_vma_mapped_walk(&pvmw)) {
> >>>>>> - /* Unexpected PMD-mapped THP? */
> >>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>>>> -
> >>>>>> /*
> >>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
> >>>>>> */
> >>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>>>> (vma->vm_flags & VM_LOCKED)) {
> >>>>>> /* Restore the mlock which got missed */
> >>>>>> - if (!folio_test_large(folio))
> >>>>>> + if (!folio_test_large(folio) ||
> >>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>> mlock_vma_folio(folio, vma);
> >>>>>
> >>>>> Can you elaborate why you think this would be required? If we would have
> >>>>> performed the split_huge_pmd_address() beforehand, we would still be
> >>>>> left with a large folio, no?
> >>>>
> >>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >>>>
> >>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> >>>> folio, but there are a few scenarios where we don't mlock a large folio, such
> >>>> as when it crosses a VM_LOCKed VMA boundary.
> >>>>
> >>>> - if (!folio_test_large(folio))
> >>>> + if (!folio_test_large(folio) ||
> >>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>
> >>>> And this check is just future-proofing and likely unnecessary. If encountering a
> >>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
> >>>> THP to prevent it from being picked up during page reclaim, since it is fully
> >>>> mapped and doesn't cross the VMA boundary, IIUC.
> >>>>
> >>>> What do you think?
> >>>> I would appreciate any suggestions regarding this check ;)
> >>>
> >>> Reading this patch only, I wonder if this change makes sense in the
> >>> context here.
> >>>
> >>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> >>> reaching this call and skipped it due to "!folio_test_large(folio)".
> >>>
> >>> After this patch, we either
> >>>
> >>> a) PTE-remap the THP after this check, but retry and end-up here again,
> >>> whereby we would skip it due to "!folio_test_large(folio)".
> >>>
> >>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> >>> co-exist with mlock and what would be the problem here with mlock?
> >>>
> >>>
> >
> > Thanks a lot for clarifying!
> >
> >>> So if the check is required in this patch, we really have to understand
> >>> why. If not, we should better drop it from this patch.
> >>>
> >>> At least my opinion, still struggling to understand why it would be
> >>> required (I have 0 knowledge about mlock interaction with large folios :) ).
> >>>
> >>
> >> Looking at that series, in folio_referenced_one(), we do
> >>
> >> if (!folio_test_large(folio) || !pvmw.pte) {
> >> /* Restore the mlock which got missed */
> >> mlock_vma_folio(folio, vma);
> >> page_vma_mapped_walk_done(&pvmw);
> >> pra->vm_flags |= VM_LOCKED;
> >> return false; /* To break the loop */
> >> }
> >>
> >> I wonder if we want that here as well now: in case of lazyfree we
> >> would not back off, right?
> >>
> >> But I'm not sure if lazyfree in mlocked areas are even possible.
> >>
> >> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
> >
> > Hmm... How about we drop it from this patch for now, and add it back if needed
> > in the future?
>
> >> If we can rule out that MADV_FREE + mlock() stops working as expected in
> the PMD-mapped case, we're good.
>
> Can we rule that out? (especially for MADV_FREE followed by mlock())
Perhaps we don't worry about that.
IIUC, without that check, MADV_FREE + mlock() still works as expected in
the PMD-mapped case: if we encounter a large folio in a VM_LOCKED VMA
range, we stop the page walk immediately.
Thanks,
Lance
>
> --
> Cheers,
>
> David / dhildenb
>
On 05.06.24 17:43, Lance Yang wrote:
> On Wed, Jun 5, 2024 at 11:03 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 05.06.24 16:57, Lance Yang wrote:
>>> On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 05.06.24 16:28, David Hildenbrand wrote:
>>>>> On 05.06.24 16:20, Lance Yang wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>>>>>>>
>>>>>>> On 21.05.24 06:02, Lance Yang wrote:
>>>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>>>>>>> split the folio.
>>>>>>>>
>>>>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
>>>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
>>>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
>>>>>>>> being picked up during page reclaim.
>>>>>>>>
>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>>> Suggested-by: Baolin Wang <[email protected]>
>>>>>>>> Signed-off-by: Lance Yang <[email protected]>
>>>>>>>> ---
>>>>>>>
>>>>>>> [...] again, sorry for the late review.
>>>>>>
>>>>>> No worries at all, thanks for taking time to review!
>>>>>>
>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index ddffa30c79fb..08a93347f283 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>> if (flags & TTU_SYNC)
>>>>>>>> pvmw.flags = PVMW_SYNC;
>>>>>>>>
>>>>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>>>>>>> - split_huge_pmd_address(vma, address, false, folio);
>>>>>>>> -
>>>>>>>> /*
>>>>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>>>>>>> * For hugetlb, it could be much worse if we need to do pud
>>>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>> mmu_notifier_invalidate_range_start(&range);
>>>>>>>>
>>>>>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>>>>>> - /* Unexpected PMD-mapped THP? */
>>>>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>>>>>>> -
>>>>>>>> /*
>>>>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>>>>>>> */
>>>>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
>>>>>>>> (vma->vm_flags & VM_LOCKED)) {
>>>>>>>> /* Restore the mlock which got missed */
>>>>>>>> - if (!folio_test_large(folio))
>>>>>>>> + if (!folio_test_large(folio) ||
>>>>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>>>>> mlock_vma_folio(folio, vma);
>>>>>>>
>>>>>>> Can you elaborate why you think this would be required? If we would have
>>>>>>> performed the split_huge_pmd_address() beforehand, we would still be
>>>>>>> left with a large folio, no?
>>>>>>
>>>>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
>>>>>>
>>>>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
>>>>>> folio, but there are a few scenarios where we don't mlock a large folio, such
>>>>>> as when it crosses a VM_LOCKed VMA boundary.
>>>>>>
>>>>>> - if (!folio_test_large(folio))
>>>>>> + if (!folio_test_large(folio) ||
>>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>>>
>>>>>> And this check is just future-proofing and likely unnecessary. If encountering a
>>>>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
>>>>>> THP to prevent it from being picked up during page reclaim, since it is fully
>>>>>> mapped and doesn't cross the VMA boundary, IIUC.
>>>>>>
>>>>>> What do you think?
>>>>>> I would appreciate any suggestions regarding this check ;)
>>>>>
>>>>> Reading this patch only, I wonder if this change makes sense in the
>>>>> context here.
>>>>>
>>>>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
>>>>> reaching this call and skipped it due to "!folio_test_large(folio)".
>>>>>
>>>>> After this patch, we either
>>>>>
>>>>> a) PTE-remap the THP after this check, but retry and end-up here again,
>>>>> whereby we would skip it due to "!folio_test_large(folio)".
>>>>>
>>>>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
>>>>> co-exist with mlock and what would be the problem here with mlock?
>>>>>
>>>>>
>>>
>>> Thanks a lot for clarifying!
>>>
>>>>> So if the check is required in this patch, we really have to understand
>>>>> why. If not, we should better drop it from this patch.
>>>>>
>>>>> At least my opinion, still struggling to understand why it would be
>>>>> required (I have 0 knowledge about mlock interaction with large folios :) ).
>>>>>
>>>>
>>>> Looking at that series, in folio_referenced_one(), we do
>>>>
>>>> if (!folio_test_large(folio) || !pvmw.pte) {
>>>> /* Restore the mlock which got missed */
>>>> mlock_vma_folio(folio, vma);
>>>> page_vma_mapped_walk_done(&pvmw);
>>>> pra->vm_flags |= VM_LOCKED;
>>>> return false; /* To break the loop */
>>>> }
>>>>
>>>> I wonder if we want that here as well now: in case of lazyfree we
>>>> would not back off, right?
>>>>
>>>> But I'm not sure if lazyfree in mlocked areas are even possible.
>>>>
>>>> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
>>>
>>> Hmm... How about we drop it from this patch for now, and add it back if needed
>>> in the future?
>>
>> If we can rule out that MADV_FREE + mlock() stops working as expected in
>> the PMD-mapped case, we're good.
>>
>> Can we rule that out? (especially for MADV_FREE followed by mlock())
>
> Perhaps we don't worry about that.
>
> IIUC, without that check, MADV_FREE + mlock() still works as expected in
> the PMD-mapped case: if we encounter a large folio in a VM_LOCKED VMA
> range, we stop the page walk immediately.
Can you point me at the code (especially considering patch #3?)
--
Cheers,
David / dhildenb
On Wed, Jun 5, 2024 at 10:28 PM David Hildenbrand <[email protected]> wrote:
>
> On 05.06.24 16:20, Lance Yang wrote:
> > Hi David,
> >
> > On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 21.05.24 06:02, Lance Yang wrote:
> >>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>> split the folio.
> >>>
> >>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>> being picked up during page reclaim.
> >>>
> >>> Suggested-by: David Hildenbrand <[email protected]>
> >>> Suggested-by: Baolin Wang <[email protected]>
> >>> Signed-off-by: Lance Yang <[email protected]>
> >>> ---
> >>
> >> [...] again, sorry for the late review.
> >
> > No worries at all, thanks for taking time to review!
> >
> >>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index ddffa30c79fb..08a93347f283 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>> if (flags & TTU_SYNC)
> >>> pvmw.flags = PVMW_SYNC;
> >>>
> >>> - if (flags & TTU_SPLIT_HUGE_PMD)
> >>> - split_huge_pmd_address(vma, address, false, folio);
> >>> -
> >>> /*
> >>> * For THP, we have to assume the worse case ie pmd for invalidation.
> >>> * For hugetlb, it could be much worse if we need to do pud
> >>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>> mmu_notifier_invalidate_range_start(&range);
> >>>
> >>> while (page_vma_mapped_walk(&pvmw)) {
> >>> - /* Unexpected PMD-mapped THP? */
> >>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>> -
> >>> /*
> >>> * If the folio is in an mlock()d vma, we must not swap it out.
> >>> */
> >>> if (!(flags & TTU_IGNORE_MLOCK) &&
> >>> (vma->vm_flags & VM_LOCKED)) {
> >>> /* Restore the mlock which got missed */
> >>> - if (!folio_test_large(folio))
> >>> + if (!folio_test_large(folio) ||
> >>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>> mlock_vma_folio(folio, vma);
> >>
> >> Can you elaborate why you think this would be required? If we would have
> >> performed the split_huge_pmd_address() beforehand, we would still be
> >> left with a large folio, no?
> >
> > Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >
> > After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> > folio, but there are a few scenarios where we don't mlock a large folio, such
> > as when it crosses a VM_LOCKed VMA boundary.
> >
> > - if (!folio_test_large(folio))
> > + if (!folio_test_large(folio) ||
> > + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >
> > And this check is just future-proofing and likely unnecessary. If encountering a
> > PMD-mapped THP missing the mlock for some reason, we can mlock this
> > THP to prevent it from being picked up during page reclaim, since it is fully
> > mapped and doesn't cross the VMA boundary, IIUC.
> >
> > What do you think?
> > I would appreciate any suggestions regarding this check ;)
>
> Reading this patch only, I wonder if this change makes sense in the
> context here.
Allow me to try explaining it again ;)
>
> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> reaching this call and skipped it due to "!folio_test_large(folio)".
Yes, there is only a PTE-mapped THP when doing the "!folio_test_large(folio)"
check, as we will first conditionally split the PMD via
split_huge_pmd_address().
>
> After this patch, we either
Things will change. We'll first do the "!folio_test_large(folio)" check, then
conditionally split the PMD via split_huge_pmd_address().
>
> a) PTE-remap the THP after this check, but retry and end-up here again,
> whereby we would skip it due to "!folio_test_large(folio)".
Hmm...
IIUC, we will skip it after this check, stop the page walk, and not
PTE-remap the THP.
>
> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> co-exist with mlock and what would be the problem here with mlock?
Before discarding a PMD-mapped THP as a whole, as patch #3 did,
we also perform the "!folio_test_large(folio)" check. If the THP coexists
with mlock, we will skip it, stop the page walk, and not discard it. IIUC.
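To make that concrete, the relevant part of the pagewalk loop after this
series looks roughly like the following (condensed from the diff above;
the walk_done_err label is just the loop's early-exit path and is shown
here only for illustration):
	while (page_vma_mapped_walk(&pvmw)) {
		/* the VM_LOCKED check now runs before any PMD split */
		if (!(flags & TTU_IGNORE_MLOCK) &&
		    (vma->vm_flags & VM_LOCKED)) {
			/* Restore the mlock which got missed */
			if (!folio_test_large(folio) ||
			    (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
				mlock_vma_folio(folio, vma);
			/* stop the walk: nothing is unmapped or discarded */
			goto walk_done_err;
		}
		/* ... discard / split / PTE unmap handling follows ... */
	}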
>
>
> So if the check is required in this patch, we really have to understand
> why. If not, we should better drop it from this patch.
I added the "!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD))" check
in this patch just to future-proof mlock for a PMD-mapped THP missing
the mlock, to prevent it from being picked up during page reclaim.
But is this really required? It seems like nothing should really be broken
without this check.
Perhaps we should drop it from this patch until we fully understand the
reason for it. Could you give me some suggestions?
Thanks,
Lance
>
> At least my opinion, still struggling to understand why it would be
> required (I have 0 knowledge about mlock interaction with large folios :) ).
>
> --
> Cheers,
>
> David / dhildenb
>
On Thu, Jun 6, 2024 at 12:16 AM David Hildenbrand <[email protected]> wrote:
>
> On 05.06.24 17:43, Lance Yang wrote:
> > On Wed, Jun 5, 2024 at 11:03 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 05.06.24 16:57, Lance Yang wrote:
> >>> On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <[email protected]> wrote:
> >>>>
> >>>> On 05.06.24 16:28, David Hildenbrand wrote:
> >>>>> On 05.06.24 16:20, Lance Yang wrote:
> >>>>>> Hi David,
> >>>>>>
> >>>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 21.05.24 06:02, Lance Yang wrote:
> >>>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>>>>>> split the folio.
> >>>>>>>>
> >>>>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>>>>>> being picked up during page reclaim.
> >>>>>>>>
> >>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>>>> Suggested-by: Baolin Wang <[email protected]>
> >>>>>>>> Signed-off-by: Lance Yang <[email protected]>
> >>>>>>>> ---
> >>>>>>>
> >>>>>>> [...] again, sorry for the late review.
> >>>>>>
> >>>>>> No worries at all, thanks for taking time to review!
> >>>>>>
> >>>>>>>
> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>>>> index ddffa30c79fb..08a93347f283 100644
> >>>>>>>> --- a/mm/rmap.c
> >>>>>>>> +++ b/mm/rmap.c
> >>>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>> if (flags & TTU_SYNC)
> >>>>>>>> pvmw.flags = PVMW_SYNC;
> >>>>>>>>
> >>>>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
> >>>>>>>> - split_huge_pmd_address(vma, address, false, folio);
> >>>>>>>> -
> >>>>>>>> /*
> >>>>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>>>>>> * For hugetlb, it could be much worse if we need to do pud
> >>>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>> mmu_notifier_invalidate_range_start(&range);
> >>>>>>>>
> >>>>>>>> while (page_vma_mapped_walk(&pvmw)) {
> >>>>>>>> - /* Unexpected PMD-mapped THP? */
> >>>>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>>>>>> -
> >>>>>>>> /*
> >>>>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
> >>>>>>>> */
> >>>>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>>>>>> (vma->vm_flags & VM_LOCKED)) {
> >>>>>>>> /* Restore the mlock which got missed */
> >>>>>>>> - if (!folio_test_large(folio))
> >>>>>>>> + if (!folio_test_large(folio) ||
> >>>>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>>> mlock_vma_folio(folio, vma);
> >>>>>>>
> >>>>>>> Can you elaborate why you think this would be required? If we would have
> >>>>>>> performed the split_huge_pmd_address() beforehand, we would still be
> >>>>>>> left with a large folio, no?
> >>>>>>
> >>>>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >>>>>>
> >>>>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> >>>>>> folio, but there are a few scenarios where we don't mlock a large folio, such
> >>>>>> as when it crosses a VM_LOCKed VMA boundary.
> >>>>>>
> >>>>>> - if (!folio_test_large(folio))
> >>>>>> + if (!folio_test_large(folio) ||
> >>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>
> >>>>>> And this check is just future-proofing and likely unnecessary. If encountering a
> >>>>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
> >>>>>> THP to prevent it from being picked up during page reclaim, since it is fully
> >>>>>> mapped and doesn't cross the VMA boundary, IIUC.
> >>>>>>
> >>>>>> What do you think?
> >>>>>> I would appreciate any suggestions regarding this check ;)
> >>>>>
> >>>>> Reading this patch only, I wonder if this change makes sense in the
> >>>>> context here.
> >>>>>
> >>>>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> >>>>> reaching this call and skipped it due to "!folio_test_large(folio)".
> >>>>>
> >>>>> After this patch, we either
> >>>>>
> >>>>> a) PTE-remap the THP after this check, but retry and end-up here again,
> >>>>> whereby we would skip it due to "!folio_test_large(folio)".
> >>>>>
> >>>>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> >>>>> co-exist with mlock and what would be the problem here with mlock?
> >>>>>
> >>>>>
> >>>
> >>> Thanks a lot for clarifying!
> >>>
> >>>>> So if the check is required in this patch, we really have to understand
> >>>>> why. If not, we should better drop it from this patch.
> >>>>>
> >>>>> At least my opinion, still struggling to understand why it would be
> >>>>> required (I have 0 knowledge about mlock interaction with large folios :) ).
> >>>>>
> >>>>
> >>>> Looking at that series, in folio_referenced_one(), we do
> >>>>
> >>>> if (!folio_test_large(folio) || !pvmw.pte) {
> >>>> /* Restore the mlock which got missed */
> >>>> mlock_vma_folio(folio, vma);
> >>>> page_vma_mapped_walk_done(&pvmw);
> >>>> pra->vm_flags |= VM_LOCKED;
> >>>> return false; /* To break the loop */
> >>>> }
> >>>>
> >>>> I wonder if we want that here as well now: in case of lazyfree we
> >>>> would not back off, right?
> >>>>
> >>>> But I'm not sure if lazyfree in mlocked areas are even possible.
> >>>>
> >>>> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
> >>>
> >>> Hmm... How about we drop it from this patch for now, and add it back if needed
> >>> in the future?
> >>
> >> If we can rule out that MADV_FREE + mlock() stops working as expected in
> >> the PMD-mapped case, we're good.
> >>
> >> Can we rule that out? (especially for MADV_FREE followed by mlock())
> >
> > Perhaps we don't worry about that.
> >
> > IIUC, without that check, MADV_FREE + mlock() still works as expected in
> > the PMD-mapped case: if we encounter a large folio in a VM_LOCKED VMA
> > range, we stop the page walk immediately.
>
>
> Can you point me at the code (especially considering patch #3?)
Yep, please see my other mail ;)
Thanks,
Lance
>
> --
> Cheers,
>
> David / dhildenb
>
On 06.06.24 05:55, Lance Yang wrote:
> On Wed, Jun 5, 2024 at 10:28 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 05.06.24 16:20, Lance Yang wrote:
>>> Hi David,
>>>
>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 21.05.24 06:02, Lance Yang wrote:
>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>>>> split the folio.
>>>>>
>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
>>>>> being picked up during page reclaim.
>>>>>
>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>> Suggested-by: Baolin Wang <[email protected]>
>>>>> Signed-off-by: Lance Yang <[email protected]>
>>>>> ---
>>>>
>>>> [...] again, sorry for the late review.
>>>
>>> No worries at all, thanks for taking time to review!
>>>
>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index ddffa30c79fb..08a93347f283 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>> if (flags & TTU_SYNC)
>>>>> pvmw.flags = PVMW_SYNC;
>>>>>
>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>>>> - split_huge_pmd_address(vma, address, false, folio);
>>>>> -
>>>>> /*
>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>>>> * For hugetlb, it could be much worse if we need to do pud
>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>> mmu_notifier_invalidate_range_start(&range);
>>>>>
>>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>>> - /* Unexpected PMD-mapped THP? */
>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>>>> -
>>>>> /*
>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>>>> */
>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
>>>>> (vma->vm_flags & VM_LOCKED)) {
>>>>> /* Restore the mlock which got missed */
>>>>> - if (!folio_test_large(folio))
>>>>> + if (!folio_test_large(folio) ||
>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>> mlock_vma_folio(folio, vma);
>>>>
>>>> Can you elaborate why you think this would be required? If we would have
>>>> performed the split_huge_pmd_address() beforehand, we would still be
>>>> left with a large folio, no?
>>>
>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
>>>
>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
>>> folio, but there are a few scenarios where we don't mlock a large folio, such
>>> as when it crosses a VM_LOCKed VMA boundary.
>>>
>>> - if (!folio_test_large(folio))
>>> + if (!folio_test_large(folio) ||
>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>
>>> And this check is just future-proofing and likely unnecessary. If encountering a
>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
>>> THP to prevent it from being picked up during page reclaim, since it is fully
>>> mapped and doesn't cross the VMA boundary, IIUC.
>>>
>>> What do you think?
>>> I would appreciate any suggestions regarding this check ;)
>>
>> Reading this patch only, I wonder if this change makes sense in the
>> context here.
>
> Allow me to try explaining it again ;)
>
>>
>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
>> reaching this call and skipped it due to "!folio_test_large(folio)".
>
> Yes, there is only a PTE-mapped THP when doing the "!folio_test_large(folio)"
> check, as we will first conditionally split the PMD via
> split_huge_pmd_address().
>
>>
>> After this patch, we either
>
> Things will change. We'll first do the "!folio_test_large(folio)" check, then
> conditionally split the PMD via split_huge_pmd_address().
>
>>
>> a) PTE-remap the THP after this check, but retry and end-up here again,
>> whereby we would skip it due to "!folio_test_large(folio)".
>
> Hmm...
>
> IIUC, we will skip it after this check, stop the page walk, and not
> PTE-remap the THP.
>
>>
>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
>> co-exist with mlock and what would be the problem here with mlock?
>
> Before discarding a PMD-mapped THP as a whole, as patch #3 did,
> we also perform the "!folio_test_large(folio)" check. If the THP coexists
> with mlock, we will skip it, stop the page walk, and not discard it. IIUC.
But "!folio_test_large(folio)" would *skip* the THP and not consider it
regarding mlock.
I'm probably missing something and should try current mm/mm-unstable
with MADV_FREE + mlock() on a PMD-mapped THP.
>
>>
>>
>> So if the check is required in this patch, we really have to understand
>> why. If not, we should better drop it from this patch.
>
> I added the "!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD))" check
> in this patch just to future-proof mlock for a PMD-mapped THP missing
> the mlock, to prevent it from being picked up during page reclaim.
>
> But is this really required? It seems like nothing should really be broken
> without this check.
>
> Perhaps we should drop it from this patch until we fully understand the
> reason for it. Could you give me some suggestions?
We should drop it from this patch, agreed. We might need it
("!pvmw.pte") in patch #3, but I still have to understand if there
really would be a problem.
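For reference, my rough picture of where "!pvmw.pte" would matter in
patch #3 (only a sketch from my reading of the series, not the exact
code; the helper names and the walk_done label are taken from the
series, the argument lists are approximate):
	/* inside the pagewalk loop, after the VM_LOCKED handling above */
	if (!pvmw.pte) {
		/* lazyfree PMD-mapped THP: try to discard it as a whole */
		if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
			goto walk_done;
		/* otherwise fall back to splitting the PMD under the PTL */
		if (flags & TTU_SPLIT_HUGE_PMD)
			split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
					      false, folio);
	}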
--
Cheers,
David / dhildenb
On 06.06.24 10:01, David Hildenbrand wrote:
> On 06.06.24 05:55, Lance Yang wrote:
>> On Wed, Jun 5, 2024 at 10:28 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 05.06.24 16:20, Lance Yang wrote:
>>>> Hi David,
>>>>
>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 21.05.24 06:02, Lance Yang wrote:
>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>>>>> split the folio.
>>>>>>
>>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
>>>>>> being picked up during page reclaim.
>>>>>>
>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>> Suggested-by: Baolin Wang <[email protected]>
>>>>>> Signed-off-by: Lance Yang <[email protected]>
>>>>>> ---
>>>>>
>>>>> [...] again, sorry for the late review.
>>>>
>>>> No worries at all, thanks for taking time to review!
>>>>
>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index ddffa30c79fb..08a93347f283 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>> if (flags & TTU_SYNC)
>>>>>> pvmw.flags = PVMW_SYNC;
>>>>>>
>>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>>>>> - split_huge_pmd_address(vma, address, false, folio);
>>>>>> -
>>>>>> /*
>>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>>>>> * For hugetlb, it could be much worse if we need to do pud
>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>> mmu_notifier_invalidate_range_start(&range);
>>>>>>
>>>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>>>> - /* Unexpected PMD-mapped THP? */
>>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>>>>> -
>>>>>> /*
>>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>>>>> */
>>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
>>>>>> (vma->vm_flags & VM_LOCKED)) {
>>>>>> /* Restore the mlock which got missed */
>>>>>> - if (!folio_test_large(folio))
>>>>>> + if (!folio_test_large(folio) ||
>>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>>> mlock_vma_folio(folio, vma);
>>>>>
>>>>> Can you elaborate why you think this would be required? If we would have
>>>>> performed the split_huge_pmd_address() beforehand, we would still be
>>>>> left with a large folio, no?
>>>>
>>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
>>>>
>>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
>>>> folio, but there are a few scenarios where we don't mlock a large folio, such
>>>> as when it crosses a VM_LOCKed VMA boundary.
>>>>
>>>> - if (!folio_test_large(folio))
>>>> + if (!folio_test_large(folio) ||
>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>
>>>> And this check is just future-proofing and likely unnecessary. If encountering a
>>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
>>>> THP to prevent it from being picked up during page reclaim, since it is fully
>>>> mapped and doesn't cross the VMA boundary, IIUC.
>>>>
>>>> What do you think?
>>>> I would appreciate any suggestions regarding this check ;)
>>>
>>> Reading this patch only, I wonder if this change makes sense in the
>>> context here.
>>
>> Allow me to try explaining it again ;)
>>
>>>
>>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
>>> reaching this call and skipped it due to "!folio_test_large(folio)".
>>
>> Yes, there is only a PTE-mapped THP when doing the "!folio_test_large(folio)"
>> check, as we will first conditionally split the PMD via
>> split_huge_pmd_address().
>>
>>>
>>> After this patch, we either
>>
>> Things will change. We'll first do the "!folio_test_large(folio)" check, then
>> conditionally split the PMD via split_huge_pmd_address().
>>
>>>
>>> a) PTE-remap the THP after this check, but retry and end-up here again,
>>> whereby we would skip it due to "!folio_test_large(folio)".
>>
>> Hmm...
>>
>> IIUC, we will skip it after this check, stop the page walk, and not
>> PTE-remap the THP.
>>
>>>
>>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
>>> co-exist with mlock and what would be the problem here with mlock?
>>
>> Before discarding a PMD-mapped THP as a whole, as patch #3 did,
>> we also perform the "!folio_test_large(folio)" check. If the THP coexists
>> with mlock, we will skip it, stop the page walk, and not discard it. IIUC.
>
> But "!folio_test_large(folio)" would *skip* the THP and not consider it
> regarding mlock.
>
> I'm probably missing something
I'm stupid, I missed that we still do the "goto walk_done_err;", only
that we don't do the mlock_vma_folio(folio, vma);
Yes, let's drop it for now! :)
--
Cheers,
David / dhildenb
On 06.06.24 11:38, Lance Yang wrote:
> On Thu, Jun 6, 2024 at 4:06 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 06.06.24 10:01, David Hildenbrand wrote:
>>> On 06.06.24 05:55, Lance Yang wrote:
>>>> On Wed, Jun 5, 2024 at 10:28 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 05.06.24 16:20, Lance Yang wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
>>>>>>>
>>>>>>> On 21.05.24 06:02, Lance Yang wrote:
>>>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
>>>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
>>>>>>>> split the folio.
>>>>>>>>
>>>>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
>>>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
>>>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
>>>>>>>> being picked up during page reclaim.
>>>>>>>>
>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>>> Suggested-by: Baolin Wang <[email protected]>
>>>>>>>> Signed-off-by: Lance Yang <[email protected]>
>>>>>>>> ---
>>>>>>>
>>>>>>> [...] again, sorry for the late review.
>>>>>>
>>>>>> No worries at all, thanks for taking time to review!
>>>>>>
>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index ddffa30c79fb..08a93347f283 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>> if (flags & TTU_SYNC)
>>>>>>>> pvmw.flags = PVMW_SYNC;
>>>>>>>>
>>>>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
>>>>>>>> - split_huge_pmd_address(vma, address, false, folio);
>>>>>>>> -
>>>>>>>> /*
>>>>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
>>>>>>>> * For hugetlb, it could be much worse if we need to do pud
>>>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>> mmu_notifier_invalidate_range_start(&range);
>>>>>>>>
>>>>>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>>>>>> - /* Unexpected PMD-mapped THP? */
>>>>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>>>>>>>> -
>>>>>>>> /*
>>>>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
>>>>>>>> */
>>>>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
>>>>>>>> (vma->vm_flags & VM_LOCKED)) {
>>>>>>>> /* Restore the mlock which got missed */
>>>>>>>> - if (!folio_test_large(folio))
>>>>>>>> + if (!folio_test_large(folio) ||
>>>>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
>>>>>>>> mlock_vma_folio(folio, vma);
>
> Should we still keep the '!pvmw.pte' here? Something like:
>
> 	if (!folio_test_large(folio) || !pvmw.pte)
> 		mlock_vma_folio(folio, vma);
I was wondering the same the whole time ...
>
> We can mlock the THP to prevent it from being picked up during page reclaim.
>
> David, I’d like to hear your thoughts on this ;)
but I think there is no need to for now, in the context of your patchset. :)
--
Cheers,
David / dhildenb
On Thu, Jun 6, 2024 at 4:06 PM David Hildenbrand <[email protected]> wrote:
>
> On 06.06.24 10:01, David Hildenbrand wrote:
> > On 06.06.24 05:55, Lance Yang wrote:
> >> On Wed, Jun 5, 2024 at 10:28 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 05.06.24 16:20, Lance Yang wrote:
> >>>> Hi David,
> >>>>
> >>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 21.05.24 06:02, Lance Yang wrote:
> >>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>>>> split the folio.
> >>>>>>
> >>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>>>> being picked up during page reclaim.
> >>>>>>
> >>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>> Suggested-by: Baolin Wang <[email protected]>
> >>>>>> Signed-off-by: Lance Yang <[email protected]>
> >>>>>> ---
> >>>>>
> >>>>> [...] again, sorry for the late review.
> >>>>
> >>>> No worries at all, thanks for taking time to review!
> >>>>
> >>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index ddffa30c79fb..08a93347f283 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>> if (flags & TTU_SYNC)
> >>>>>> pvmw.flags = PVMW_SYNC;
> >>>>>>
> >>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
> >>>>>> - split_huge_pmd_address(vma, address, false, folio);
> >>>>>> -
> >>>>>> /*
> >>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>>>> * For hugetlb, it could be much worse if we need to do pud
> >>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>> mmu_notifier_invalidate_range_start(&range);
> >>>>>>
> >>>>>> while (page_vma_mapped_walk(&pvmw)) {
> >>>>>> - /* Unexpected PMD-mapped THP? */
> >>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>>>> -
> >>>>>> /*
> >>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
> >>>>>> */
> >>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>>>> (vma->vm_flags & VM_LOCKED)) {
> >>>>>> /* Restore the mlock which got missed */
> >>>>>> - if (!folio_test_large(folio))
> >>>>>> + if (!folio_test_large(folio) ||
> >>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>> mlock_vma_folio(folio, vma);
Should we still keep the '!pvmw.pte' here? Something like:
	if (!folio_test_large(folio) || !pvmw.pte)
		mlock_vma_folio(folio, vma);
We can mlock the THP to prevent it from being picked up during page reclaim.
David, I’d like to hear your thoughts on this ;)
Thanks,
Lance
> >>>>>
> >>>>> Can you elaborate why you think this would be required? If we would have
> >>>>> performed the split_huge_pmd_address() beforehand, we would still be
> >>>>> left with a large folio, no?
> >>>>
> >>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >>>>
> >>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> >>>> folio, but there are a few scenarios where we don't mlock a large folio, such
> >>>> as when it crosses a VM_LOCKed VMA boundary.
> >>>>
> >>>> - if (!folio_test_large(folio))
> >>>> + if (!folio_test_large(folio) ||
> >>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>
> >>>> And this check is just future-proofing and likely unnecessary. If encountering a
> >>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
> >>>> THP to prevent it from being picked up during page reclaim, since it is fully
> >>>> mapped and doesn't cross the VMA boundary, IIUC.
> >>>>
> >>>> What do you think?
> >>>> I would appreciate any suggestions regarding this check ;)
> >>>
> >>> Reading this patch only, I wonder if this change makes sense in the
> >>> context here.
> >>
> >> Allow me to try explaining it again ;)
> >>
> >>>
> >>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> >>> reaching this call and skipped it due to "!folio_test_large(folio)".
> >>
> >> Yes, there is only a PTE-mapped THP when doing the "!folio_test_large(folio)"
> >> check, as we will first conditionally split the PMD via
> >> split_huge_pmd_address().
> >>
> >>>
> >>> After this patch, we either
> >>
> >> Things will change. We'll first do the "!folio_test_large(folio)" check, then
> >> conditionally split the PMD via split_huge_pmd_address().
> >>
> >>>
> >>> a) PTE-remap the THP after this check, but retry and end-up here again,
> >>> whereby we would skip it due to "!folio_test_large(folio)".
> >>
> >> Hmm...
> >>
> >> IIUC, we will skip it after this check, stop the page walk, and not
> >> PTE-remap the THP.
> >>
> >>>
> >>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> >>> co-exist with mlock and what would be the problem here with mlock?
> >>
> >> Before discarding a PMD-mapped THP as a whole, as patch #3 did,
> >> we also perform the "!folio_test_large(folio)" check. If the THP coexists
> >> with mlock, we will skip it, stop the page walk, and not discard it. IIUC.
> >
> > But "!folio_test_large(folio)" would *skip* the THP and not consider it
> > regarding mlock.
> >
> > I'm probably missing something
>
> I'm stupid, I missed that we still do the "goto walk_done_err;", only
> that we don't do the mlock_vma_folio(folio, vma);
>
> Yes, let's drop it for now! :)
>
> --
> Cheers,
>
> David / dhildenb
>
On Thu, Jun 6, 2024 at 5:41 PM David Hildenbrand <[email protected]> wrote:
>
> On 06.06.24 11:38, Lance Yang wrote:
> > On Thu, Jun 6, 2024 at 4:06 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 06.06.24 10:01, David Hildenbrand wrote:
> >>> On 06.06.24 05:55, Lance Yang wrote:
> >>>> On Wed, Jun 5, 2024 at 10:28 PM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 05.06.24 16:20, Lance Yang wrote:
> >>>>>> Hi David,
> >>>>>>
> >>>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 21.05.24 06:02, Lance Yang wrote:
> >>>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>>>>>> split the folio.
> >>>>>>>>
> >>>>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>>>>>> being picked up during page reclaim.
> >>>>>>>>
> >>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>>>> Suggested-by: Baolin Wang <[email protected]>
> >>>>>>>> Signed-off-by: Lance Yang <[email protected]>
> >>>>>>>> ---
> >>>>>>>
> >>>>>>> [...] again, sorry for the late review.
> >>>>>>
> >>>>>> No worries at all, thanks for taking time to review!
> >>>>>>
> >>>>>>>
> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>>>> index ddffa30c79fb..08a93347f283 100644
> >>>>>>>> --- a/mm/rmap.c
> >>>>>>>> +++ b/mm/rmap.c
> >>>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>> if (flags & TTU_SYNC)
> >>>>>>>> pvmw.flags = PVMW_SYNC;
> >>>>>>>>
> >>>>>>>> - if (flags & TTU_SPLIT_HUGE_PMD)
> >>>>>>>> - split_huge_pmd_address(vma, address, false, folio);
> >>>>>>>> -
> >>>>>>>> /*
> >>>>>>>> * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>>>>>> * For hugetlb, it could be much worse if we need to do pud
> >>>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>> mmu_notifier_invalidate_range_start(&range);
> >>>>>>>>
> >>>>>>>> while (page_vma_mapped_walk(&pvmw)) {
> >>>>>>>> - /* Unexpected PMD-mapped THP? */
> >>>>>>>> - VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>>>>>> -
> >>>>>>>> /*
> >>>>>>>> * If the folio is in an mlock()d vma, we must not swap it out.
> >>>>>>>> */
> >>>>>>>> if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>>>>>> (vma->vm_flags & VM_LOCKED)) {
> >>>>>>>> /* Restore the mlock which got missed */
> >>>>>>>> - if (!folio_test_large(folio))
> >>>>>>>> + if (!folio_test_large(folio) ||
> >>>>>>>> + (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>>> mlock_vma_folio(folio, vma);
> >
> > Should we still keep the '!pvmw.pte' here? Something like:
> >
> > if (!folio_test_large(folio) || !pvmw.pte)
> > mlock_vma_folio(folio, vma);
>
> I was wondering the same the whole time ...
>
> >
> > We can mlock the THP to prevent it from being picked up during page reclaim.
> >
> > David, I’d like to hear your thoughts on this ;)
>
> but I think there is no need to for now, in the context of your patchset. :)
Agreed. Let's drop it for now :)
Thanks a lot for your thoughts!
Lance
>
> --
> Cheers,
>
> David / dhildenb
>