2024-06-13 08:41:43

by Qi Zheng

Subject: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi all,

This series aims to asynchronously scan and free empty user PTE pages.

1. Background
=============

We often find huge user PTE memory usage on our servers, such as the following:

VIRT: 55t
RES: 590g
VmPTE: 110g

The root cause is that these processes use high-performance memory
allocators (such as jemalloc, tcmalloc, etc.). These memory allocators use
madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither
MADV_DONTNEED nor MADV_FREE releases page table memory, which may cause
huge page table memory usage.

This issue has been discussed on LSFMM 2022 (led by David Hildenbrand):

topic link: https://lore.kernel.org/linux-mm/[email protected]/
youtube link: https://www.youtube.com/watch?v=naO_BRhcU68

In the past, I have tried to introduce refcount for PTE pages to solve this
problem, but these methods [1][2][3] introduced too much complexity.

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/lkml/[email protected]/

2. Infrastructure
=================

Later, in order to free retracted page tables, Hugh Dickins added a lot of
PTE-related infrastructure [4][5][6]:

- allow pte_offset_map_lock() etc to fail
- allow PTE pages to be removed without holding the mmap or rmap locks
  (see collapse_pte_mapped_thp() and retract_page_tables())
- allow PTE pages to be freed via RCU (see pte_free_defer())
- etc

These are all beneficial to freeing empty PTE pages.

[4]. https://lore.kernel.org/all/[email protected]/
[5]. https://lore.kernel.org/all/[email protected]/
[6]. https://lore.kernel.org/all/[email protected]/
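As an example of the first point, callers of pte_offset_map_lock() now have to
handle failure; a generic caller pattern (just a sketch, not any specific call
site) looks like this:

        pte_t *pte;
        spinlock_t *ptl;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte) {
                /*
                 * The PTE page was removed or replaced under us (it is only
                 * kept alive by RCU while mapped), so just bail out or
                 * retry; the new *pmd will be seen on the next attempt.
                 */
                return 0;
        }
        /* ... operate on the PTEs under ptl ... */
        pte_unmap_unlock(pte, ptl);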

3. Implementation
=================

Empty user PTE pages don't actually need to be freed immediately, nor do
they all need to be freed at once.

Therefore, in this patchset, we register a task_work for user tasks to
asynchronously scan and free empty PTE pages when they return to user space.
(The scanning time interval and address space size can be adjusted.)

When scanning, we can filter out some unsuitable vmas:

- VM_HUGETLB vma
- VM_UFFD_WP vma
- etc

PTE pages that span multiple vmas are also skipped.

For locking:

- use the mmap read lock to traverse the vma tree and page tables
- use the pmd lock for clearing the pmd entry
- use the pte lock for checking that the PTE page is empty, and release it
  after clearing the pmd entry; the changed pmd can then be caught by
  pte_offset_map_lock() etc. once that pte lock is taken. Thanks to this, we
  don't need to hold the rmap-related locks.
- users of pte_offset_map_lock() etc. all rely on the RCU read lock to keep
  the PTE page stable, so use pte_free_defer() to free PTE pages.

For the THP paths that may also free PTE pages, we need to recheck whether the
pmd entry is still valid after taking the pmd lock or pte lock. A condensed
sketch of the per-pmd sequence used by this series is shown below.
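Condensed, the per-pmd sequence implemented in patch 3 looks roughly like this
(error handling, unlocking and mmu notifiers omitted):

        start_pte = pte_offset_map_nolock(mm, pmd, &pmdval, haddr, &ptl);
        pml = pmd_lock(mm, pmd);
        if (ptl != pml)
                spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

        /* recheck *pmd now that the locks are held */
        if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
                goto out_ptl;

        /* only an entirely empty PTE page may be freed */
        for (i = 0; i < PTRS_PER_PTE; i++)
                if (!pte_none(ptep_get(start_pte + i)))
                        goto out_ptl;

        pmd_clear(pmd);
        flush_tlb_range(vma, haddr, haddr + PMD_SIZE);
        /* ... drop ptl and pml ... */

        mm_dec_nr_ptes(mm);
        pte_free_defer(mm, pmd_pgtable(pmdval));        /* RCU-deferred free */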

4. TODO
=======

Some applications may be concerned about the overhead of scanning and rebuilding
page tables, so the following features are considered for implementation in the
future:

- add per-process switch (via prctl)
- add a madvise option (like THP)
- add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via procfs file)

Perhaps we can also add a refcount to PTE pages in the future, which would
help improve the scanning speed.

This series is based on next-20240612.

Comments and suggestions are welcome!

Thanks,
Qi

Qi Zheng (3):
mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE
mm: pgtable: make pte_offset_map_nolock() return pmdval
mm: free empty user PTE pages

Documentation/mm/split_page_table_lock.rst | 3 +-
arch/arm/mm/fault-armv.c | 2 +-
arch/powerpc/mm/pgtable-frag.c | 2 -
arch/powerpc/mm/pgtable.c | 2 +-
arch/s390/mm/pgalloc.c | 2 -
arch/sparc/mm/init_64.c | 2 +-
include/linux/mm.h | 4 +-
include/linux/mm_types.h | 4 +
include/linux/pgtable.h | 14 ++
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 2 +
mm/Makefile | 2 +-
mm/filemap.c | 2 +-
mm/freept.c | 180 +++++++++++++++++++++
mm/khugepaged.c | 20 ++-
mm/memory.c | 4 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 2 +-
mm/pgtable-generic.c | 23 +--
mm/userfaultfd.c | 4 +-
mm/vmscan.c | 2 +-
22 files changed, 249 insertions(+), 31 deletions(-)
create mode 100644 mm/freept.c

--
2.20.1



2024-06-13 08:42:13

by Qi Zheng

Subject: [RFC PATCH 2/3] mm: pgtable: make pte_offset_map_nolock() return pmdval

Make pte_offset_map_nolock() return the pmd value so that callers can recheck
*pmd once the lock is taken. This is a preparation for freeing empty PTE
pages; no functional changes are expected.
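
A caller that wants the recheck can then do something like the following
(sketch only; the real user is added in a later patch of this series):

        pte = pte_offset_map_nolock(mm, pmd, &pmdval, addr, &ptl);
        if (!pte)
                return 0;

        spin_lock(ptl);
        if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
                /* the PTE page was freed or replaced after the lookup */
                pte_unmap_unlock(pte, ptl);
                return 0;
        }
        /* ... the PTE page is now stable under ptl ... */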

Signed-off-by: Qi Zheng <[email protected]>
---
Documentation/mm/split_page_table_lock.rst | 3 ++-
arch/arm/mm/fault-armv.c | 2 +-
arch/powerpc/mm/pgtable.c | 2 +-
include/linux/mm.h | 4 ++--
mm/filemap.c | 2 +-
mm/khugepaged.c | 4 ++--
mm/memory.c | 4 ++--
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 2 +-
mm/pgtable-generic.c | 21 ++++++++++++---------
mm/userfaultfd.c | 4 ++--
mm/vmscan.c | 2 +-
12 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c0..e6a47d57531c 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -18,7 +18,8 @@ There are helpers to lock/unlock a table and other accessor functions:
pointer to its PTE table lock, or returns NULL if no PTE table;
- pte_offset_map_nolock()
maps PTE, returns pointer to PTE with pointer to its PTE table
- lock (not taken), or returns NULL if no PTE table;
+ lock (not taken) and the value of its pmd entry, or returns NULL
+ if no PTE table;
- pte_offset_map()
maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
- pte_unmap()
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 2286c2ea60ec..3e4ed99b9330 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,7 +117,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
* must use the nested version. This also means we need to
* open-code the spin-locking.
*/
- pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+ pte = pte_offset_map_nolock(vma->vm_mm, pmd, NULL, address, &ptl);
if (!pte)
return 0;

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9e7ba9c3851f..ab0250f1b226 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -350,7 +350,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
*/
if (pmd_none(*pmd))
return;
- pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+ pte = pte_offset_map_nolock(mm, pmd, NULL, addr, &ptl);
BUG_ON(!pte);
assert_spin_locked(ptl);
pte_unmap(pte);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 106bb0310352..d5550c3dc550 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2969,8 +2969,8 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
return pte;
}

-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
+ unsigned long addr, spinlock_t **ptlp);

#define pte_unmap_unlock(pte, ptl) do { \
spin_unlock(ptl); \
diff --git a/mm/filemap.c b/mm/filemap.c
index 37061aafd191..7eb2e3599966 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3231,7 +3231,7 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
return 0;

- ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
+ ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, NULL, vmf->address,
&vmf->ptl);
if (unlikely(!ptep))
return VM_FAULT_NOPAGE;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 774a97e6e2da..2a8703ee876c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -992,7 +992,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
};

if (!pte++) {
- pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+ pte = pte_offset_map_nolock(mm, pmd, NULL, address, &ptl);
if (!pte) {
mmap_read_unlock(mm);
result = SCAN_PMD_NULL;
@@ -1581,7 +1581,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
pml = pmd_lock(mm, pmd);

- start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+ start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
if (!start_pte) /* mmap_lock + page lock should prevent this */
goto abort;
if (!pml)
diff --git a/mm/memory.c b/mm/memory.c
index 1bd2ffb76ec2..694c0989a1d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1108,7 +1108,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
ret = -ENOMEM;
goto out;
}
- src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+ src_pte = pte_offset_map_nolock(src_mm, src_pmd, NULL, addr, &src_ptl);
if (!src_pte) {
pte_unmap_unlock(dst_pte, dst_ptl);
/* ret == 0 */
@@ -5486,7 +5486,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* it into a huge pmd: just retry later if so.
*/
vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ NULL, vmf->address, &vmf->ptl);
if (unlikely(!vmf->pte))
return 0;
vmf->orig_pte = ptep_get_lockless(vmf->pte);
diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc640..f672d0218a6f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -175,7 +175,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
err = -EAGAIN;
goto out;
}
- new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+ new_pte = pte_offset_map_nolock(mm, new_pmd, NULL, new_addr, &new_ptl);
if (!new_pte) {
pte_unmap_unlock(old_pte, old_ptl);
err = -EAGAIN;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa208..507701b7bcc1 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -33,7 +33,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
* Though, in most cases, page lock already protects this.
*/
pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
- pvmw->address, ptlp);
+ NULL, pvmw->address, ptlp);
if (!pvmw->pte)
return false;

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 197937495a0a..b8b28715cb4f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,7 +305,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
return NULL;
}

-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
unsigned long addr, spinlock_t **ptlp)
{
pmd_t pmdval;
@@ -314,6 +314,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
pte = __pte_offset_map(pmd, addr, &pmdval);
if (likely(pte))
*ptlp = pte_lockptr(mm, &pmdval);
+ if (pmdvalp)
+ *pmdvalp = pmdval;
return pte;
}

@@ -347,14 +349,15 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
* and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s
* afterwards.
*
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it. This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns. In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ * pte_offset_map_nolock(mm, pmd, pmdvalp, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * In principle, the caller should recheck *pmd once the lock is taken; but in
+ * most cases, either the mmap_lock for write, or pte_same() check on contents,
+ * is enough.
*
* Note that free_pgtables(), used after unmapping detached vmas, or when
* exiting the whole mm, does not take page table lock before freeing a page
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5e7f2801698a..9c77271d499c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1143,7 +1143,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
src_addr, src_addr + PAGE_SIZE);
mmu_notifier_invalidate_range_start(&range);
retry:
- dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+ dst_pte = pte_offset_map_nolock(mm, dst_pmd, NULL, dst_addr, &dst_ptl);

/* Retry if a huge pmd materialized from under us */
if (unlikely(!dst_pte)) {
@@ -1151,7 +1151,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto out;
}

- src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+ src_pte = pte_offset_map_nolock(mm, src_pmd, NULL, src_addr, &src_ptl);

/*
* We held the mmap_lock for reading so MADV_DONTNEED
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c0429fd6c573..56727caa907b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3374,7 +3374,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
DEFINE_MAX_SEQ(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);

- pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ pte = pte_offset_map_nolock(args->mm, pmd, NULL, start & PMD_MASK, &ptl);
if (!pte)
return false;
if (!spin_trylock(ptl)) {
--
2.20.1


2024-06-13 09:00:14

by Qi Zheng

Subject: [RFC PATCH 1/3] mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE

In order to reuse pte_free_defer() in the subsequent work of freeing empty
user PTE pages, move it out of the CONFIG_TRANSPARENT_HUGEPAGE range.

No functional change intended.
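
For context, the new (non-THP) caller added later in this series uses it
roughly as follows:

        /* after clearing the pmd entry and flushing the TLB */
        mm_dec_nr_ptes(mm);
        pte_free_defer(mm, pmd_pgtable(pmdval));        /* freed after an RCU grace period */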

Signed-off-by: Qi Zheng <[email protected]>
---
arch/powerpc/mm/pgtable-frag.c | 2 --
arch/s390/mm/pgalloc.c | 2 --
arch/sparc/mm/init_64.c | 2 +-
mm/pgtable-generic.c | 2 +-
4 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 8c31802f97e8..46d8f4bec85e 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -133,7 +133,6 @@ void pte_fragment_free(unsigned long *table, int kernel)
}
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
struct page *page;
@@ -142,4 +141,3 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
SetPageActive(page);
pte_fragment_free((unsigned long *)pgtable, 0);
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index abb629d7e131..6415379bd3fd 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -204,7 +204,6 @@ void __tlb_remove_table(void *table)
pagetable_pte_dtor_free(ptdesc);
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void pte_free_now(struct rcu_head *head)
{
struct ptdesc *ptdesc = container_of(head, struct ptdesc, pt_rcu_head);
@@ -223,7 +222,6 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
*/
WARN_ON_ONCE(mm_has_pgste(mm));
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
* Base infrastructure required to generate basic asces, region, segment,
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 53d7cb5bbffe..20aaf123c9fc 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2939,7 +2939,6 @@ void pgtable_free(void *table, bool is_page)
kmem_cache_free(pgtable_cache, table);
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void pte_free_now(struct rcu_head *head)
{
struct page *page;
@@ -2956,6 +2955,7 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
call_rcu(&page->rcu_head, pte_free_now);
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd)
{
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711a..197937495a0a 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -233,6 +233,7 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
return pmd;
}
#endif
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
#ifndef pte_free_defer
@@ -252,7 +253,6 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
call_rcu(&page->rcu_head, pte_free_now);
}
#endif /* pte_free_defer */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
--
2.20.1


2024-06-13 09:01:04

by Qi Zheng

Subject: [RFC PATCH 3/3] mm: free empty user PTE pages

In order to pursue high performance, applications mostly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These memory
allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory,
but neither MADV_DONTNEED nor MADV_FREE releases page table memory, which may
cause huge page table memory usage.

The following is a memory usage snapshot of one process, actually observed on
our servers:

VIRT: 55t
RES: 590g
VmPTE: 110g

In this case, most of the page table entries are empty. A PTE page whose
entries are all empty can actually be freed back to the system for others to
use.

Similar to numa_balancing, this commit adds a task_work to scan the
address space of the user process when it returns to user space. If a
suitable empty PTE page is found, it will be released.

The following code snippet shows the effect of the optimization:

mmap 50G
while (1) {
        for (; i < 1024 * 25; i++) {
                touch 2M memory
                madvise MADV_DONTNEED 2M
        }
}

As we can see, the memory usage of VmPTE is reduced:

                 before        after
VIRT            50.0 GB      50.0 GB
RES              3.1 MB       3.1 MB
VmPTE         102640 kB       756 kB (even less)
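
A runnable approximation of the snippet above (a hypothetical standalone test
program, not part of this patch) could look like:

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        /* print the VmPTE line of /proc/self/status */
        static void print_vmpte(const char *when)
        {
                char line[256];
                FILE *f = fopen("/proc/self/status", "r");

                while (f && fgets(line, sizeof(line), f))
                        if (!strncmp(line, "VmPTE:", 6))
                                printf("%s %s", when, line);
                if (f)
                        fclose(f);
        }

        int main(void)
        {
                size_t chunk = 2UL << 20;       /* 2M */
                size_t size = 50UL << 30;       /* 50G */
                char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                for (size_t off = 0; off < size; off += chunk) {
                        /* touch 2M of memory, then discard it again */
                        memset(p + off, 1, chunk);
                        madvise(p + off, chunk, MADV_DONTNEED);
                }
                print_vmpte("after MADV_DONTNEED:"); /* PTE tables remain allocated */

                return 0;
        }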

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm_types.h | 4 +
include/linux/pgtable.h | 14 +++
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 2 +
mm/Makefile | 2 +-
mm/freept.c | 180 +++++++++++++++++++++++++++++++++++++++
mm/khugepaged.c | 18 +++-
8 files changed, 220 insertions(+), 2 deletions(-)
create mode 100644 mm/freept.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ef09c4eef6d3..bbc697fa4a83 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -839,6 +839,10 @@ struct mm_struct {
#endif
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* size of all page tables */
+ /* Next mm_pgtable scan (in jiffies) */
+ unsigned long mm_pgtable_next_scan;
+ /* Restart point for scanning and freeing empty user PTE pages */
+ unsigned long mm_pgtable_scan_offset;
#endif
int map_count; /* number of VMAs */

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fbff20070ca3..4d1cfaa92422 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1589,6 +1589,20 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
}
#endif /* CONFIG_MMU */

+#ifdef CONFIG_MMU
+#define MM_PGTABLE_SCAN_DELAY 100 /* 100ms */
+#define MM_PGTABLE_SCAN_SIZE 256 /* 256MB */
+void init_mm_pgtable_work(struct task_struct *p);
+void task_tick_mm_pgtable(struct task_struct *curr);
+#else
+static inline void init_mm_pgtable_work(struct task_struct *p)
+{
+}
+static inline void task_tick_mm_pgtable(struct task_struct *curr)
+{
+}
+#endif
+
#ifdef CONFIG_MMU

#ifndef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73c874e051f7..5c0f3d96d608 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1485,6 +1485,7 @@ struct task_struct {
#ifdef CONFIG_MMU
struct task_struct *oom_reaper_list;
struct timer_list oom_reaper_timer;
+ struct callback_head pgtable_work;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c663075c86fb..d5f6df6f5c32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->migration_pending = NULL;
#endif
init_sched_mm_cid(p);
+ init_mm_pgtable_work(p);
}

DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..bbc7cbf22eaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12696,6 +12696,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);

+ task_tick_mm_pgtable(curr);
+
update_misfit_status(curr, rq);
check_update_overutilized_status(task_rq(curr));

diff --git a/mm/Makefile b/mm/Makefile
index 8fb85acda1b1..af1a324aa65e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
mm_init.o percpu.o slab_common.o \
compaction.o show_mem.o shmem_quota.o\
interval_tree.o list_lru.o workingset.o \
- debug.o gup.o mmap_lock.o $(mmu-y)
+ debug.o gup.o mmap_lock.o freept.o $(mmu-y)

# Give 'page_alloc' its own module-parameter namespace
page-alloc-y := page_alloc.o
diff --git a/mm/freept.c b/mm/freept.c
new file mode 100644
index 000000000000..ed1ea5535e03
--- /dev/null
+++ b/mm/freept.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/userfaultfd_k.h>
+#include <linux/pagewalk.h>
+#include <linux/task_work.h>
+#include <linux/hugetlb.h>
+#include <asm/tlbflush.h>
+
+void task_tick_mm_pgtable(struct task_struct *curr)
+{
+ struct callback_head *work = &curr->pgtable_work;
+ unsigned long now = jiffies;
+
+ if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
+ work->next != work)
+ return;
+
+ if (time_before(now, READ_ONCE(curr->mm->mm_pgtable_next_scan)))
+ return;
+
+ task_work_add(curr, work, TWA_RESUME);
+}
+
+/*
+ * Locking:
+ * - already held the mmap read lock to traverse the vma tree and pgtable
+ * - use pmd lock for clearing pmd entry
+ * - use pte lock for checking empty PTE page, and release it after clearing
+ * pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
+ * etc after holding this pte lock. Thanks to this, we don't need to hold the
+ * rmap-related locks.
+ * - users of pte_offset_map_lock() etc all expect the PTE page to be stable by
+ * using rcu lock, so use pte_free_defer() to free PTE pages.
+ */
+static int freept_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ struct mmu_notifier_range range;
+ struct mm_struct *mm = walk->mm;
+ pte_t *start_pte, *pte;
+ pmd_t pmdval;
+ spinlock_t *pml = NULL, *ptl;
+ unsigned long haddr = addr;
+ int i;
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+ haddr, haddr + PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ start_pte = pte_offset_map_nolock(mm, pmd, &pmdval, haddr, &ptl);
+ if (!start_pte)
+ goto out;
+
+ pml = pmd_lock(mm, pmd);
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+ goto out_ptl;
+
+ /* Check if it is empty PTE page */
+ for (i = 0, addr = haddr, pte = start_pte;
+ i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+ if (!pte_none(ptep_get(pte)))
+ goto out_ptl;
+ }
+ pte_unmap(start_pte);
+
+ pmd_clear(pmd);
+ flush_tlb_range(walk->vma, haddr, haddr + PMD_SIZE);
+ pmdp_get_lockless_sync();
+ if (ptl != pml)
+ spin_unlock(ptl);
+ spin_unlock(pml);
+
+ mmu_notifier_invalidate_range_end(&range);
+
+ mm_dec_nr_ptes(mm);
+ pte_free_defer(mm, pmd_pgtable(pmdval));
+
+ return 0;
+
+out_ptl:
+ pte_unmap_unlock(start_pte, ptl);
+ if (pml != ptl)
+ spin_unlock(pml);
+out:
+ mmu_notifier_invalidate_range_end(&range);
+
+ return 0;
+}
+
+static const struct mm_walk_ops mm_pgtable_walk_ops = {
+ .pmd_entry = freept_pmd_entry,
+ .walk_lock = PGWALK_RDLOCK,
+};
+
+static void task_mm_pgtable_work(struct callback_head *work)
+{
+ unsigned long now = jiffies, old_scan, next_scan;
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long start, end;
+ struct vma_iterator vmi;
+
+ work->next = work; /* Prevent double-add */
+ if (p->flags & PF_EXITING)
+ return;
+
+ if (!mm->mm_pgtable_next_scan) {
+ mm->mm_pgtable_next_scan = now + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+ return;
+ }
+
+ old_scan = mm->mm_pgtable_next_scan;
+ if (time_before(now, old_scan))
+ return;
+
+ next_scan = now + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+ if (!try_cmpxchg(&mm->mm_pgtable_next_scan, &old_scan, next_scan))
+ return;
+
+ if (!mmap_read_trylock(mm))
+ return;
+
+ start = mm->mm_pgtable_scan_offset;
+ vma_iter_init(&vmi, mm, start);
+ vma = vma_next(&vmi);
+ if (!vma) {
+ mm->mm_pgtable_scan_offset = 0;
+ start = 0;
+ vma_iter_set(&vmi, start);
+ vma = vma_next(&vmi);
+ }
+
+ do {
+ /* Skip hugetlb case */
+ if (is_vm_hugetlb_page(vma))
+ continue;
+
+ /* Leave this to the THP path to handle */
+ if (vma->vm_flags & VM_HUGEPAGE)
+ continue;
+
+ /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
+ if (userfaultfd_wp(vma))
+ continue;
+
+ /* Only consider PTE pages that do not cross vmas */
+ start = ALIGN(vma->vm_start, PMD_SIZE);
+ end = ALIGN_DOWN(vma->vm_end, PMD_SIZE);
+ if (end - start < PMD_SIZE)
+ continue;
+
+ walk_page_range_vma(vma, start, end, &mm_pgtable_walk_ops, NULL);
+
+ if (end - mm->mm_pgtable_scan_offset >= (MM_PGTABLE_SCAN_SIZE << 20))
+ goto out;
+
+ cond_resched();
+ } for_each_vma(vmi, vma);
+
+out:
+ mm->mm_pgtable_scan_offset = vma ? end : 0;
+ mmap_read_unlock(mm);
+}
+
+void init_mm_pgtable_work(struct task_struct *p)
+{
+ struct mm_struct *mm = p->mm;
+ int mm_users = 0;
+
+ if (mm) {
+ mm_users = atomic_read(&mm->mm_users);
+ if (mm_users == 1)
+ mm->mm_pgtable_next_scan = jiffies + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+ }
+ p->pgtable_work.next = &p->pgtable_work; /* Protect against double add */
+ init_task_work(&p->pgtable_work, task_mm_pgtable_work);
+}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2a8703ee876c..a2b96f4ba737 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1581,7 +1581,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
pml = pmd_lock(mm, pmd);

- start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
+ start_pte = pte_offset_map_nolock(mm, pmd, &pgt_pmd, haddr, &ptl);
if (!start_pte) /* mmap_lock + page lock should prevent this */
goto abort;
if (!pml)
@@ -1589,6 +1589,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
else if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

+ /* pmd entry may be changed by others */
+ if (unlikely(!pml && !pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+ goto abort;
+
/* step 2: clear page table and adjust rmap */
for (i = 0, addr = haddr, pte = start_pte;
i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1636,6 +1640,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
pml = pmd_lock(mm, pmd);
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+ spin_unlock(ptl);
+ goto unlock;
+ }
}
pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
pmdp_get_lockless_sync();
@@ -1663,6 +1672,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
}
if (start_pte)
pte_unmap_unlock(start_pte, ptl);
+unlock:
if (pml && pml != ptl)
spin_unlock(pml);
if (notified)
@@ -1722,6 +1732,12 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
mmu_notifier_invalidate_range_start(&range);

pml = pmd_lock(mm, pmd);
+ /* check if the pmd is still valid */
+ if (check_pmd_still_valid(mm, addr, pmd) != SCAN_SUCCEED) {
+ spin_unlock(pml);
+ mmu_notifier_invalidate_range_end(&range);
+ continue;
+ }
ptl = pte_lockptr(mm, pmd);
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
--
2.20.1


2024-06-13 09:05:13

by David Hildenbrand

Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

On 13.06.24 10:38, Qi Zheng wrote:
> Hi all,
>
> This series aims to asynchronously scan and free empty user PTE pages.
>
> 1. Background
> =============
>
> We often find huge user PTE memory usage on our servers, such as the following:
>
> VIRT: 55t
> RES: 590g
> VmPTE: 110g
>
> The root cause is that these processes use some high-performance mmeory
> allocators (such as jemalloc, tcmalloc, etc). These memory allocators use
> madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither
> MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause
> huge page table memory usage.
>
> This issue has been discussed on LSFMM 2022 (led by David Hildenbrand):
>
> topic link: https://lore.kernel.org/linux-mm/[email protected]/
> youtube link: https://www.youtube.com/watch?v=naO_BRhcU68
>
> In the past, I have tried to introduce refcount for PTE pages to solve this
> problem, but these methods [1][2][3] introduced too much complexity.
>
> [1]. https://lore.kernel.org/lkml/[email protected]/
> [2]. https://lore.kernel.org/lkml/[email protected]/
> [3]. https://lore.kernel.org/lkml/[email protected]/
>
> 2. Infrastructure
> =================
>
> Later, in order to freeing retracted page table, Hugh Dickins added a lot of
> PTE-related infrastructure[4][5][6]:
>
> - allow pte_offset_map_lock() etc to fail
> - make PTE pages can be removed without mmap or rmap locks
> (see collapse_pte_mapped_thp() and retract_page_tables())
> - make PTE pages can be freed by RCU (via pte_free_defer())
> - etc
>
> These are all beneficial to freeing empty PTE pages.
>
> [4]. https://lore.kernel.org/all/[email protected]/
> [5]. https://lore.kernel.org/all/[email protected]/
> [6]. https://lore.kernel.org/all/[email protected]/
>

I'm a big fan of this, also for virtio-mem.


> 3. Implementation
> =================
>
> For empty user PTE pages, we don't actually need to free it immediately, nor do
> we need to free all of it.
>
> Therefore, in this patchset, we register a task_work for the user tasks to
> asyncronously scan and free empty PTE pages when they return to user space.
> (The scanning time interval and address space size can be adjusted.)

The question is if we really have to scan asynchronously, or if it would
be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
every now and then. For virtio-mem, and likely most memory allocators,
that might be feasible, and valuable independent of system-wide
automatic scanning.
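
For example, an allocator could do something like this (MADV_PT_RECLAIM is of
course hypothetical at this point, the flag name is just made up here):

        /* free the backing memory of a chunk, as allocators already do */
        madvise(chunk, chunk_size, MADV_DONTNEED);

        /* every now and then, ask the kernel to drop empty page tables */
        madvise(arena_start, arena_size, MADV_PT_RECLAIM);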

>
> When scanning, we can filter out some unsuitable vmas:
>
> - VM_HUGETLB vma
> - VM_UFFD_WP vma

Why is UFFD_WP unsuitable? It should be suitable as long as you make
sure to really only remove page tables that are all pte_none().

> - etc
>
> And for some PTE pages that spans multiple vmas, we can also skip.
>
> For locking:
>
> - use the mmap read lock to traverse the vma tree and pgtable
> - use pmd lock for clearing pmd entry
> - use pte lock for checking empty PTE page, and release it after clearing
> pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
> etc after holding this pte lock. Thanks to this, we don't need to hold the
> rmap-related locks.
> - users of pte_offset_map_lock() etc all expect the PTE page to be stable by
> using rcu lock, so use pte_free_defer() to free PTE pages.
>

I once had a prototype that would scan similar to GUP-fast, using the
mmap lock in read mode and disabling local IRQs and then walking the
page table locklessly (no PTLs). Only when identifying an empty page table
and ripping it out would it have to do more heavy locking (back when we
required the mmap lock in write mode and other things).

I can try digging up that patch if you're interested.

We'll have to double check whether all anon memory cases can *properly*
handle pte_offset_map_lock() failing (not just handling it, but doing
the right thing; most of that anon-only code didn't ever run into that
issue so far, so these code paths were likely never triggered).


> For the path that will also free PTE pages in THP, we need to recheck whether the
> content of pmd entry is valid after holding pmd lock or pte lock.
>
> 4. TODO
> =======
>
> Some applications may be concerned about the overhead of scanning and rebuilding
> page tables, so the following features are considered for implementation in the
> future:
>
> - add per-process switch (via prctl)
> - add a madvise option (like THP)
> - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via procfs file)
>
> Perhaps we can add the refcount to PTE pages in the future as well, which would
> help improve the scanning speed.

I didn't like the added complexity last time, and the problem of
handling situations where we squeeze multiple page tables into a single
"struct page".

--
Cheers,

David / dhildenb


2024-06-13 09:32:45

by Qi Zheng

Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi David,

Thanks for such a quick reply!

On 2024/6/13 17:04, David Hildenbrand wrote:
> On 13.06.24 10:38, Qi Zheng wrote:
>> Hi all,

[...]

>
>
>> 3. Implementation
>> =================
>>
>> For empty user PTE pages, we don't actually need to free it
>> immediately, nor do
>> we need to free all of it.
>>
>> Therefore, in this patchset, we register a task_work for the user
>> tasks to
>> asyncronously scan and free empty PTE pages when they return to user
>> space.
>> (The scanning time interval and address space size can be adjusted.)
>
> The question is, if we really have to scan asynchronously, or if would
> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
> every now and then. For virtio-mem, and likely most memory allocators,
> that might be feasible, and valuable independent of system-wide
> automatic scanning.

Agreed, I also think it is possible to add always && madvise modes
similar to THP.

>
>>
>> When scanning, we can filter out some unsuitable vmas:
>>
>>      - VM_HUGETLB vma
>>      - VM_UFFD_WP vma
>
> Why is UFFD_WP unsuitable? It should be suitable as long as you make
> sure to really only remove page tables that are all pte_none().

Got it, I mistakenly thought pte_none() covered the pte marker case until
I saw pte_none_mostly().

>
>>      - etc
>> And for some PTE pages that spans multiple vmas, we can also skip.
>>
>> For locking:
>>
>>      - use the mmap read lock to traverse the vma tree and pgtable
>>      - use pmd lock for clearing pmd entry
>>      - use pte lock for checking empty PTE page, and release it after
>> clearing
>>        pmd entry, then we can capture the changed pmd in
>> pte_offset_map_lock()
>>        etc after holding this pte lock. Thanks to this, we don't need
>> to hold the
>>        rmap-related locks.
>>      - users of pte_offset_map_lock() etc all expect the PTE page to
>> be stable by
>>        using rcu lock, so use pte_free_defer() to free PTE pages.
>
> I once had a protoype that would scan similar to GUP-fast, using the
> mmap lock in read mode and disabling local IRQs and then walking the
> page table locklessly (no PTLs). Only when identifying an empty page and
> ripping out the page table, it would have to do more heavy locking (back
> when we required the mmap lock in write mode and other things).

Maybe the mmap write lock is not necessary; we can protect it using the
pmd lock && pte lock as above.

>
> I can try digging up that patch if you're interested.

Yes, that would be better, maybe it can provide more inspiration!

>
> We'll have to double check whether all anon memory cases can *properly*
> handle pte_offset_map_lock() failing (not just handling it, but doing
> the right thing; most of that anon-only code didn't ever run into that
> issue so far, so these code paths were likely never triggered).

Yeah, I'll keep checking this out too.

>
>
>> For the path that will also free PTE pages in THP, we need to recheck
>> whether the
>> content of pmd entry is valid after holding pmd lock or pte lock.
>>
>> 4. TODO
>> =======
>>
>> Some applications may be concerned about the overhead of scanning and
>> rebuilding
>> page tables, so the following features are considered for
>> implementation in the
>> future:
>>
>>      - add per-process switch (via prctl)
>>      - add a madvise option (like THP)
>>      - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>> procfs file)
>> Perhaps we can add the refcount to PTE pages in the future as well,
>> which would
>> help improve the scanning speed.
>
> I didn't like the added complexity last time, and the problem of
> handling situations where we squeeze multiple page tables into a single
> "struct page".

OK, except for refcount, do you think the other three todos above are
still worth doing?

Thanks,
Qi

>

2024-06-13 10:25:22

by David Hildenbrand

Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

On 13.06.24 11:32, Qi Zheng wrote:
> Hi David,
>
> Thanks for such a quick reply!

I appreciate you working on this :)

>
> On 2024/6/13 17:04, David Hildenbrand wrote:
>> On 13.06.24 10:38, Qi Zheng wrote:
>>> Hi all,
>
> [...]
>
>>
>>
>>> 3. Implementation
>>> =================
>>>
>>> For empty user PTE pages, we don't actually need to free it
>>> immediately, nor do
>>> we need to free all of it.
>>>
>>> Therefore, in this patchset, we register a task_work for the user
>>> tasks to
>>> asyncronously scan and free empty PTE pages when they return to user
>>> space.
>>> (The scanning time interval and address space size can be adjusted.)
>>
>> The question is, if we really have to scan asynchronously, or if would
>> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
>> every now and then. For virtio-mem, and likely most memory allocators,
>> that might be feasible, and valuable independent of system-wide
>> automatic scanning.
>
> Agree, I also think it is possible to add always && madvise modes
> simliar to THP.

My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
synchronously try to reclaim page tables without any asynchronous work.

Similar to MADV_COLLAPSE that only does synchronous work. Of course,
if we don't need any heavy locking for reclaim, we might also just
try reclaiming during MADV_DONTNEED when spanning a complete page
table. That won't sort out all cases where reclaim is possible, but
with both approaches we could cover quite a lot of the cases that were
discovered to really result in a lot of empty page tables.

On top, we might implement some asynchronous scanning later. This is,
of course, TBD. Maybe we could wire up other page table scanners
(khugepaged?) to simply reclaim empty page tables they find as well?

>
>>
>>>
>>> When scanning, we can filter out some unsuitable vmas:
>>>
>>>      - VM_HUGETLB vma
>>>      - VM_UFFD_WP vma
>>
>> Why is UFFD_WP unsuitable? It should be suitable as long as you make
>> sure to really only remove page tables that are all pte_none().
>
> Got it, I mistakenly thought pte_none() covered pte marker case until
> I saw pte_none_mostly().

I *think* there is one nasty detail, and we might need an arch callback
to test whether a pte *really* can be reclaimed: for example, s390x might
require us to keep some !pte_none() page tables.

While a PTE might be none, the s390x PGSTE (think of it as another
8byte per PTE entry stored right next to the actual page table
entries) might hold data we might have to preserve for our KVM guest.

But that should be easy to wire up.

>
>>
>>>      - etc
>>> And for some PTE pages that spans multiple vmas, we can also skip.
>>>
>>> For locking:
>>>
>>>      - use the mmap read lock to traverse the vma tree and pgtable
>>>      - use pmd lock for clearing pmd entry
>>>      - use pte lock for checking empty PTE page, and release it after
>>> clearing
>>>        pmd entry, then we can capture the changed pmd in
>>> pte_offset_map_lock()
>>>        etc after holding this pte lock. Thanks to this, we don't need
>>> to hold the
>>>        rmap-related locks.
>>>      - users of pte_offset_map_lock() etc all expect the PTE page to
>>> be stable by
>>>        using rcu lock, so use pte_free_defer() to free PTE pages.
>>
>> I once had a protoype that would scan similar to GUP-fast, using the
>> mmap lock in read mode and disabling local IRQs and then walking the
>> page table locklessly (no PTLs). Only when identifying an empty page and
>> ripping out the page table, it would have to do more heavy locking (back
>> when we required the mmap lock in write mode and other things).
>
> Maybe mmap write lock is not necessary, we can protect it using pmd lock
> && pte lock as above.

Yes, I'm hoping we can do that, that will solve a lot of possible issues.

>
>>
>> I can try digging up that patch if you're interested.
>
> Yes, that would be better, maybe it can provide more inspiration!

I pushed it to
https://github.com/davidhildenbrand/linux/tree/page_table_reclaim

I suspect it's a non-working version (and I assume the locking is broken,
there are no VMA checks, etc); it's an old prototype, just to give you an
idea of the lockless scanning and how I started by triggering reclaim only
when kicked off by user space.

>
>>
>> We'll have to double check whether all anon memory cases can *properly*
>> handle pte_offset_map_lock() failing (not just handling it, but doing
>> the right thing; most of that anon-only code didn't ever run into that
>> issue so far, so these code paths were likely never triggered).
>
> Yeah, I'll keep checking this out too.
>
>>
>>
>>> For the path that will also free PTE pages in THP, we need to recheck
>>> whether the
>>> content of pmd entry is valid after holding pmd lock or pte lock.
>>>
>>> 4. TODO
>>> =======
>>>
>>> Some applications may be concerned about the overhead of scanning and
>>> rebuilding
>>> page tables, so the following features are considered for
>>> implementation in the
>>> future:
>>>
>>>      - add per-process switch (via prctl)
>>>      - add a madvise option (like THP)
>>>      - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>> procfs file)
>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>> which would
>>> help improve the scanning speed.
>>
>> I didn't like the added complexity last time, and the problem of
>> handling situations where we squeeze multiple page tables into a single
>> "struct page".
>
> OK, except for refcount, do you think the other three todos above are
> still worth doing?

I think the question is where we start: for example, only synchronous
reclaim vs. asynchronous reclaim. Synchronous reclaim won't really affect
workloads that do not actively trigger it, so it raises far fewer eyebrows ...
and some user space might have a good idea where it makes sense to try to
reclaim, and when.

So the other things you note here rather affect asynchronous reclaim, and
might be reasonable in that context. But not sure if we should start with doing
things asynchronously.

--
Cheers,

David / dhildenb


2024-06-13 12:22:23

by Qi Zheng

Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi,

On 2024/6/13 18:25, David Hildenbrand wrote:
> On 13.06.24 11:32, Qi Zheng wrote:
>> Hi David,
>>
>> Thanks for such a quick reply!
>
> I appreciate you working on this :)
>
>>
>> On 2024/6/13 17:04, David Hildenbrand wrote:
>>> On 13.06.24 10:38, Qi Zheng wrote:
>>>> Hi all,
>>
>> [...]
>>
>>>
>>>
>>>> 3. Implementation
>>>> =================
>>>>
>>>> For empty user PTE pages, we don't actually need to free it
>>>> immediately, nor do
>>>> we need to free all of it.
>>>>
>>>> Therefore, in this patchset, we register a task_work for the user
>>>> tasks to
>>>> asyncronously scan and free empty PTE pages when they return to user
>>>> space.
>>>> (The scanning time interval and address space size can be adjusted.)
>>>
>>> The question is, if we really have to scan asynchronously, or if would
>>> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
>>> every now and then. For virtio-mem, and likely most memory allocators,
>>> that might be feasible, and valuable independent of system-wide
>>> automatic scanning.
>>
>> Agree, I also think it is possible to add always && madvise modes
>> simliar to THP.
>
> My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
> synchronously try to reclaim page tables without any asynchronous work.
>
> Similar to MADV_COLLAPSE that only does synchronous work. Of course,

This is feasible, but I worry that some user-mode programs may not be
able to determine when to call it.

My previous idea was to do something similar to madvise(MADV_HUGEPAGE):
just mark the vma as eligible for pgtable reclaim, and then hand it over
to a background thread for asynchronous reclaim.

> if we don't need any heavy locking for reclaim, we might also just
> try reclaiming during MADV_DONTNEED when spanning a complete page

I think the locking in the current solution is not too heavy and should
be acceptable.

But the MADV_FREE case still needs to be handled by madvise(MADV_PT_RECLAIM)
or asynchronous work.

> table. That won't sort out all cases where reclaim is possible, but
> with both approaches we could cover quite a lot that were discovered
> to really result in a lot of emprt page tables.

Yes, agree.

>
> On top, we might implement some asynchronous scanning later, This is,
> of course, TBD. Maybe we could wire up other page table scanners
> (khugepaged ?) to simply reclaim empty page tables it finds as well?

This is also an idea. Another option may be some pgtable scanning paths,
such as MGLRU.

>
>>
>>>
>>>>
>>>> When scanning, we can filter out some unsuitable vmas:
>>>>
>>>>       - VM_HUGETLB vma
>>>>       - VM_UFFD_WP vma
>>>
>>> Why is UFFD_WP unsuitable? It should be suitable as long as you make
>>> sure to really only remove page tables that are all pte_none().
>>
>> Got it, I mistakenly thought pte_none() covered pte marker case until
>> I saw pte_none_mostly().
>
> I *think* there is one nasty detail, and we might need an arch callback
> to test if a pte is *really* can be reclaimed: for example, s390x might
> require us keeping some !pte_none() page tables.
>
> While a PTE might be none, the s390x PGSTE (think of it as another
> 8byte per PTE entry stored right next to the actual page table
> entries) might hold data we might have to preserve for our KVM guest.

Oh, thanks for adding this background information!

>
> But that should be easy to wire up.

That's good!

>
>>
>>>
>>>>       - etc
>>>> And for some PTE pages that spans multiple vmas, we can also skip.
>>>>
>>>> For locking:
>>>>
>>>>       - use the mmap read lock to traverse the vma tree and pgtable
>>>>       - use pmd lock for clearing pmd entry
>>>>       - use pte lock for checking empty PTE page, and release it after
>>>> clearing
>>>>         pmd entry, then we can capture the changed pmd in
>>>> pte_offset_map_lock()
>>>>         etc after holding this pte lock. Thanks to this, we don't need
>>>> to hold the
>>>>         rmap-related locks.
>>>>       - users of pte_offset_map_lock() etc all expect the PTE page to
>>>> be stable by
>>>>         using rcu lock, so use pte_free_defer() to free PTE pages.
>>>
>>> I once had a protoype that would scan similar to GUP-fast, using the
>>> mmap lock in read mode and disabling local IRQs and then walking the
>>> page table locklessly (no PTLs). Only when identifying an empty page and
>>> ripping out the page table, it would have to do more heavy locking (back
>>> when we required the mmap lock in write mode and other things).
>>
>> Maybe mmap write lock is not necessary, we can protect it using pmd lock
>> && pte lock as above.
>
> Yes, I'm hoping we can do that, that will solve a lot of possible issues.

Yes, I think the protection provided by the locks above is enough. Of
course, it would be better if more people could double-check it.

>
>>
>>>
>>> I can try digging up that patch if you're interested.
>>
>> Yes, that would be better, maybe it can provide more inspiration!
>
> I pushed it to
>     https://github.com/davidhildenbrand/linux/tree/page_table_reclaim
>
> I suspect it's a non-working version (and I assume the locking is
> broken, there
> are no VMA checks, etc), it's an old prototype. Just to give you an idea
> about the
> lockless scanning and how I started by triggering reclaim only when
> kicked-off by
> user space.

Many thanks! But I'm worried that on some platforms disabling IRQs might be
more expensive than holding the lock, such as on arm64? Not sure.

>
>>
>>>
>>> We'll have to double check whether all anon memory cases can *properly*
>>> handle pte_offset_map_lock() failing (not just handling it, but doing
>>> the right thing; most of that anon-only code didn't ever run into that
>>> issue so far, so these code paths were likely never triggered).
>>
>> Yeah, I'll keep checking this out too.
>>
>>>
>>>
>>>> For the path that will also free PTE pages in THP, we need to recheck
>>>> whether the
>>>> content of pmd entry is valid after holding pmd lock or pte lock.
>>>>
>>>> 4. TODO
>>>> =======
>>>>
>>>> Some applications may be concerned about the overhead of scanning and
>>>> rebuilding
>>>> page tables, so the following features are considered for
>>>> implementation in the
>>>> future:
>>>>
>>>>       - add per-process switch (via prctl)
>>>>       - add a madvise option (like THP)
>>>>       - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>>> procfs file)
>>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>>> which would
>>>> help improve the scanning speed.
>>>
>>> I didn't like the added complexity last time, and the problem of
>>> handling situations where we squeeze multiple page tables into a single
>>> "struct page".
>>
>> OK, except for refcount, do you think the other three todos above are
>> still worth doing?
>
> I think the question is from where we start: for example, only synchronous
> reclaim vs. asynchonous reclaim. Synchronous reclaim won't really affect
> workloads that do not actively trigger it, so it raises a lot less
> eyebrows. ...
> and some user space might have a good idea where it makes sense to try to
> reclaim, and when.
>
> So the other things you note here rather affect asynchronous reclaim, and
> might be reasonable in that context. But not sure if we should start
> with doing
> things asynchronously.

I think synchronous and asynchronous reclaim have their own advantages and
disadvantages, and are complementary. Perhaps they can be implemented at
the same time?

Thanks,
Qi

>

2024-06-14 03:32:59

by Qi Zheng

Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi David,

How about starting with this:

a. for the MADV_DONTNEED case, try synchronous reclaim as you said
b. for the MADV_FREE case, add a madvise(MADV_PT_RECLAIM) option to mark
the vma, add its corresponding mm to a global list, and then traverse
the list and reclaim when memory is tight and the system enters the
reclaim path.

(If this option is meant for synchronous reclaim as you said, then the
user-mode program may need to start a thread to call it periodically.
I'm not sure if that usage makes sense; if so, I can also implement
such an option.)
c. for the s390 case you mentioned, maybe we can add a CONFIG_FREE_PT
option first, and s390 will not select this config until the problem
is solved.
d. for the lockless scan, we can use pte_offset_map_nolock() instead of
disabling IRQs, because we hold the RCU read lock at this time, which
also ensures that the PTE page is not freed.

Thanks,
Qi

On 2024/6/13 19:59, Qi Zheng wrote:
> Hi,
>
> On 2024/6/13 18:25, David Hildenbrand wrote:
>> On 13.06.24 11:32, Qi Zheng wrote:
>>> Hi David,
>>>
>>> Thanks for such a quick reply!
>>
>> I appreciate you working on this :)
>>
>>>
>>> On 2024/6/13 17:04, David Hildenbrand wrote:
>>>> On 13.06.24 10:38, Qi Zheng wrote:
>>>>> Hi all,
>>>
>>> [...]
>>>
>>>>
>>>>
>>>>> 3. Implementation
>>>>> =================
>>>>>
>>>>> For empty user PTE pages, we don't actually need to free it
>>>>> immediately, nor do
>>>>> we need to free all of it.
>>>>>
>>>>> Therefore, in this patchset, we register a task_work for the user
>>>>> tasks to
>>>>> asyncronously scan and free empty PTE pages when they return to user
>>>>> space.
>>>>> (The scanning time interval and address space size can be adjusted.)
>>>>
>>>> The question is, if we really have to scan asynchronously, or if would
>>>> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
>>>> every now and then. For virtio-mem, and likely most memory allocators,
>>>> that might be feasible, and valuable independent of system-wide
>>>> automatic scanning.
>>>
>>> Agree, I also think it is possible to add always && madvise modes
>>> simliar to THP.
>>
>> My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
>> synchronously try to reclaim page tables without any asynchronous work.
>>
>> Similar to MADV_COLLAPSE that only does synchronous work. Of course,
>
> This is feasible, but I worry that some user-mode programs may not be
> able to determine when to call it.
>
> My previous idea was to do something similar to madvise(MADV_HUGEPAGE),
> just mark the vma as being able to reclaim the pgtable, and then hand
> it over to the background thread for asynchronous reclaim.
>
>> if we don't need any heavy locking for reclaim, we might also just
>> try reclaiming during MADV_DONTNEED when spanning a complete page
>
> I think the lock held by the current solution is not too heavy and
> should be acceptable.
>
> But for MADV_FREE case, it still needs to be handled by
> madvise(MADV_PT_RECLAIM) or asynchronous work.
>
>> table. That won't sort out all cases where reclaim is possible, but
>> with both approaches we could cover quite a lot that were discovered
>> to really result in a lot of emprt page tables.
>
> Yes, agree.
>
>>
>> On top, we might implement some asynchronous scanning later, This is,
>> of course, TBD. Maybe we could wire up other page table scanners
>> (khugepaged ?) to simply reclaim empty page tables it finds as well?
>
> This is also an idea. Another option may be some pgtable scanning paths,
> such as MGLRU.
>
>>
>>>
>>>>
>>>>>
>>>>> When scanning, we can filter out some unsuitable vmas:
>>>>>
>>>>>       - VM_HUGETLB vma
>>>>>       - VM_UFFD_WP vma
>>>>
>>>> Why is UFFD_WP unsuitable? It should be suitable as long as you make
>>>> sure to really only remove page tables that are all pte_none().
>>>
>>> Got it, I mistakenly thought pte_none() covered pte marker case until
>>> I saw pte_none_mostly().
>>
>> I *think* there is one nasty detail, and we might need an arch callback
>> to test if a pte is *really* can be reclaimed: for example, s390x might
>> require us keeping some !pte_none() page tables.
>>
>> While a PTE might be none, the s390x PGSTE (think of it as another
>> 8byte per PTE entry stored right next to the actual page table
>> entries) might hold data we might have to preserve for our KVM guest.
>
> Oh, thanks for adding this background information!
>
>>
>> But that should be easy to wire up.
>
> That's good!
>
>>
>>>
>>>>
>>>>>       - etc
>>>>> And for some PTE pages that spans multiple vmas, we can also skip.
>>>>>
>>>>> For locking:
>>>>>
>>>>>       - use the mmap read lock to traverse the vma tree and pgtable
>>>>>       - use pmd lock for clearing pmd entry
>>>>>       - use pte lock for checking empty PTE page, and release it after
>>>>> clearing
>>>>>         pmd entry, then we can capture the changed pmd in
>>>>> pte_offset_map_lock()
>>>>>         etc after holding this pte lock. Thanks to this, we don't need
>>>>> to hold the
>>>>>         rmap-related locks.
>>>>>       - users of pte_offset_map_lock() etc all expect the PTE page to
>>>>> be stable by
>>>>>         using rcu lock, so use pte_free_defer() to free PTE pages.
>>>>
>>>> I once had a protoype that would scan similar to GUP-fast, using the
>>>> mmap lock in read mode and disabling local IRQs and then walking the
>>>> page table locklessly (no PTLs). Only when identifying an empty page
>>>> and
>>>> ripping out the page table, it would have to do more heavy locking
>>>> (back
>>>> when we required the mmap lock in write mode and other things).
>>>
>>> Maybe mmap write lock is not necessary, we can protect it using pmd lock
>>> && pte lock as above.
>>
>> Yes, I'm hoping we can do that, that will solve a lot of possible issues.
>
> Yes, I think the protection provided by the locks above is enough. Of
> course, it would be better if more people could double-check it.
>
>>
>>>
>>>>
>>>> I can try digging up that patch if you're interested.
>>>
>>> Yes, that would be better, maybe it can provide more inspiration!
>>
>> I pushed it to
>>      https://github.com/davidhildenbrand/linux/tree/page_table_reclaim
>>
>> I suspect it's a non-working version (and I assume the locking is broken,
>> there are no VMA checks, etc), it's an old prototype. Just to give you an
>> idea about the lockless scanning and how I started by triggering reclaim
>> only when kicked-off by user space.
>
> Many thanks! But I'm worried that on some platforms disabling IRQs
> might be more expensive than holding the lock, such as arm64? Not sure.
>
>>
>>>
>>>>
>>>> We'll have to double check whether all anon memory cases can *properly*
>>>> handle pte_offset_map_lock() failing (not just handling it, but doing
>>>> the right thing; most of that anon-only code didn't ever run into that
>>>> issue so far, so these code paths were likely never triggered).
>>>
>>> Yeah, I'll keep checking this out too.
>>>
>>>>
>>>>
>>>>> For the path that will also free PTE pages in THP, we need to recheck
>>>>> whether the
>>>>> content of pmd entry is valid after holding pmd lock or pte lock.
>>>>>
>>>>> 4. TODO
>>>>> =======
>>>>>
>>>>> Some applications may be concerned about the overhead of scanning and
>>>>> rebuilding
>>>>> page tables, so the following features are considered for
>>>>> implementation in the
>>>>> future:
>>>>>
>>>>>       - add per-process switch (via prctl)
>>>>>       - add a madvise option (like THP)
>>>>>       - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>>>> procfs file)
>>>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>>>> which would
>>>>> help improve the scanning speed.
>>>>
>>>> I didn't like the added complexity last time, and the problem of
>>>> handling situations where we squeeze multiple page tables into a single
>>>> "struct page".
>>>
>>> OK, except for refcount, do you think the other three todos above are
>>> still worth doing?
>>
>> I think the question is from where we start: for example, only synchronous
>> reclaim vs. asynchronous reclaim. Synchronous reclaim won't really affect
>> workloads that do not actively trigger it, so it raises a lot less
>> eyebrows. ...
>> and some user space might have a good idea where it makes sense to try to
>> reclaim, and when.
>>
>> So the other things you note here rather affect asynchronous reclaim, and
>> might be reasonable in that context. But not sure if we should start
>> with doing
>> things asynchronously.
>
> I think synchronous and asynchronous have their own advantages and
> disadvantages, and are complementary. Perhaps they can be implemented at
> the same time?
>
> Thanks,
> Qi
>
>>

2024-06-14 07:54:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

>> My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
>> synchronously try to reclaim page tables without any asynchronous work.
>>
>> Similar to MADV_COLLAPSE that only does synchronous work. Of course,
>
> This is feasible, but I worry that some user-mode programs may not be
> able to determine when to call it.

Some yes, but others clearly can :) Meaning, it's one step in the right
direction without having to worry about asynchronous work in the kernel
for now. That doesn't mean that the asynchronous option is off the table.
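
For illustration, a minimal user-space sketch of how such a synchronous
interface could be used; MADV_PT_RECLAIM does not exist in any kernel yet,
and its value below is made up:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_PT_RECLAIM
#define MADV_PT_RECLAIM 0x100	/* hypothetical value, not a kernel ABI */
#endif

/*
 * After returning a large range to the kernel, also ask it to drop the
 * now-empty page tables covering that range, synchronously.
 */
static void release_range(void *addr, size_t len)
{
	(void)madvise(addr, len, MADV_DONTNEED);	/* drop the pages */
	(void)madvise(addr, len, MADV_PT_RECLAIM);	/* drop empty page tables */
}
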

>
> My previous idea was to do something similar to madvise(MADV_HUGEPAGE),
> just mark the vma as being able to reclaim the pgtable, and then hand
> it over to the background thread for asynchronous reclaim.

That's one option, although there might be workloads where you really
don't have to scan asynchronously and possibly repeatedly.

For example, after virtio-mem discarded some memory it hotunplugged from
a VM using MADV_DONTNEED (in a sequence of multiple steps), it could
just setup a timer to free up page tables after a while exactly once. No
need to scan repeatedly / multiple times if virtio-mem didn't remove any
memory from a VM.

For memory allocators it could be similar: trigger it once (from another
thread?) on a range after sufficient freeing happened. If the workload
is mostly idle, there might not be a need to free up memory.
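
As a sketch of that "trigger once after sufficient freeing" idea on the
allocator side (all names and the threshold are invented for illustration,
reusing the hypothetical MADV_PT_RECLAIM from above):

#define PT_RECLAIM_THRESHOLD	(1UL << 30)	/* e.g. after 1 GiB was freed */

struct arena {
	void *base;
	size_t size;
	size_t freed_since_reclaim;
};

/* Called whenever the allocator returns memory from this arena to the kernel. */
static void arena_note_free(struct arena *a, size_t len)
{
	a->freed_since_reclaim += len;
	if (a->freed_since_reclaim >= PT_RECLAIM_THRESHOLD) {
		/* One synchronous pass over the whole arena, then reset. */
		(void)madvise(a->base, a->size, MADV_PT_RECLAIM);
		a->freed_since_reclaim = 0;
	}
}
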

(mostly focused on anonymous memory + shmem for now. With file-backed
memory it might be different, but that has so far not been the biggest
consumer we saw regarding page tables.)

Of course, for real asynchronous/automatic scanning in the kernel, one
could try finding clues when scanning is reasonable: for example, mark
page tables that have been scanned and there was nothing to reclaim, and
mark page tables when modifying them. But such optimizations are rather
future work I guess, because devil is in the detail.

>
>> if we don't need any heavy locking for reclaim, we might also just
>> try reclaiming during MADV_DONTNEED when spanning a complete page
>
> I think the lock held by the current solution is not too heavy and
> should be acceptable.
>
> But for MADV_FREE case, it still needs to be handled by
> madvise(MADV_PT_RECLAIM) or asynchronous work.

Yes. Interestingly, reclaim code might be able to do that scanning +
reclaim if locking is cheap.

>
>> table. That won't sort out all cases where reclaim is possible, but
>> with both approaches we could cover quite a lot that were discovered
>> to really result in a lot of empty page tables.
>
> Yes, agree.
>
>>
>> On top, we might implement some asynchronous scanning later. This is,
>> of course, TBD. Maybe we could wire up other page table scanners
>> (khugepaged ?) to simply reclaim empty page tables it finds as well?
>
> This is also an idea. Another option may be to reuse some existing pgtable
> scanning paths, such as MGLRU.
>

Exactly.

>>
>>>
>>>>
>>>>>
>>>>> When scanning, we can filter out some unsuitable vmas:
>>>>>
>>>>>       - VM_HUGETLB vma
>>>>>       - VM_UFFD_WP vma
>>>>
>>>> Why is UFFD_WP unsuitable? It should be suitable as long as you make
>>>> sure to really only remove page tables that are all pte_none().
>>>
>>> Got it, I mistakenly thought pte_none() covered pte marker case until
>>> I saw pte_none_mostly().
>>
>> I *think* there is one nasty detail, and we might need an arch callback
>> to test if a pte *really* can be reclaimed: for example, s390x might
>> require us to keep some !pte_none() page tables.
>>
>> While a PTE might be none, the s390x PGSTE (think of it as another
>> 8byte per PTE entry stored right next to the actual page table
>> entries) might hold data we might have to preserve for our KVM guest.
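
One possible shape for such an arch hook, purely as a sketch (the hook name
is invented; s390 would override it to also look at the PGSTE next to the
entry before declaring it reclaimable):

/*
 * Generic default: an entry can be reclaimed iff it is pte_none().
 * An architecture like s390 would provide its own version that also checks
 * that the PGSTE carries no state that has to be preserved.
 */
#ifndef arch_pte_reclaimable
static inline bool arch_pte_reclaimable(struct mm_struct *mm, pte_t *ptep)
{
	return pte_none(ptep_get(ptep));
}
#endif
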
>
> Oh, thanks for adding this background information!
>
>>
>> But that should be easy to wire up.
>
> That's good!
>
>>
>>>
>>>>
>>>>>       - etc
>>>>> And for some PTE pages that span multiple vmas, we can also skip them.
>>>>>
>>>>> For locking:
>>>>>
>>>>>       - use the mmap read lock to traverse the vma tree and pgtable
>>>>>       - use pmd lock for clearing pmd entry
>>>>>       - use pte lock for checking empty PTE page, and release it after
>>>>>         clearing pmd entry, then we can capture the changed pmd in
>>>>>         pte_offset_map_lock() etc after holding this pte lock. Thanks to
>>>>>         this, we don't need to hold the rmap-related locks.
>>>>>       - users of pte_offset_map_lock() etc all expect the PTE page to be
>>>>>         stable by using rcu lock, so use pte_free_defer() to free PTE pages.
>>>>
>>>> I once had a prototype that would scan similar to GUP-fast, using the
>>>> mmap lock in read mode and disabling local IRQs and then walking the
>>>> page table locklessly (no PTLs). Only when identifying an empty page and
>>>> ripping out the page table, it would have to do more heavy locking (back
>>>> when we required the mmap lock in write mode and other things).
>>>
>>> Maybe mmap write lock is not necessary, we can protect it using pmd lock
>>> && pte lock as above.
>>
>> Yes, I'm hoping we can do that, that will solve a lot of possible issues.
>
> Yes, I think the protection provided by the locks above is enough. Of
> course, it would be better if more people could double-check it.
>
>>
>>>
>>>>
>>>> I can try digging up that patch if you're interested.
>>>
>>> Yes, that would be better, maybe it can provide more inspiration!
>>
>> I pushed it to
>>     https://github.com/davidhildenbrand/linux/tree/page_table_reclaim
>>
>> I suspect it's a non-working version (and I assume the locking is broken,
>> there are no VMA checks, etc), it's an old prototype. Just to give you an
>> idea about the lockless scanning and how I started by triggering reclaim
>> only when kicked-off by user space.
>
> Many thanks! But I'm worried that on some platforms disabling IRQs
> might be more expensive than holding the lock, such as arm64? Not sure.

Scanning completely locklessly (no mmap lock, no PT locks) means that --
as long as there is not much to reclaim (for most workloads the common
case!) -- you would not affect the workload at all.

Take a look at the khugepaged logic that does mmap_read_trylock(mm) and
makes sure to drop the mmap lock frequently due to
khugepaged_pages_to_scan, to not affect the workload too much while
scanning.
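
Something along those lines, as a rough sketch (the budget, the cursor
handling and the function name are all made up, and the actual vma/pmd walk
is omitted):

#define PT_SCAN_BUDGET	512	/* PTE pages examined per round, arbitrary */

static void pt_scan_one_round(struct mm_struct *mm, unsigned long *cursor)
{
	unsigned long addr = *cursor;
	int budget = PT_SCAN_BUDGET;

	if (!mmap_read_trylock(mm))
		return;			/* contended: just skip this round */

	while (budget--) {
		/*
		 * Find the vma and pmd covering 'addr', skip unsuitable vmas
		 * (VM_HUGETLB etc.) and try to reclaim the PTE page if empty.
		 */
		addr += PMD_SIZE;
	}

	*cursor = addr;
	mmap_read_unlock(mm);
	cond_resched();
}
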

>
>>
>>>
>>>>
>>>> We'll have to double check whether all anon memory cases can *properly*
>>>> handle pte_offset_map_lock() failing (not just handling it, but doing
>>>> the right thing; most of that anon-only code didn't ever run into that
>>>> issue so far, so these code paths were likely never triggered).
>>>
>>> Yeah, I'll keep checking this out too.
>>>
>>>>
>>>>
>>>>> For the path that will also free PTE pages in THP, we need to recheck
>>>>> whether the
>>>>> content of pmd entry is valid after holding pmd lock or pte lock.
>>>>>
>>>>> 4. TODO
>>>>> =======
>>>>>
>>>>> Some applications may be concerned about the overhead of scanning and
>>>>> rebuilding
>>>>> page tables, so the following features are considered for
>>>>> implementation in the
>>>>> future:
>>>>>
>>>>>       - add per-process switch (via prctl)
>>>>>       - add a madvise option (like THP)
>>>>>       - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>>>> procfs file)
>>>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>>>> which would
>>>>> help improve the scanning speed.
>>>>
>>>> I didn't like the added complexity last time, and the problem of
>>>> handling situations where we squeeze multiple page tables into a single
>>>> "struct page".
>>>
>>> OK, except for refcount, do you think the other three todos above are
>>> still worth doing?
>>
>> I think the question is from where we start: for example, only synchronous
>> reclaim vs. asynchronous reclaim. Synchronous reclaim won't really affect
>> workloads that do not actively trigger it, so it raises a lot less
>> eyebrows. ...
>> and some user space might have a good idea where it makes sense to try to
>> reclaim, and when.
>>
>> So the other things you note here rather affect asynchronous reclaim, and
>> might be reasonable in that context. But not sure if we should start
>> with doing
>> things asynchronously.
>
> I think synchronous and asynchronous have their own advantages and
> disadvantages, and are complementary. Perhaps they can be implemented at
> the same time?


No strong opinion; something synchronous sounds to me like the
low-hanging fruit that could add the infrastructure to be used by
something more advanced/asynchronous :)

--
Cheers,

David / dhildenb


2024-06-14 10:58:42

by Qi Zheng

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi David,

On 2024/6/14 15:53, David Hildenbrand wrote:
>>> My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
>>> synchronously try to reclaim page tables without any asynchronous work.
>>>
>>> Similar to MADV_COLLAPSE that only does synchronous work. Of course,
>>
>> This is feasible, but I worry that some user-mode programs may not be
>> able to determine when to call it.
>
> Some yes, but others clearly can :) Meaning, it's one step in the right
> direction without having to worry about asynchronous work in the kernel
> for now. That doesn't mean that the asynchronous option is off the table.

Got it. I will try to implement a synchronous madvise option in the
next version.

>
>>
>> My previous idea was to do something similar to madvise(MADV_HUGEPAGE),
>> just mark the vma as being able to reclaim the pgtable, and then hand
>> it over to the background thread for asynchronous reclaim.
>
> That's one option, although there might be workloads where you really
> don't have to scan asynchronously and possibly repeatedly.
>
> For example, after virtio-mem discarded some memory it hotunplugged from
> a VM using MADV_DONTNEED (in a sequence of multiple steps), it could
> just setup a timer to free up page tables after a while exactly once. No
> need to scan repeatedly / multiple times if virtio-mem didn't remove any
> memory from a VM.
>
> For memory allocators it could be similar: trigger it once (from another
> thread?) on a range after sufficient freeing happened. If the workload
> is mostly idle, there might not be a need to free up memory.

Thanks for adding this information!

>
> (mostly focused on anonymous memory + shmem for now. With file-backed
> memory it might be different, but that has so far not been the biggest
> consumer we saw regarding page tables.)

OK.

>
> Of course, for real asynchronous/automatic scanning in the kernel, one
> could try finding clues when scanning is reasonable: for example, mark
> page tables that have been scanned and there was nothing to reclaim, and
> mark page tables when modifying them. But such optimizations are rather
> future work I guess, because devil is in the detail.

Yes, we can optimize it step by step.

>
>>
>>> if we don't need any heavy locking for reclaim, we might also just
>>> try reclaiming during MADV_DONTNEED when spanning a complete page
>>
>> I think the lock held by the current solution is not too heavy and
>> should be acceptable.
>>
>> But for MADV_FREE case, it still needs to be handled by
>> madvise(MADV_PT_RECLAIM) or asynchronous work.
>
> Yes. Interestingly, reclaim code might be able to do that scanning +
> reclaim if locking is cheap.

Yes, I am also considering implementing another madvise option in the
next version:

mark the vma, add its corresponding mm to a global list, and then
traverse that list and reclaim page tables when memory is tight and
we enter the system reclaim path.
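
A rough sketch of what the marking side could look like; every name here is
invented, including the list_head that would have to be added to mm_struct
(and initialized at mm creation). The reclaim path would then walk
pt_reclaim_mms when memory is tight:

static LIST_HEAD(pt_reclaim_mms);
static DEFINE_SPINLOCK(pt_reclaim_lock);

/* Called from the (hypothetical) madvise handler when a vma gets marked. */
static void pt_reclaim_queue_mm(struct mm_struct *mm)
{
	spin_lock(&pt_reclaim_lock);
	if (list_empty(&mm->pt_reclaim_node)) {	/* invented member of mm_struct */
		mmgrab(mm);	/* keep the mm_struct alive while it is queued */
		list_add_tail(&mm->pt_reclaim_node, &pt_reclaim_mms);
	}
	spin_unlock(&pt_reclaim_lock);
}
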


>
>>
>>> table. That won't sort out all cases where reclaim is possible, but
>>> with both approaches we could cover quite a lot that were discovered
>>> to really result in a lot of empty page tables.
>>

[...]

>>>
>>> I pushed it to
>>>       https://github.com/davidhildenbrand/linux/tree/page_table_reclaim
>>>
>>> I suspect it's a non-working version (and I assume the locking is broken,
>>> there are no VMA checks, etc), it's an old prototype. Just to give you an
>>> idea about the lockless scanning and how I started by triggering reclaim
>>> only when kicked-off by user space.
>>
>> Many thanks! But I'm worried that on some platforms disabling IRQs
>> might be more expensive than holding the lock, such as arm64? Not sure.
>
> Scanning completely locklessly (no mmap lock, no PT locks) means that --
> as long as there is not much to reclaim (for most workloads the common
> case!) -- you would not affect the workload at all.
>
> Take a look at the khugepaged logic that does mmap_read_trylock(mm) and
> makes sure to drop the mmap lock frequently due to
> khugepaged_pages_to_scan, to not affect the workload too much while
> scanning.
>

OK, I will take a look.

>>
>>>
>>>>
>>>>>
>>>>> We'll have to double check whether all anon memory cases can
>>>>> *properly*
>>>>> handle pte_offset_map_lock() failing (not just handling it, but doing
>>>>> the right thing; most of that anon-only code didn't ever run into that
>>>>> issue so far, so these code paths were likely never triggered).
>>>>
>>>> Yeah, I'll keep checking this out too.
>>>>
>>>>>
>>>>>
>>>>>> For the path that will also free PTE pages in THP, we need to recheck
>>>>>> whether the
>>>>>> content of pmd entry is valid after holding pmd lock or pte lock.
>>>>>>
>>>>>> 4. TODO
>>>>>> =======
>>>>>>
>>>>>> Some applications may be concerned about the overhead of scanning and
>>>>>> rebuilding
>>>>>> page tables, so the following features are considered for
>>>>>> implementation in the
>>>>>> future:
>>>>>>
>>>>>>        - add per-process switch (via prctl)
>>>>>>        - add a madvise option (like THP)
>>>>>>        - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>>>>> procfs file)
>>>>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>>>>> which would
>>>>>> help improve the scanning speed.
>>>>>
>>>>> I didn't like the added complexity last time, and the problem of
>>>>> handling situations where we squeeze multiple page tables into a
>>>>> single
>>>>> "struct page".
>>>>
>>>> OK, except for refcount, do you think the other three todos above are
>>>> still worth doing?
>>>
>>> I think the question is from where we start: for example, only
>>> synchronous
>>> reclaim vs. asynchronous reclaim. Synchronous reclaim won't really affect
>>> workloads that do not actively trigger it, so it raises a lot less
>>> eyebrows. ...
>>> and some user space might have a good idea where it makes sense to
>>> try to
>>> reclaim, and when.
>>>
>>> So the other things you note here rather affect asynchronous reclaim,
>>> and
>>> might be reasonable in that context. But not sure if we should start
>>> with doing
>>> things asynchronously.
>>
>> I think synchronous and asynchronous have their own advantages and
>> disadvantages, and are complementary. Perhaps they can be implemented at
>> the same time?
>
>
> No strong opinion; something synchronous sounds to me like the
> low-hanging fruit that could add the infrastructure to be used by
> something more advanced/asynchronous :)

Got it, I will try to do the following in the next version.

a. for the MADV_DONTNEED case, try synchronous reclaim as you said

b. for the MADV_FREE case:

   - add a madvise option for synchronous reclaim

   - add another madvise option to mark the vma, add its corresponding
     mm to a global list, and then traverse that list and reclaim page
     tables when memory is tight and we enter the system reclaim path
     (maybe also an option to unmark)

c. for the s390 case you mentioned, introduce a CONFIG_FREE_PT option
   first; s390 will not select it until the problem is solved

d. for the lockless scan, try either disabling IRQs or using
   (mmap read lock + pte_offset_map_nolock); a rough sketch of the
   per-pmd reclaim step follows below
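
As a reference for (a) and (d), a rough sketch of the per-pmd reclaim step
under the locking scheme discussed in this thread (mmap read lock held by
the caller; vma filtering, error handling and statistics omitted; this is
not the actual patch code):

static void try_reclaim_pte_page(struct vm_area_struct *vma, pmd_t *pmd,
				 unsigned long addr)
{
	struct mm_struct *mm = vma->vm_mm;
	pmd_t pmdval = pmdp_get_lockless(pmd);
	spinlock_t *pml, *ptl;
	pte_t *pte;
	int i;

	/* Peek at the PTE page without taking the pte lock. */
	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
	if (!pte)
		return;

	pml = pmd_lock(mm, pmd);
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

	/* Recheck: the pmd may have changed before we took the locks. */
	if (!pmd_same(pmdval, pmdp_get_lockless(pmd)))
		goto out;

	/* Only strictly empty PTE pages may be reclaimed. */
	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (!pte_none(ptep_get(pte + i)))
			goto out;
	}

	pmd_clear(pmd);
	flush_tlb_range(vma, addr & PMD_MASK, (addr & PMD_MASK) + PMD_SIZE);
	/* Readers rely on RCU via pte_offset_map_lock(), so defer the free. */
	pte_free_defer(mm, pmd_pgtable(pmdval));
out:
	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);
	pte_unmap(pte);
}
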

Thanks,
Qi

>