2022-08-25 10:23:15

by Qi Zheng

Subject: [RFC PATCH 0/7] Try to free empty and zero user PTE page table pages

Hi,

Previously, in order to free empty user PTE page table pages, I posted two
patch sets implementing two different solutions:
- atomic refcount version:
https://lore.kernel.org/lkml/[email protected]/
- percpu refcount version:
https://lore.kernel.org/lkml/[email protected]/

Both patch sets have the following behavior:
a. Protect the page table walker by hooking pte_offset_map{_lock}() and
pte_unmap{_unlock}()
b. Automatically reclaim PTE page table pages outside the reclaim path

For behavior a, there may be the following disadvantages mentioned by
David Hildenbrand:
- It introduces a lot of complexity. It's not something easy to get in and most
probably not easy to get out again
- It is inconvenient to extend to other architectures. For example, for the
contiguous PTEs of arm64, the pointer to the PTE entry is obtained directly
through pte_offset_kernel() instead of pte_offset_map{_lock}()
- It has been found that pte_unmap() is missing in some places that only
execute on 64-bit systems, which is a disaster for pte_refcount

For behavior b, it may not be necessary to actively reclaim PTE pages, especially
when memory pressure is not high, and deferring to the reclaim path may be a
better choice.

In addition, the above two solutions only handle empty PTE pages (a PTE page
where all entries are empty), and do not deal with the zero PTE page (a PTE
page where all page table entries map the shared zero page) mentioned by
David Hildenbrand:
"Especially the shared zeropage is nasty, because there are
sane use cases that can trigger it. Assume you have a VM
(e.g., QEMU) that inflated the balloon to return free memory
to the hypervisor.

Simply migrating that VM will populate the shared zeropage to
all inflated pages, because migration code ends up reading all
VM memory. Similarly, the guest can just read that memory as
well, for example, when the guest issues kdump itself."

The purpose of this RFC patch is to continue the discussion and fix the above
issues. The following is the solution to be discussed.

In order to quickly identify the above two types of PTE pages, we still
introduce a pte_refcount for each PTE page. We pack the mapped and zero PTE
entry counters into the pte_refcount of the PTE page. The bit layout is as
follows:

- bits 0-9 are mapped PTE entry count
- bits 10-19 are zero PTE entry count

In this way, when mapped PTE entry count is 0, we can know that the current PTE
page is an empty PTE page, and when zero PTE entry count is PTRS_PER_PTE, we can
know that the current PTE page is a zero PTE page.

We only update the pte_refcount when setting or clearing a PTE entry, and
since both operations are protected by the pte lock, pte_refcount can be a
non-atomic variable with little performance overhead.
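
As a rough user-space model of the counter described above (this is not the
code from the series): the names pte_ref_model, model_track_set() and
model_track_clear() are hypothetical, PTRS_PER_PTE is assumed to be 512 as on
x86_64 with 4K pages, and the model assumes that a zero page mapping bumps
both the mapped count and the zero count.

```c
#include <assert.h>
#include <stdbool.h>

#define PTRS_PER_PTE      512UL	/* assumption: x86_64 with 4K pages */
#define PTE_MAPPED_BITS   10
#define PTE_MAPPED_SHIFT  0
#define PTE_ZERO_SHIFT    (PTE_MAPPED_SHIFT + PTE_MAPPED_BITS)
#define PTE_FIELD_MASK    ((1UL << PTE_MAPPED_BITS) - 1)
#define PTE_MAPPED_OFFSET (1UL << PTE_MAPPED_SHIFT)
#define PTE_ZERO_OFFSET   (1UL << PTE_ZERO_SHIFT)

/* Hypothetical stand-in for the pte_refcount field of a PTE page. */
struct pte_ref_model {
	unsigned long refcount;
};

/* Bits 0-9: number of mapped entries. */
static unsigned long model_mapped_count(const struct pte_ref_model *p)
{
	return (p->refcount >> PTE_MAPPED_SHIFT) & PTE_FIELD_MASK;
}

/* Bits 10-19: number of entries that map the shared zero page. */
static unsigned long model_zero_count(const struct pte_ref_model *p)
{
	return (p->refcount >> PTE_ZERO_SHIFT) & PTE_FIELD_MASK;
}

/* The caller is assumed to hold the pte lock, so plain arithmetic is enough. */
static void model_track_set(struct pte_ref_model *p, bool maps_zero_page)
{
	p->refcount += PTE_MAPPED_OFFSET;
	if (maps_zero_page)
		p->refcount += PTE_ZERO_OFFSET;
}

static void model_track_clear(struct pte_ref_model *p, bool mapped_zero_page)
{
	assert(model_mapped_count(p) > 0);	/* clearing an entry that was never set */
	p->refcount -= PTE_MAPPED_OFFSET;
	if (mapped_zero_page)
		p->refcount -= PTE_ZERO_OFFSET;
}

/* Empty PTE page: no entry is mapped. */
static bool model_is_empty(const struct pte_ref_model *p)
{
	return model_mapped_count(p) == 0;
}

/* Zero PTE page: every entry maps the shared zero page. */
static bool model_is_zero(const struct pte_ref_model *p)
{
	return model_zero_count(p) == PTRS_PER_PTE;
}
```

Since PTRS_PER_PTE (512) fits in a 10-bit field, neither counter can carry
into its neighbour in this model.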

To exclude page table walkers, we hold the write lock of mmap_lock when doing
pmd_clear() (in the newly added path to reclaim PTE pages).

[RFC PATCH 7/7] is an example of reclaiming the empty and zero PTE pages of a
process. But the best time to reclaim should be in the reclaim path, such as
just before waking up the OOM killer. At that point, the system cannot reclaim
any more memory. Compared with killing a process, it is more acceptable to hold
the write lock of mmap_lock to reclaim memory by freeing empty and zero PTE
pages.

My idea is to count the number of bytes (mm->reclaimable_pt_bytes, similar to
mm->pgtables_bytes) of reclaimable PTE pages (both empty and zero PTE pages)
in each mm, and maintain an rbtree with mm->reclaimable_pt_bytes as the key, so
that we can pick the mm with the largest mm->reclaimable_pt_bytes in the
reclaim path.
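
A minimal sketch of that selection step, for discussion only. The struct
mm_model type and pick_reclaim_target() are hypothetical, and a linear scan
over an array stands in for taking the rightmost node of an rbtree keyed by
reclaimable_pt_bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mm accounting proposed above. */
struct mm_model {
	int id;
	unsigned long reclaimable_pt_bytes;
};

/*
 * Return the entry with the most reclaimable page table memory, or NULL if
 * nothing is reclaimable. The linear scan is a placeholder for looking up
 * the largest key in an rbtree.
 */
static const struct mm_model *pick_reclaim_target(const struct mm_model *mms,
						  size_t n)
{
	const struct mm_model *best = NULL;

	assert(mms != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (mms[i].reclaimable_pt_bytes == 0)
			continue;	/* nothing to gain from this mm */
		if (!best ||
		    mms[i].reclaimable_pt_bytes > best->reclaimable_pt_bytes)
			best = &mms[i];
	}
	return best;
}
```

An ordered structure would make the lookup cheap in the reclaim path, at the
cost of re-inserting the mm whenever its key changes.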

This series is based on v5.19.

Comments and suggestions are welcome.

Thanks,
Qi

Qi Zheng (7):
mm: use ptep_clear() in non-present cases
mm: introduce CONFIG_FREE_USER_PTE
mm: add pte_to_page() helper
mm: introduce pte_refcount for user PTE page table page
pte_ref: add track_pte_{set, clear}() helper
x86/mm: add x86_64 support for pte_ref
mm: add proc interface to free user PTE page table pages

arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 4 +
include/linux/mm.h | 2 +
include/linux/mm_types.h | 1 +
include/linux/pgtable.h | 11 +-
include/linux/pte_ref.h | 41 ++++++
kernel/sysctl.c | 12 ++
mm/Kconfig | 11 ++
mm/Makefile | 2 +-
mm/memory.c | 2 +-
mm/mprotect.c | 2 +-
mm/pte_ref.c | 234 +++++++++++++++++++++++++++++++++
12 files changed, 319 insertions(+), 4 deletions(-)
create mode 100644 include/linux/pte_ref.h
create mode 100644 mm/pte_ref.c

--
2.20.1


2022-08-25 10:24:23

by Qi Zheng

Subject: [RFC PATCH 4/7] mm: introduce pte_refcount for user PTE page table page

The following is the largest user PTE page table memory that
can be allocated by a single user process in a 32-bit and a
64-bit system (assuming 4K page size).

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+
(for 32-bit, take 3G user address space as an example;
for 64-bit, take 48-bit address width as an example.)
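
The PTE row and the 64-bit PMD figure can be reproduced with a small
calculation. This is only a sketch: table_bytes() is a hypothetical helper, and
it assumes that every table occupies exactly one page, with 4-byte entries on
32-bit (two-level layout) and 8-byte entries on 64-bit. The 32-bit PMD figure
is not reproduced here, since it depends on how the upper level is folded.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of page table memory needed at a given level to map `span` bytes.
 * level 1 is the PTE level, level 2 is the PMD level. Each table is one
 * page of `page_size` bytes holding entries of `entry_size` bytes.
 */
static uint64_t table_bytes(uint64_t span, uint64_t page_size,
			    uint64_t entry_size, unsigned int level)
{
	uint64_t entries = page_size / entry_size;
	uint64_t covered = page_size;	/* bytes mapped by one table */

	assert(level >= 1);
	for (unsigned int i = 0; i < level; i++)
		covered *= entries;
	assert(span % covered == 0);

	return span / covered * page_size;
}
```

For example, table_bytes(1ULL << 48, 4096, 8, 1) covers 2 MiB per PTE page,
which gives 2^27 PTE pages of 4 KiB each.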

Today, 64-bit servers generally have only a few terabytes of
physical memory, and mapping this memory does not require as
many PTE page tables as above, but in some of the following
scenarios, it is still possible to cause huge page table memory
usage.

1. In pursuit of high performance, applications mostly use
high-performance user-mode memory allocators, such as
jemalloc or tcmalloc. These memory allocators use
madvise(MADV_DONTNEED or MADV_FREE) to release physical memory,
but neither MADV_DONTNEED nor MADV_FREE releases page table
memory, which may lead to huge page table memory usage such as:

VIRT: 55t
RES: 590g
VmPTE: 110g

In this case, most of the page table entries are empty. We call a
PTE page in which all entries are empty an empty PTE page.

2. The shared zero page scenario mentioned by David Hildenbrand:

Especially the shared zeropage is nasty, because there are
sane use cases that can trigger it. Assume you have a VM
(e.g., QEMU) that inflated the balloon to return free memory
to the hypervisor.

Simply migrating that VM will populate the shared zeropage to
all inflated pages, because migration code ends up reading all
VM memory. Similarly, the guest can just read that memory as
well, for example, when the guest issues kdump itself.

In this case, most of the page table entries map the shared zero
page. We call a PTE page in which all page table entries map the
zero page a zero PTE page.
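
Scenario 1 can be observed from user space with something like the sketch
below. The helper names vmpte_kb() and madvise_demo() are made up for this
example, a 4K page size is assumed, and the VmPTE values depend on the kernel
configuration (transparent huge pages, for instance, can change them), so no
specific numbers are claimed here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Read VmPTE (in kB) from /proc/self/status; -1 if it is unavailable. */
static long vmpte_kb(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long kb = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

/*
 * Touch one byte in every page of an anonymous mapping, then drop the pages
 * with MADV_DONTNEED. VmPTE is sampled before the touch, after the touch and
 * after the madvise() call into out[0..2]. Returns 0 on success, -1 on error.
 */
static int madvise_demo(size_t len, long out[3])
{
	char *p;

	assert(out != NULL && len % 4096 == 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	out[0] = vmpte_kb();
	for (size_t off = 0; off < len; off += 4096)
		p[off] = 1;		/* fault the page in */
	out[1] = vmpte_kb();

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}
	out[2] = vmpte_kb();		/* sampled while the VMA still exists */

	munmap(p, len);
	return 0;
}
```

Comparing out[1] and out[2] shows whether the page table memory was given
back together with the physical pages.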

The page table entries for both types of PTE pages do not record
"meaningful" information, so we can try to free these PTE pages at
some point (such as when memory pressure is high) to reclaim more
memory.

To quickly identify these two types of pages, we introduce a
pte_refcount for each PTE page. We pack the mapped and zero PTE entry
counters into the pte_refcount of the PTE page. The bit layout is as
follows:

- bits 0-9 are mapped PTE entry count
- bits 10-19 are zero PTE entry count

Because the mapping and unmapping of PTE entries are under pte_lock,
there is no concurrent thread to modify pte_refcount, so pte_refcount
can be a non-atomic variable with little performance overhead.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm.h | 2 ++
include/linux/mm_types.h | 1 +
include/linux/pte_ref.h | 23 +++++++++++++
mm/Makefile | 2 +-
mm/pte_ref.c | 72 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 99 insertions(+), 1 deletion(-)
create mode 100644 include/linux/pte_ref.h
create mode 100644 mm/pte_ref.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7898e29bcfb5..23e2f1e75b4b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -28,6 +28,7 @@
#include <linux/sched.h>
#include <linux/pgtable.h>
#include <linux/kasan.h>
+#include <linux/pte_ref.h>

struct mempolicy;
struct anon_vma;
@@ -2336,6 +2337,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
return false;
__SetPageTable(page);
inc_lruvec_page_state(page, NR_PAGETABLE);
+ pte_ref_init(page);
return true;
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c29ab4c0cd5c..da2738f87737 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -153,6 +153,7 @@ struct page {
union {
struct mm_struct *pt_mm; /* x86 pgds only */
atomic_t pt_frag_refcount; /* powerpc */
+ unsigned long pte_refcount; /* only for PTE page */
};
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..db14e03e1dff
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * Author: Qi Zheng <[email protected]>
+ */
+
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#ifdef CONFIG_FREE_USER_PTE
+
+void pte_ref_init(pgtable_t pte);
+
+#else /* !CONFIG_FREE_USER_PTE */
+
+static inline void pte_ref_init(pgtable_t pte)
+{
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
+
+#endif /* _LINUX_PTE_REF_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..f8fa5078a13d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
mm_init.o percpu.o slab_common.o \
compaction.o vmacache.o \
interval_tree.o list_lru.o workingset.o \
- debug.o gup.o mmap_lock.o $(mmu-y)
+ debug.o gup.o mmap_lock.o $(mmu-y) pte_ref.o

# Give 'page_alloc' its own module-parameter namespace
page-alloc-y := page_alloc.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..12b27646e88c
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * Author: Qi Zheng <[email protected]>
+ */
+#include <linux/pgtable.h>
+#include <linux/pte_ref.h>
+
+#ifdef CONFIG_FREE_USER_PTE
+
+/*
+ * For a PTE page where all entries are empty, we call it empty PTE page. For a
+ * PTE page where all page table entries are mapped to zero pages, we call it
+ * zero PTE page.
+ *
+ * The page table entries for both types of PTE pages do not record "meaningful"
+ * information, so we can try to free these PTE pages at some point (such as
+ * when memory pressure is high) to reclaim more memory.
+ *
+ * We put the mapped and zero PTE entry counter into the pte_refcount of the
+ * PTE page. The bitmask has the following meaning:
+ *
+ * - bits 0-9 are mapped PTE entry count
+ * - bits 10-19 are zero PTE entry count
+ *
+ * Because the mapping and unmapping of PTE entries are under pte_lock, there is
+ * no concurrent thread to modify pte_refcount, so pte_refcount can be a
+ * non-atomic variable with little performance overhead.
+ */
+#define PTE_MAPPED_BITS 10
+#define PTE_ZERO_BITS 10
+
+#define PTE_MAPPED_SHIFT 0
+#define PTE_ZERO_SHIFT (PTE_MAPPED_SHIFT + PTE_MAPPED_BITS)
+
+#define __PTE_REF_MASK(x) ((1UL << (x))-1)
+
+#define PTE_MAPPED_MASK (__PTE_REF_MASK(PTE_MAPPED_BITS) << PTE_MAPPED_SHIFT)
+#define PTE_ZERO_MASK (__PTE_REF_MASK(PTE_ZERO_BITS) << PTE_ZERO_SHIFT)
+
+#define PTE_MAPPED_OFFSET (1UL << PTE_MAPPED_SHIFT)
+#define PTE_ZERO_OFFSET (1UL << PTE_ZERO_SHIFT)
+
+static inline unsigned long pte_refcount(pgtable_t pte)
+{
+ return pte->pte_refcount;
+}
+
+#define pte_mapped_count(pte) \
+ ((pte_refcount(pte) & PTE_MAPPED_MASK) >> PTE_MAPPED_SHIFT)
+#define pte_zero_count(pte) \
+ ((pte_refcount(pte) & PTE_ZERO_MASK) >> PTE_ZERO_SHIFT)
+
+static __always_inline void pte_refcount_add(struct mm_struct *mm,
+ pgtable_t pte, int val)
+{
+ pte->pte_refcount += val;
+}
+
+static __always_inline void pte_refcount_sub(struct mm_struct *mm,
+ pgtable_t pte, int val)
+{
+ pte->pte_refcount -= val;
+}
+
+void pte_ref_init(pgtable_t pte)
+{
+ pte->pte_refcount = 0;
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
--
2.20.1

2022-08-25 10:25:32

by Qi Zheng

Subject: [RFC PATCH 7/7] mm: add proc interface to free user PTE page table pages

Add a /proc/sys/vm/free_ptes file to procfs. When a pid is written
to the file, we traverse the address space of that process and find
and free its empty PTE pages and zero PTE pages.
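
From user space, triggering this could look roughly like the sketch below.
request_free_ptes() is a hypothetical helper, the path only exists on a kernel
built with this patch and CONFIG_FREE_USER_PTE, and since the file mode is
0200 the caller presumably needs to be root.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Path from the description above; present only with this patch applied. */
#define FREE_PTES_PATH "/proc/sys/vm/free_ptes"

/* Write `pid` in decimal to `path`. Returns 0 on success, -1 on failure. */
static int request_free_ptes(const char *path, pid_t pid)
{
	FILE *f;
	int ok;

	assert(path != NULL && pid > 0);
	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%ld\n", (long)pid) > 0;
	/* The buffered data reaches the file when the stream is closed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would use request_free_ptes(FREE_PTES_PATH, pid) and treat a
failure as "nothing was reclaimed".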

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/pte_ref.h | 5 ++
kernel/sysctl.c | 12 ++++
mm/pte_ref.c | 126 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 143 insertions(+)

diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index ab49c7fac120..f7e244129291 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -16,6 +16,11 @@ void track_pte_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte);
void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte);
+
+int free_ptes_sysctl_handler(struct ctl_table *table, int write,
+ void *buffer, size_t *length, loff_t *ppos);
+extern int sysctl_free_ptes_pid;
+
#else /* !CONFIG_FREE_USER_PTE */

static inline void pte_ref_init(pgtable_t pte)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 35d034219513..14e1a9841cb8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -64,6 +64,7 @@
#include <linux/mount.h>
#include <linux/userfaultfd_k.h>
#include <linux/pid.h>
+#include <linux/pte_ref.h>

#include "../lib/kstrtox.h"

@@ -2153,6 +2154,17 @@ static struct ctl_table vm_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = SYSCTL_FOUR,
},
+#ifdef CONFIG_FREE_USER_PTE
+ {
+ .procname = "free_ptes",
+ .data = &sysctl_free_ptes_pid,
+ .maxlen = sizeof(int),
+ .mode = 0200,
+ .proc_handler = free_ptes_sysctl_handler,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
+ },
+#endif
#ifdef CONFIG_COMPACTION
{
.procname = "compact_memory",
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 818821d068af..e7080a3100a6 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -6,6 +6,14 @@
*/
#include <linux/pgtable.h>
#include <linux/pte_ref.h>
+#include <linux/mm.h>
+#include <linux/pagewalk.h>
+#include <linux/sched/mm.h>
+#include <linux/jump_label.h>
+#include <linux/hugetlb.h>
+#include <asm/tlbflush.h>
+
+#include "internal.h"

#ifdef CONFIG_FREE_USER_PTE

@@ -105,4 +113,122 @@ void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
}
EXPORT_SYMBOL(track_pte_clear);

+#ifdef CONFIG_DEBUG_VM
+void pte_free_debug(pmd_t pmd)
+{
+ pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd);
+ int i = 0;
+
+ for (i = 0; i < PTRS_PER_PTE; i++, ptep++) {
+ pte_t pte = *ptep;
+ BUG_ON(!(pte_none(pte) || is_zero_pfn(pte_pfn(pte))));
+ }
+}
+#else
+static inline void pte_free_debug(pmd_t pmd)
+{
+}
+#endif
+
+
+static int kfreeptd_pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pmd_t pmdval;
+ pgtable_t page;
+ struct mm_struct *mm = walk->mm;
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ spinlock_t *ptl;
+ bool free = false;
+ unsigned long haddr = addr & PMD_MASK;
+
+ if (pmd_trans_unstable(pmd))
+ goto out;
+
+ mmap_read_unlock(mm);
+ mmap_write_lock(mm);
+
+ if (mm_find_pmd(mm, addr) != pmd)
+ goto unlock_out;
+
+ ptl = pmd_lock(mm, pmd);
+ pmdval = *pmd;
+ if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
+ spin_unlock(ptl);
+ goto unlock_out;
+ }
+ page = pmd_pgtable(pmdval);
+ if (!pte_mapped_count(page) || pte_zero_count(page) == PTRS_PER_PTE) {
+ pmd_clear(pmd);
+ flush_tlb_range(&vma, haddr, haddr + PMD_SIZE);
+ free = true;
+ }
+ spin_unlock(ptl);
+
+unlock_out:
+ mmap_write_unlock(mm);
+ mmap_read_lock(mm);
+
+ if (free) {
+ pte_free_debug(pmdval);
+ mm_dec_nr_ptes(mm);
+ pgtable_pte_page_dtor(page);
+ __free_page(page);
+ }
+
+out:
+ cond_resched();
+ return 0;
+}
+
+static const struct mm_walk_ops kfreeptd_walk_ops = {
+ .pmd_entry = kfreeptd_pmd_entry,
+};
+
+int sysctl_free_ptes_pid;
+int free_ptes_sysctl_handler(struct ctl_table *table, int write,
+ void *buffer, size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ return ret;
+ if (write) {
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ rcu_read_lock();
+ task = find_task_by_vpid(sysctl_free_ptes_pid);
+ if (!task) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+ mm = get_task_mm(task);
+ rcu_read_unlock();
+
+ if (!mm) {
+ /* get_task_mm() returned NULL: there is no reference to drop */
+ return -ESRCH;
+ }
+
+ do {
+ ret = -EBUSY;
+
+ if (mmap_read_trylock(mm)) {
+ ret = walk_page_range(mm, FIRST_USER_ADDRESS,
+ ULONG_MAX,
+ &kfreeptd_walk_ops, NULL);
+
+ mmap_read_unlock(mm);
+ }
+
+ cond_resched();
+ } while (ret == -EAGAIN);
+
+ mmput(mm);
+ }
+ return ret;
+}
+
#endif /* CONFIG_FREE_USER_PTE */
--
2.20.1

2022-08-25 11:09:45

by Qi Zheng

Subject: [RFC PATCH 2/7] mm: introduce CONFIG_FREE_USER_PTE

This configuration variable will be used to build the code needed to
free user PTE page table pages.

The PTE page table setting and clearing functions (such as set_pte_at())
live in architecture-specific files, and these functions will be hooked to
implement FREE_USER_PTE, so architecture support is needed.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/Kconfig | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..d2a5a24cee2d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1130,6 +1130,17 @@ config PTE_MARKER_UFFD_WP
purposes. It is required to enable userfaultfd write protection on
file-backed memory types like shmem and hugetlbfs.

+config ARCH_SUPPORTS_FREE_USER_PTE
+ def_bool n
+
+config FREE_USER_PTE
+ bool "Free user PTE page table pages"
+ default y
+ depends on ARCH_SUPPORTS_FREE_USER_PTE && MMU && SMP
+ help
+ Try to free a user PTE page table page when all of its entries are
+ none or map the shared zero page.
+
source "mm/damon/Kconfig"

endmenu
--
2.20.1

2022-08-25 11:10:53

by Qi Zheng

Subject: [RFC PATCH 6/7] x86/mm: add x86_64 support for pte_ref

Add pte_ref hooks into routines that modify user PTE page tables,
and select ARCH_SUPPORTS_FREE_USER_PTE, so that the pte_ref code
can be compiled and work on this architecture.

Signed-off-by: Qi Zheng <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 4 ++++
include/linux/pgtable.h | 1 +
3 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 52a7f91527fe..50215b05723e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+ select ARCH_SUPPORTS_FREE_USER_PTE

config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..cbfcfa497fb9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -23,6 +23,7 @@
#include <asm/coco.h>
#include <asm-generic/pgtable_uffd.h>
#include <linux/page_table_check.h>
+#include <linux/pte_ref.h>

extern pgd_t early_top_pgt[PTRS_PER_PGD];
bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -1005,6 +1006,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte)
{
page_table_check_pte_set(mm, addr, ptep, pte);
+ track_pte_set(mm, addr, ptep, pte);
set_pte(ptep, pte);
}

@@ -1050,6 +1052,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
{
pte_t pte = native_ptep_get_and_clear(ptep);
page_table_check_pte_clear(mm, addr, pte);
+ track_pte_clear(mm, addr, ptep, pte);
return pte;
}

@@ -1066,6 +1069,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
*/
pte = native_local_ptep_get_and_clear(ptep);
page_table_check_pte_clear(mm, addr, pte);
+ track_pte_clear(mm, addr, ptep, pte);
} else {
pte = ptep_get_and_clear(mm, addr, ptep);
}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c4a6bda6e965..908636f48c95 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -276,6 +276,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
pte_t pte = *ptep;
pte_clear(mm, address, ptep);
page_table_check_pte_clear(mm, address, pte);
+ track_pte_clear(mm, address, ptep, pte);
return pte;
}
#endif
--
2.20.1

2022-08-29 11:03:03

by David Hildenbrand

Subject: Re: [RFC PATCH 0/7] Try to free empty and zero user PTE page table pages

On 25.08.22 12:10, Qi Zheng wrote:
> Hi,
>
> Before this, in order to free empty user PTE page table pages, I posted the
> following patch sets of two solutions:
> - atomic refcount version:
> https://lore.kernel.org/lkml/[email protected]/
> - percpu refcount version:
> https://lore.kernel.org/lkml/[email protected]/
>
> Both patch sets have the following behavior:
> a. Protect the page table walker by hooking pte_offset_map{_lock}() and
> pte_unmap{_unlock}()
> b. Will automatically reclaim PTE page table pages in the non-reclaiming path
>
> For behavior a, there may be the following disadvantages mentioned by
> David Hildenbrand:
> - It introduces a lot of complexity. It's not something easy to get in and most
> probably not easy to get out again
> - It is inconvenient to extend to other architectures. For example, for the
> continuous ptes of arm64, the pointer to the PTE entry is obtained directly
> through pte_offset_kernel() instead of pte_offset_map{_lock}()
> - It has been found that pte_unmap() is missing in some places that only
> execute on 64-bit systems, which is a disaster for pte_refcount
>
> For behavior b, it may not be necessary to actively reclaim PTE pages, especially
> when memory pressure is not high, and deferring to the reclaim path may be a
> better choice.
>
> In addition, the above two solutions are only for empty PTE pages (a PTE page
> where all entries are empty), and do not deal with the zero PTE page ( a PTE
> page where all page table entries are mapped to shared zero page) mentioned by
> David Hildenbrand:
> "Especially the shared zeropage is nasty, because there are
> sane use cases that can trigger it. Assume you have a VM
> (e.g., QEMU) that inflated the balloon to return free memory
> to the hypervisor.
>
> Simply migrating that VM will populate the shared zeropage to
> all inflated pages, because migration code ends up reading all
> VM memory. Similarly, the guest can just read that memory as
> well, for example, when the guest issues kdump itself."
>
> The purpose of this RFC patch is to continue the discussion and fix the above
> issues. The following is the solution to be discussed.

Thanks for providing an alternative! It's certainly easier to digest :)

>
> In order to quickly identify the above two types of PTE pages, we still
> introduced a pte_refcount for each PTE page. We put the mapped and zero PTE
> entry counter into the pte_refcount of the PTE page. The bitmask has the
> following meaning:
>
> - bits 0-9 are mapped PTE entry count
> - bits 10-19 are zero PTE entry count

I guess we could factor the zero PTE change out, to have an even simpler
first version. The issue is that some features (userfaultfd) don't
expect page faults when something was already mapped previously.

PTE markers as introduced by Peter might require a thought -- we don't
have anything mapped but do have additional information that we have to
maintain.

>
> In this way, when mapped PTE entry count is 0, we can know that the current PTE
> page is an empty PTE page, and when zero PTE entry count is PTRS_PER_PTE, we can
> know that the current PTE page is a zero PTE page.
>
> We only update the pte_refcount when setting and clearing of PTE entry, and
> since they are both protected by pte lock, pte_refcount can be a non-atomic
> variable with little performance overhead.
>
> For page table walker, we mutually exclusive it by holding write lock of
> mmap_lock when doing pmd_clear() (in the newly added path to reclaim PTE pages).

I recall when I played with that idea that the mmap_lock is not
sufficient to rip out a page table. IIRC, we also have to hold the rmap
lock(s), to prevent RMAP walkers from still using the page table.

Especially if multiple VMAs intersect a page table, things might get
tricky, because multiple rmap locks could be involved.

We might want/need another mechanism to synchronize against page table
walkers.

--
Thanks,

David / dhildenb

2022-08-29 14:48:31

by Qi Zheng

Subject: Re: [RFC PATCH 0/7] Try to free empty and zero user PTE page table pages



On 2022/8/29 18:09, David Hildenbrand wrote:
> On 25.08.22 12:10, Qi Zheng wrote:
>> Hi,
>>
>> Before this, in order to free empty user PTE page table pages, I posted the
>> following patch sets of two solutions:
>> - atomic refcount version:
>> https://lore.kernel.org/lkml/[email protected]/
>> - percpu refcount version:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Both patch sets have the following behavior:
>> a. Protect the page table walker by hooking pte_offset_map{_lock}() and
>> pte_unmap{_unlock}()
>> b. Will automatically reclaim PTE page table pages in the non-reclaiming path
>>
>> For behavior a, there may be the following disadvantages mentioned by
>> David Hildenbrand:
>> - It introduces a lot of complexity. It's not something easy to get in and most
>> probably not easy to get out again
>> - It is inconvenient to extend to other architectures. For example, for the
>> continuous ptes of arm64, the pointer to the PTE entry is obtained directly
>> through pte_offset_kernel() instead of pte_offset_map{_lock}()
>> - It has been found that pte_unmap() is missing in some places that only
>> execute on 64-bit systems, which is a disaster for pte_refcount
>>
>> For behavior b, it may not be necessary to actively reclaim PTE pages, especially
>> when memory pressure is not high, and deferring to the reclaim path may be a
>> better choice.
>>
>> In addition, the above two solutions are only for empty PTE pages (a PTE page
>> where all entries are empty), and do not deal with the zero PTE page ( a PTE
>> page where all page table entries are mapped to shared zero page) mentioned by
>> David Hildenbrand:
>> "Especially the shared zeropage is nasty, because there are
>> sane use cases that can trigger it. Assume you have a VM
>> (e.g., QEMU) that inflated the balloon to return free memory
>> to the hypervisor.
>>
>> Simply migrating that VM will populate the shared zeropage to
>> all inflated pages, because migration code ends up reading all
>> VM memory. Similarly, the guest can just read that memory as
>> well, for example, when the guest issues kdump itself."
>>
>> The purpose of this RFC patch is to continue the discussion and fix the above
>> issues. The following is the solution to be discussed.
>
> Thanks for providing an alternative! It's certainly easier to digest :)

Hi David,

Nice to see your reply.

>
>>
>> In order to quickly identify the above two types of PTE pages, we still
>> introduced a pte_refcount for each PTE page. We put the mapped and zero PTE
>> entry counter into the pte_refcount of the PTE page. The bitmask has the
>> following meaning:
>>
>> - bits 0-9 are mapped PTE entry count
>> - bits 10-19 are zero PTE entry count
>
> I guess we could factor the zero PTE change out, to have an even simpler
OK, we can deal with the empty PTE page case first.

> first version. The issue is that some features (userfaultfd) don't
> expect page faults when something was already mapped previously.
>
> PTE markers as introduced by Peter might require a thought -- we don't
> have anything mapped but do have additional information that we have to
> maintain.

As I understand it, a PTE marker entry is a non-present entry, not an
empty entry (pte_none()). So this case is already handled, which is
also what [RFC PATCH 1/7] does.

>
>>
>> In this way, when mapped PTE entry count is 0, we can know that the current PTE
>> page is an empty PTE page, and when zero PTE entry count is PTRS_PER_PTE, we can
>> know that the current PTE page is a zero PTE page.
>>
>> We only update the pte_refcount when setting and clearing of PTE entry, and
>> since they are both protected by pte lock, pte_refcount can be a non-atomic
>> variable with little performance overhead.
>>
>> For page table walker, we mutually exclusive it by holding write lock of
>> mmap_lock when doing pmd_clear() (in the newly added path to reclaim PTE pages).
>
> I recall when I played with that idea that the mmap_lock is not
> sufficient to rip out a page table. IIRC, we also have to hold the rmap
> lock(s), to prevent RMAP walkers from still using the page table.

Oh, I forgot this. We should also hold rmap lock(s) like
move_normal_pmd().

>
> Especially if multiple VMAs intersect a page table, things might get
> tricky, because multiple rmap locks could be involved.

Maybe we can iterate over the vma list and just process the 2M aligned
part?

>
> We might want/need another mechanism to synchronize against page table
> walkers.

This is a tricky problem, equivalent to narrowing the protection scope
of mmap_lock. Any preliminary ideas?

Thanks,
Qi

>

--
Thanks,
Qi