2021-11-10 08:42:37

by Qi Zheng

Subject: [PATCH v3 00/15] Free user PTE page table pages

Hi,

This patch series aims to free user PTE page table pages when all PTE entries
are empty.

The beginning of this story is that some malloc libraries (e.g. jemalloc or
tcmalloc) usually allocate a large amount of virtual address space with mmap()
and do not unmap it. When they want to release physical memory, they use
madvise(MADV_DONTNEED) instead. But madvise() does not free the page tables,
so a process that touches an enormous virtual address space can end up with a
huge number of PTE page tables.

The following figures are a memory usage snapshot of one process actually
observed on one of our servers:

VIRT: 55t
RES: 590g
VmPTE: 110g

As we can see, the PTE page tables occupy 110g while RES is only 590g. In
theory, the process only needs about 1.2g of PTE page tables to map that
physical memory (on x86-64, each 4 KiB PTE page table page maps 2 MiB, so
590g / 2 MiB * 4 KiB ~= 1.2g). The reason the PTE page tables occupy so much
memory is that madvise(MADV_DONTNEED) only clears the PTEs and frees the
physical memory, but doesn't free the PTE page table pages. So we can free
those empty PTE page tables to save memory. In the above case, we can save
about 108g (best case). And the larger the difference between VIRT and RES,
the more memory we save.

In this patch series, we add a pte_refcount field to the struct page of the
PTE page table page to track how many users it has. Similar to the page
refcount mechanism, a user of a PTE page table must hold a refcount on it
before accessing it. The PTE page table page is freed when the last refcount
is dropped.
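
Roughly speaking, the usage pattern is similar to get_page()/put_page(): any
path that accesses a user PTE page table takes a reference first and drops it
when it is done. The following is only an illustrative sketch (pte_try_get(),
pte_put() and the TRYGET_FAILED_* values are the helpers introduced later in
this series; treating any other return value as success is an assumption made
for this example, not a statement about the final API):

static void walk_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        switch (pte_try_get(pmd)) {
        case TRYGET_FAILED_NONE:        /* no PTE page table is installed */
        case TRYGET_FAILED_HUGE_PMD:    /* a huge pmd is mapped here */
        case TRYGET_FAILED_ZERO:        /* the PTE page table is being freed */
                return;
        default:                        /* we now hold a reference */
                break;
        }

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        /* ... safely read or modify *pte here ... */
        pte_unmap_unlock(pte, ptl);

        pte_put(mm, pmd, addr);         /* drop the reference taken above */
}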

Testing:

The following code snippet shows the effect of the optimization:

mmap 50G
while (1) {
        for (; i < 1024 * 25; i++) {
                touch 2M memory
                madvise MADV_DONTNEED 2M
        }
}
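
For reference, a minimal runnable C version of the above pseudocode (sizes and
iteration count taken from the snippet; error handling omitted) might look
like:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t chunk = 2UL << 20;               /* 2M */
        size_t size = 50UL << 30;               /* 50G */
        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        size_t i;

        while (1) {
                for (i = 0; i < 1024 * 25; i++) {
                        memset(p + i * chunk, 1, chunk);        /* touch 2M */
                        madvise(p + i * chunk, chunk, MADV_DONTNEED);
                }
        }
}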

As we can see, the memory usage of VmPTE is reduced:

            before        after
VIRT      50.0 GB      50.0 GB
RES        3.1 MB       3.6 MB
VmPTE   102640 kB       248 kB

I have also tested stability with LTP[1] for several weeks and have not seen
any crash so far.

Page fault performance can be affected by the allocation/freeing of PTE page
table pages. The following are the results of a micro benchmark[2]:

root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

threads     before (pf/min)      after (pf/min)
      1          32,085,255          31,880,833 (-0.64%)
      8         101,674,967         100,588,311 (-1.17%)
     16         113,207,000         112,801,832 (-0.36%)

(The "pfn/min" means how many page faults in one minute.)

Page fault performance is ~1% slower than before.

There are no obvious changes in the perf hot spots:

before:
19.29% [kernel] [k] clear_page_rep
16.12% [kernel] [k] do_user_addr_fault
9.57% [kernel] [k] _raw_spin_unlock_irqrestore
6.16% [kernel] [k] get_page_from_freelist
5.03% [kernel] [k] __handle_mm_fault
3.53% [kernel] [k] __rcu_read_unlock
3.45% [kernel] [k] handle_mm_fault
3.38% [kernel] [k] down_read_trylock
2.74% [kernel] [k] free_unref_page_list
2.17% [kernel] [k] up_read
1.93% [kernel] [k] charge_memcg
1.73% [kernel] [k] try_charge_memcg
1.71% [kernel] [k] __alloc_pages
1.69% [kernel] [k] ___perf_sw_event
1.44% [kernel] [k] get_mem_cgroup_from_mm

after:
18.19% [kernel] [k] clear_page_rep
16.28% [kernel] [k] do_user_addr_fault
8.39% [kernel] [k] _raw_spin_unlock_irqrestore
5.12% [kernel] [k] get_page_from_freelist
4.81% [kernel] [k] __handle_mm_fault
4.68% [kernel] [k] down_read_trylock
3.80% [kernel] [k] handle_mm_fault
3.59% [kernel] [k] get_mem_cgroup_from_mm
2.49% [kernel] [k] free_unref_page_list
2.41% [kernel] [k] up_read
2.16% [kernel] [k] charge_memcg
1.92% [kernel] [k] __rcu_read_unlock
1.88% [kernel] [k] ___perf_sw_event
1.70% [kernel] [k] pte_get_unless_zero

This series is based on next-20211108.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/lkml/[email protected]/2-multi-fault-all.c

Changelog in v2 -> v3:
- Refactored this patch series:
  - [PATCH v3 6/15]: Introduce the new dummy helpers first
  - [PATCH v3 7-12/15]: Convert each subsystem individually
  - [PATCH v3 13/15]: Implement the actual logic in the dummy helpers
  Thanks to David and Jason for their advice.
- Add a document.

Changelog in v1 -> v2:
- Change pte_install() to pmd_install().
- Fix some typos and code style problems.
- Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9], [PATCH v2 6/9]
  and [PATCH v2 7/9].

Qi Zheng (15):
mm: do code cleanups to filemap_map_pmd()
mm: introduce is_huge_pmd() helper
mm: move pte_offset_map_lock() to pgtable.h
mm: rework the parameter of lock_page_or_retry()
mm: add pmd_installed_type return for __pte_alloc() and other friends
mm: introduce refcount for user PTE page table page
mm/pte_ref: add support for user PTE page table page allocation
mm/pte_ref: initialize the refcount of the withdrawn PTE page table
page
mm/pte_ref: add support for the map/unmap of user PTE page table page
mm/pte_ref: add support for page fault path
mm/pte_ref: take a refcount before accessing the PTE page table page
mm/pte_ref: update the pmd entry in move_normal_pmd()
mm/pte_ref: free user PTE page table pages
Documentation: add document for pte_ref
mm/pte_ref: use mmu_gather to free PTE page table pages

Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
arch/x86/Kconfig | 2 +-
fs/proc/task_mmu.c | 24 +++-
fs/userfaultfd.c | 9 +-
include/linux/huge_mm.h | 10 +-
include/linux/mm.h | 170 ++++-------------------------
include/linux/mm_types.h | 6 +-
include/linux/pagemap.h | 8 +-
include/linux/pgtable.h | 152 +++++++++++++++++++++++++-
include/linux/pte_ref.h | 146 +++++++++++++++++++++++++
include/linux/rmap.h | 2 +
kernel/events/uprobes.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 4 +-
mm/damon/vaddr.c | 12 +-
mm/debug_vm_pgtable.c | 5 +-
mm/filemap.c | 45 +++++---
mm/gup.c | 25 ++++-
mm/hmm.c | 5 +-
mm/huge_memory.c | 3 +-
mm/internal.h | 4 +-
mm/khugepaged.c | 21 +++-
mm/ksm.c | 6 +-
mm/madvise.c | 21 +++-
mm/memcontrol.c | 12 +-
mm/memory-failure.c | 11 +-
mm/memory.c | 254 ++++++++++++++++++++++++++++++++-----------
mm/mempolicy.c | 6 +-
mm/migrate.c | 54 ++++-----
mm/mincore.c | 7 +-
mm/mlock.c | 1 +
mm/mmu_gather.c | 40 +++----
mm/mprotect.c | 11 +-
mm/mremap.c | 14 ++-
mm/page_vma_mapped.c | 4 +
mm/pagewalk.c | 15 ++-
mm/pgtable-generic.c | 1 +
mm/pte_ref.c | 141 ++++++++++++++++++++++++
mm/rmap.c | 10 ++
mm/swapfile.c | 3 +
mm/userfaultfd.c | 40 +++++--
41 files changed, 1186 insertions(+), 340 deletions(-)
create mode 100644 Documentation/vm/pte_ref.rst
create mode 100644 include/linux/pte_ref.h
create mode 100644 mm/pte_ref.c

--
2.11.0


2021-11-10 08:42:45

by Qi Zheng

Subject: [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd()

Currently the same few lines are repeated in two places in
filemap_map_pmd(). Deduplicate them and fix some code
style issues.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/filemap.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index daa0e23a6ee6..07c654202870 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3203,11 +3203,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
struct mm_struct *mm = vmf->vma->vm_mm;

/* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd)) {
- unlock_page(page);
- put_page(page);
- return true;
- }
+ if (pmd_trans_huge(*vmf->pmd))
+ goto out;

if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
vm_fault_t ret = do_set_pmd(vmf, page);
@@ -3222,13 +3219,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);

/* See comment in handle_pte_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd)) {
- unlock_page(page);
- put_page(page);
- return true;
- }
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ goto out;

return false;
+
+out:
+ unlock_page(page);
+ put_page(page);
+ return true;
}

static struct page *next_uptodate_page(struct page *page,
--
2.11.0

2021-11-10 08:43:02

by Qi Zheng

Subject: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper

Currently, the following check is repeated in several places in the
code:

is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)

It determines whether *pmd is a huge pmd, so introduce an
is_huge_pmd() helper to deduplicate these checks.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/huge_mm.h | 10 +++++++---
mm/huge_memory.c | 3 +--
mm/memory.c | 5 ++---
mm/mprotect.c | 2 +-
mm/mremap.c | 3 +--
5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..b37a89180846 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -199,8 +199,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
- || pmd_devmap(*____pmd)) \
+ if (is_huge_pmd(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
} while (0)
@@ -232,11 +231,16 @@ static inline int is_swap_pmd(pmd_t pmd)
return !pmd_none(pmd) && !pmd_present(pmd);
}

+static inline int is_huge_pmd(pmd_t pmd)
+{
+ return is_swap_pmd(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd);
+}
+
/* mmap_lock must be held on entry */
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_huge_pmd(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..e76ee2e1e423 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1832,8 +1832,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
+ if (likely(is_huge_pmd(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 855486fff526..b00cd60fc368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1146,8 +1146,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
- || pmd_devmap(*src_pmd)) {
+ if (is_huge_pmd(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1441,7 +1440,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_huge_pmd(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
else if (zap_huge_pmd(tlb, vma, pmd, addr))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e552f5e0ccbd..2d5064a4631c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -257,7 +257,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);
}

- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_huge_pmd(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index 002eec83e91e..c6e9da09dd0a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,8 +532,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
- if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
- pmd_devmap(*old_pmd)) {
+ if (is_huge_pmd(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
old_pmd, new_pmd, need_rmap_locks))
--
2.11.0

2021-11-10 08:43:05

by Qi Zheng

Subject: [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h

pte_offset_map() is in include/linux/pgtable.h, so move its
friend pte_offset_map_lock() to pgtable.h as well.

pte_lockptr() is required for pte_offset_map_lock(), so
also move {pte,pmd,pud}_lockptr() to pgtable.h.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm.h | 149 ------------------------------------------------
include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 149 insertions(+), 149 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..706da081b9f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2284,70 +2284,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
}
#endif /* CONFIG_MMU */

-#if USE_SPLIT_PTE_PTLOCKS
-#if ALLOC_SPLIT_PTLOCKS
-void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
-extern void ptlock_free(struct page *page);
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
- return page->ptl;
-}
-#else /* ALLOC_SPLIT_PTLOCKS */
-static inline void ptlock_cache_init(void)
-{
-}
-
-static inline bool ptlock_alloc(struct page *page)
-{
- return true;
-}
-
-static inline void ptlock_free(struct page *page)
-{
-}
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
- return &page->ptl;
-}
-#endif /* ALLOC_SPLIT_PTLOCKS */
-
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return ptlock_ptr(pmd_page(*pmd));
-}
-
-static inline bool ptlock_init(struct page *page)
-{
- /*
- * prep_new_page() initialize page->private (and therefore page->ptl)
- * with 0. Make sure nobody took it in use in between.
- *
- * It can happen if arch try to use slab for page table allocation:
- * slab code uses page->slab_cache, which share storage with page->ptl.
- */
- VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
- if (!ptlock_alloc(page))
- return false;
- spin_lock_init(ptlock_ptr(page));
- return true;
-}
-
-#else /* !USE_SPLIT_PTE_PTLOCKS */
-/*
- * We use mm->page_table_lock to guard all pagetable pages of the mm.
- */
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return &mm->page_table_lock;
-}
-static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
-static inline void ptlock_free(struct page *page) {}
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
static inline void pgtable_init(void)
{
ptlock_cache_init();
@@ -2370,20 +2306,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}

-#define pte_offset_map_lock(mm, pmd, address, ptlp) \
-({ \
- spinlock_t *__ptl = pte_lockptr(mm, pmd); \
- pte_t *__pte = pte_offset_map(pmd, address); \
- *(ptlp) = __ptl; \
- spin_lock(__ptl); \
- __pte; \
-})
-
-#define pte_unmap_unlock(pte, ptl) do { \
- spin_unlock(ptl); \
- pte_unmap(pte); \
-} while (0)
-
#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))

#define pte_alloc_map(mm, pmd, address) \
@@ -2397,58 +2319,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
NULL: pte_offset_kernel(pmd, address))

-#if USE_SPLIT_PMD_PTLOCKS
-
-static struct page *pmd_to_page(pmd_t *pmd)
-{
- unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
- return virt_to_page((void *)((unsigned long) pmd & mask));
-}
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return ptlock_ptr(pmd_to_page(pmd));
-}
-
-static inline bool pmd_ptlock_init(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- page->pmd_huge_pte = NULL;
-#endif
- return ptlock_init(page);
-}
-
-static inline void pmd_ptlock_free(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
-#endif
- ptlock_free(page);
-}
-
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
-
-#else
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return &mm->page_table_lock;
-}
-
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
-
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
-
-#endif
-
-static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
-{
- spinlock_t *ptl = pmd_lockptr(mm, pmd);
- spin_lock(ptl);
- return ptl;
-}
-
static inline bool pgtable_pmd_page_ctor(struct page *page)
{
if (!pmd_ptlock_init(page))
@@ -2465,25 +2335,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}

-/*
- * No scalability reason to split PUD locks yet, but follow the same pattern
- * as the PMD locks to make it easier if we decide to. The VM should not be
- * considered ready to switch to split PUD locks yet; there may be places
- * which need to be converted from page_table_lock.
- */
-static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
-{
- return &mm->page_table_lock;
-}
-
-static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
-{
- spinlock_t *ptl = pud_lockptr(mm, pud);
-
- spin_lock(ptl);
- return ptl;
-}
-
extern void __init pagecache_init(void);
extern void __init free_area_init_memoryless_node(int nid);
extern void free_initmem(void);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..c8f045705c1e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -84,6 +84,141 @@ static inline unsigned long pud_index(unsigned long address)
#define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
#endif

+#if USE_SPLIT_PTE_PTLOCKS
+#if ALLOC_SPLIT_PTLOCKS
+void __init ptlock_cache_init(void);
+extern bool ptlock_alloc(struct page *page);
+extern void ptlock_free(struct page *page);
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+ return page->ptl;
+}
+#else /* ALLOC_SPLIT_PTLOCKS */
+static inline void ptlock_cache_init(void)
+{
+}
+
+static inline bool ptlock_alloc(struct page *page)
+{
+ return true;
+}
+
+static inline void ptlock_free(struct page *page)
+{
+}
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+ return &page->ptl;
+}
+#endif /* ALLOC_SPLIT_PTLOCKS */
+
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return ptlock_ptr(pmd_page(*pmd));
+}
+
+static inline bool ptlock_init(struct page *page)
+{
+ /*
+ * prep_new_page() initialize page->private (and therefore page->ptl)
+ * with 0. Make sure nobody took it in use in between.
+ *
+ * It can happen if arch try to use slab for page table allocation:
+ * slab code uses page->slab_cache, which share storage with page->ptl.
+ */
+ VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+ if (!ptlock_alloc(page))
+ return false;
+ spin_lock_init(ptlock_ptr(page));
+ return true;
+}
+
+#else /* !USE_SPLIT_PTE_PTLOCKS */
+/*
+ * We use mm->page_table_lock to guard all pagetable pages of the mm.
+ */
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &mm->page_table_lock;
+}
+static inline void ptlock_cache_init(void) {}
+static inline bool ptlock_init(struct page *page) { return true; }
+static inline void ptlock_free(struct page *page) {}
+#endif /* USE_SPLIT_PTE_PTLOCKS */
+
+#if USE_SPLIT_PMD_PTLOCKS
+
+static struct page *pmd_to_page(pmd_t *pmd)
+{
+ unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+ return virt_to_page((void *)((unsigned long) pmd & mask));
+}
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return ptlock_ptr(pmd_to_page(pmd));
+}
+
+static inline bool pmd_ptlock_init(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ page->pmd_huge_pte = NULL;
+#endif
+ return ptlock_init(page);
+}
+
+static inline void pmd_ptlock_free(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+#endif
+ ptlock_free(page);
+}
+
+#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+
+#else
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &mm->page_table_lock;
+}
+
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
+
+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+
+#endif
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl = pmd_lockptr(mm, pmd);
+ spin_lock(ptl);
+ return ptl;
+}
+
+/*
+ * No scalability reason to split PUD locks yet, but follow the same pattern
+ * as the PMD locks to make it easier if we decide to. The VM should not be
+ * considered ready to switch to split PUD locks yet; there may be places
+ * which need to be converted from page_table_lock.
+ */
+static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
+{
+ return &mm->page_table_lock;
+}
+
+static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
+{
+ spinlock_t *ptl = pud_lockptr(mm, pud);
+
+ spin_lock(ptl);
+ return ptl;
+}
+
#ifndef pte_offset_kernel
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
@@ -102,6 +237,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
#define pte_unmap(pte) ((void)(pte)) /* NOP */
#endif

+#define pte_offset_map_lock(mm, pmd, address, ptlp) \
+({ \
+ spinlock_t *__ptl = pte_lockptr(mm, pmd); \
+ pte_t *__pte = pte_offset_map(pmd, address); \
+ *(ptlp) = __ptl; \
+ spin_lock(__ptl); \
+ __pte; \
+})
+
+#define pte_unmap_unlock(pte, ptl) do { \
+ spin_unlock(ptl); \
+ pte_unmap(pte); \
+} while (0)
+
/* Find an entry in the second-level page table.. */
#ifndef pmd_offset
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
--
2.11.0

2021-11-10 08:43:12

by Qi Zheng

Subject: [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry()

We need the vmf in lock_page_or_retry() in a subsequent patch,
so pass it in directly.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/pagemap.h | 8 +++-----
mm/filemap.c | 6 ++++--
mm/memory.c | 4 ++--
3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6a30916b76e5..94f9547b4411 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -709,8 +709,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,

void __folio_lock(struct folio *folio);
int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
- unsigned int flags);
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
void unlock_page(struct page *page);
void folio_unlock(struct folio *folio);

@@ -772,14 +771,13 @@ static inline int lock_page_killable(struct page *page)
* Return value and mmap_lock implications depend on flags; see
* __folio_lock_or_retry().
*/
-static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
- unsigned int flags)
+static inline bool lock_page_or_retry(struct page *page, struct vm_fault *vmf)
{
struct folio *folio;
might_sleep();

folio = page_folio(page);
- return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+ return folio_trylock(folio) || __folio_lock_or_retry(folio, vmf);
}

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 07c654202870..ff8d19b7ce1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1695,9 +1695,11 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
* If neither ALLOW_RETRY nor KILLABLE are set, will always return true
* with the folio locked and the mmap_lock unperturbed.
*/
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
- unsigned int flags)
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
{
+ unsigned int flags = vmf->flags;
+ struct mm_struct *mm = vmf->vma->vm_mm;
+
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index b00cd60fc368..bec6a5d5ee7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3443,7 +3443,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;

- if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+ if (!lock_page_or_retry(page, vmf))
return VM_FAULT_RETRY;
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
vma->vm_mm, vmf->address & PAGE_MASK,
@@ -3576,7 +3576,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_release;
}

- locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+ locked = lock_page_or_retry(page, vmf);

delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
if (!locked) {
--
2.11.0

2021-11-10 08:43:56

by Qi Zheng

Subject: [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user PTE page table page

A !pte_none() entry holds a reference on the user PTE page table
page; this includes regular page table entries that map physical
pages, as well as swap entries, migration entries, etc.

So when a pte_none() entry becomes !pte_none(), the refcount of the
PTE page table page needs to be increased. When a !pte_none() entry
becomes none again, the refcount needs to be decreased.

For the swap or migration cases, which only change the content of
the PTE entry, the refcount is kept unchanged.
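
Expressed as a sketch (illustrative only; set_pte_at()/pte_clear() stand in
for whatever helper actually writes the entry in a given path, and pte_get(),
pte_put() and pte_to_pmd() are the helpers from this series):

        /* none -> !none: the new entry pins the PTE page table page */
        set_pte_at(mm, addr, pte, entry);
        pte_get(pte_to_pmd(pte));

        /* !none -> none: drop the reference that entry held */
        pte_clear(mm, addr, pte);
        pte_put(mm, pte_to_pmd(pte), addr);

        /* present <-> swap/migration: only the content changes, so the
         * refcount stays untouched */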

Signed-off-by: Qi Zheng <[email protected]>
---
kernel/events/uprobes.c | 2 ++
mm/filemap.c | 3 +++
mm/madvise.c | 5 +++++
mm/memory.c | 42 +++++++++++++++++++++++++++++++++++-------
mm/migrate.c | 1 +
mm/mremap.c | 7 +++++++
mm/rmap.c | 10 ++++++++++
mm/userfaultfd.c | 2 ++
8 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..96dd2959e1ac 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -200,6 +200,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
if (new_page)
set_pte_at_notify(mm, addr, pvmw.pte,
mk_pte(new_page, vma->vm_page_prot));
+ else
+ pte_put(mm, pte_to_pmd(pvmw.pte), addr);

page_remove_rmap(old_page, false);
if (!page_mapped(old_page))
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e7e9e4fd759..aa47ee11a3d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3309,6 +3309,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
vm_fault_t ret = 0;
+ unsigned int nr_get = 0;

rcu_read_lock();
head = first_map_page(mapping, &xas, end_pgoff);
@@ -3342,6 +3343,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
ret = VM_FAULT_NOPAGE;

do_set_pte(vmf, page, addr);
+ nr_get++;
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, addr, vmf->pte);
unlock_page(head);
@@ -3351,6 +3353,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
put_page(head);
} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ pte_get_many(vmf->pmd, nr_get);
pte_put(vma->vm_mm, vmf->pmd, start);
out:
rcu_read_unlock();
diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..82fc40b6dcbf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -580,6 +580,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
int nr_swap = 0;
unsigned long next;
+ unsigned int nr_put = 0;
+ unsigned long start = addr;

next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd))
@@ -612,6 +614,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
nr_swap--;
free_swap_and_cache(entry);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ nr_put++;
continue;
}

@@ -696,6 +699,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(orig_pte, ptl);
+ if (nr_put)
+ pte_put_many(mm, pmd, start, nr_put);
cond_resched();
next:
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 0b9af38cfa11..ea4d651ac8c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -878,6 +878,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (!userfaultfd_wp(dst_vma))
pte = pte_swp_clear_uffd_wp(pte);
set_pte_at(dst_mm, addr, dst_pte, pte);
+ pte_get(pte_to_pmd(dst_pte));
return 0;
}

@@ -946,6 +947,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
/* Uffd-wp needs to be delivered to dest pte as well */
pte = pte_wrprotect(pte_mkuffd_wp(pte));
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+ pte_get(pte_to_pmd(dst_pte));
return 0;
}

@@ -998,6 +1000,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte = pte_clear_uffd_wp(pte);

set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+ pte_get(pte_to_pmd(dst_pte));
return 0;
}

@@ -1335,6 +1338,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *start_pte;
pte_t *pte;
swp_entry_t entry;
+ unsigned int nr_put = 0;
+ unsigned long start = addr;

tlb_change_page_size(tlb, PAGE_SIZE);
again:
@@ -1359,6 +1364,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
+ nr_put++;
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
@@ -1392,6 +1398,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(zap_skip_check_mapping(details, page)))
continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ nr_put++;
rss[mm_counter(page)]--;

if (is_device_private_entry(entry))
@@ -1416,6 +1423,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ nr_put++;
} while (pte++, addr += PAGE_SIZE, addr != end);

add_mm_rss_vec(mm, rss);
@@ -1442,6 +1450,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
goto again;
}

+ if (nr_put)
+ pte_put_many(mm, pmd, start, nr_put);
+
return addr;
}

@@ -1759,6 +1770,7 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
inc_mm_counter_fast(mm, mm_counter_file(page));
page_add_file_rmap(page, false);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
+ pte_get(pte_to_pmd(pte));
return 0;
}

@@ -2085,6 +2097,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
}

set_pte_at(mm, addr, pte, entry);
+ pte_get(pte_to_pmd(pte));
update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */

out_unlock:
@@ -2291,6 +2304,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long pfn, pgprot_t prot)
{
unsigned long start = addr;
+ unsigned int nr_get = 0;
pte_t *pte, *mapped_pte;
spinlock_t *ptl;
int err = 0;
@@ -2306,10 +2320,12 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
break;
}
set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
+ nr_get++;
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(mapped_pte, ptl);
+ pte_get_many(pmd, nr_get);
pte_put(mm, pmd, start);
return err;
}
@@ -2524,6 +2540,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte_t *pte, *mapped_pte;
int err = 0;
spinlock_t *ptl;
+ unsigned int nr_put = 0, nr_get = 0;

if (create) {
mapped_pte = pte = (mm == &init_mm) ?
@@ -2531,6 +2548,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
+ nr_put++;
} else {
mapped_pte = pte = (mm == &init_mm) ?
pte_offset_kernel(pmd, addr) :
@@ -2543,11 +2561,17 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,

if (fn) {
do {
- if (create || !pte_none(*pte)) {
+ if (create) {
err = fn(pte++, addr, data);
- if (err)
- break;
+ if (mm != &init_mm && !pte_none(*(pte-1)))
+ nr_get++;
+ } else if (!pte_none(*pte)) {
+ err = fn(pte++, addr, data);
+ if (mm != &init_mm && pte_none(*(pte-1)))
+ nr_put++;
}
+ if (err)
+ break;
} while (addr += PAGE_SIZE, addr != end);
}
*mask |= PGTBL_PTE_MODIFIED;
@@ -2556,8 +2580,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,

if (mm != &init_mm) {
pte_unmap_unlock(mapped_pte, ptl);
- if (create)
- pte_put(mm, pmd, start);
+ pte_get_many(pmd, nr_get);
+ if (nr_put)
+ pte_put_many(mm, pmd, start, nr_put);
}
return err;
}
@@ -3835,6 +3860,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
lru_cache_add_inactive_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+ pte_get(vmf->pmd);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
@@ -4086,10 +4112,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
vmf->address, &vmf->ptl);
ret = 0;
/* Re-check under ptl */
- if (likely(pte_none(*vmf->pte)))
+ if (likely(pte_none(*vmf->pte))) {
do_set_pte(vmf, page, vmf->address);
- else
+ pte_get(vmf->pmd);
+ } else {
ret = VM_FAULT_NOPAGE;
+ }

update_mmu_tlb(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index 26f16a4836d8..c03ac25f42a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2807,6 +2807,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
} else {
/* No need to invalidate - it was non-present before */
set_pte_at(mm, addr, ptep, entry);
+ pte_get(pmdp);
update_mmu_cache(vma, addr, ptep);
}

diff --git a/mm/mremap.c b/mm/mremap.c
index f80c628db25d..088a7a75cb4b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -141,6 +141,8 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spinlock_t *old_ptl, *new_ptl;
bool force_flush = false;
unsigned long len = old_end - old_addr;
+ unsigned int nr_put = 0, nr_get = 0;
+ unsigned long old_start = old_addr;

/*
* When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -181,6 +183,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
continue;

pte = ptep_get_and_clear(mm, old_addr, old_pte);
+ nr_put++;
/*
* If we are remapping a valid PTE, make sure
* to flush TLB before we drop the PTL for the
@@ -197,6 +200,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
pte = move_soft_dirty_pte(pte);
set_pte_at(mm, new_addr, new_pte, pte);
+ nr_get++;
}

arch_leave_lazy_mmu_mode();
@@ -206,6 +210,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
+ pte_get_many(new_pmd, nr_get);
+ if (nr_put)
+ pte_put_many(mm, old_pmd, old_start, nr_put);
if (need_rmap_locks)
drop_rmap_locks(vma);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 2908d637bcad..630ce8a036b5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1404,6 +1404,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
bool ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
+ unsigned int nr_put = 0;

/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1551,6 +1552,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
+ nr_put++;
} else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(subpage) };
pte_t swp_pte;
@@ -1564,6 +1566,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
+ pte_put(mm, pvmw.pmd, address);
page_vma_mapped_walk_done(&pvmw);
break;
}
@@ -1575,6 +1578,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
mmu_notifier_invalidate_range(mm,
address, address + PAGE_SIZE);
dec_mm_counter(mm, MM_ANONPAGES);
+ nr_put++;
goto discard;
}

@@ -1630,6 +1634,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* See Documentation/vm/mmu_notifier.rst
*/
dec_mm_counter(mm, mm_counter_file(page));
+ nr_put++;
}
discard:
/*
@@ -1641,6 +1646,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
page_remove_rmap(subpage, PageHuge(page));
put_page(page);
+ if (nr_put) {
+ pte_put_many(mm, pvmw.pmd, address, nr_put);
+ nr_put = 0;
+ }
}

mmu_notifier_invalidate_range_end(&range);
@@ -1871,6 +1880,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
+ pte_put(mm, pvmw.pmd, address);
} else {
swp_entry_t entry;
pte_t swp_pte;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 37df899a1b9d..b87c61b94065 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -110,6 +110,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
lru_cache_add_inactive_or_unevictable(page, dst_vma);

set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+ pte_get(dst_pmd);

/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);
@@ -204,6 +205,7 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
if (!pte_none(*dst_pte))
goto out_unlock;
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+ pte_get(dst_pmd);
/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);
ret = 0;
--
2.11.0

2021-11-10 08:44:21

by Qi Zheng

Subject: [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends

When we call __pte_alloc() or one of its friends, a huge pmd might
be created concurrently by a different thread. This is why
pmd_trans_unstable() currently has to be called after __pte_alloc()
or its friends return.

This patch adds a pmd_installed_type return value to __pte_alloc() and
its friends, so that we can detect the huge pmd through the return
value instead of calling pmd_trans_unstable() again.

This patch has no functional change; it is just preparation
for later patches.
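
For example, a caller can then do something like this (sketch based on the
do_anonymous_page() hunk below):

        int alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);

        if (alloc_ret < 0)                      /* allocation failed */
                return VM_FAULT_OOM;
        if (unlikely(alloc_ret == INSTALLED_HUGE_PMD))
                return 0;                       /* huge pmd appeared under us */
        /* alloc_ret == INSTALLED_PTE: a PTE page table is in place */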

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm.h | 20 +++++++++++++++++---
mm/debug_vm_pgtable.c | 2 +-
mm/filemap.c | 11 +++++++----
mm/gup.c | 2 +-
mm/internal.h | 3 ++-
mm/memory.c | 39 ++++++++++++++++++++++++++-------------
mm/migrate.c | 17 ++---------------
mm/mremap.c | 2 +-
mm/userfaultfd.c | 24 +++++++++++++++---------
9 files changed, 72 insertions(+), 48 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 706da081b9f8..52f36fde2f11 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2306,13 +2306,27 @@ static inline void pgtable_pte_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}

-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+enum pmd_installed_type {
+ INSTALLED_PTE,
+ INSTALLED_HUGE_PMD,
+};
+
+static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+{
+ if (unlikely(pmd_none(*(pmd))))
+ return __pte_alloc(mm, pmd);
+ if (unlikely(is_huge_pmd(*pmd)))
+ return INSTALLED_HUGE_PMD;
+
+ return INSTALLED_PTE;
+}
+#define pte_alloc pte_alloc

#define pte_alloc_map(mm, pmd, address) \
- (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+ (pte_alloc(mm, pmd) < 0 ? NULL : pte_offset_map(pmd, address))

#define pte_alloc_map_lock(mm, pmd, address, ptlp) \
- (pte_alloc(mm, pmd) ? \
+ (pte_alloc(mm, pmd) < 0 ? \
NULL : pte_offset_map_lock(mm, pmd, address, ptlp))

#define pte_alloc_kernel(pmd, address) \
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 228e3954b90c..b8322c55e65d 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1170,7 +1170,7 @@ static int __init init_args(struct pgtable_debug_args *args)
args->start_pmdp = pmd_offset(args->pudp, 0UL);
WARN_ON(!args->start_pmdp);

- if (pte_alloc(args->mm, args->pmdp)) {
+ if (pte_alloc(args->mm, args->pmdp) < 0) {
pr_err("Failed to allocate pte entries\n");
ret = -ENOMEM;
goto error;
diff --git a/mm/filemap.c b/mm/filemap.c
index ff8d19b7ce1d..23363f8ddbbe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,12 +3217,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
}
}

- if (pmd_none(*vmf->pmd))
- pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+ if (pmd_none(*vmf->pmd)) {
+ int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);

- /* See comment in handle_pte_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd))
+ if (unlikely(ret == INSTALLED_HUGE_PMD))
+ goto out;
+ } else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ /* See comment in handle_pte_fault() */
goto out;
+ }

return false;

diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..2def775232a3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -699,7 +699,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
} else {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
- ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
+ ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0;
}

return ret ? ERR_PTR(ret) :
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..474d6e3443f8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -67,7 +67,8 @@ bool __folio_end_writeback(struct folio *folio);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+ pgtable_t *pte);

static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
{
diff --git a/mm/memory.c b/mm/memory.c
index bec6a5d5ee7c..8a39c0e58324 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -437,8 +437,10 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
}
}

-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+ pgtable_t *pte)
{
+ int ret = INSTALLED_PTE;
spinlock_t *ptl = pmd_lock(mm, pmd);

if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
@@ -459,20 +461,26 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
pmd_populate(mm, pmd, *pte);
*pte = NULL;
+ } else if (is_huge_pmd(*pmd)) {
+ /* See comment in handle_pte_fault() */
+ ret = INSTALLED_HUGE_PMD;
}
spin_unlock(ptl);
+
+ return ret;
}

int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
{
+ enum pmd_installed_type ret;
pgtable_t new = pte_alloc_one(mm);
if (!new)
return -ENOMEM;

- pmd_install(mm, pmd, &new);
+ ret = pmd_install(mm, pmd, &new);
if (new)
pte_free(mm, new);
- return 0;
+ return ret;
}

int __pte_alloc_kernel(pmd_t *pmd)
@@ -1813,7 +1821,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,

/* Allocate the PTE if necessary; takes PMD lock once only. */
ret = -ENOMEM;
- if (pte_alloc(mm, pmd))
+ if (pte_alloc(mm, pmd) < 0)
goto out;

while (pages_to_write_in_pmd) {
@@ -3713,6 +3721,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
struct page *page;
vm_fault_t ret = 0;
pte_t entry;
+ int alloc_ret;

/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
@@ -3728,11 +3737,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
*
* Here we only have mmap_read_lock(mm).
*/
- if (pte_alloc(vma->vm_mm, vmf->pmd))
+ alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);
+ if (alloc_ret < 0)
return VM_FAULT_OOM;
-
/* See comment in handle_pte_fault() */
- if (unlikely(pmd_trans_unstable(vmf->pmd)))
+ if (unlikely(alloc_ret == INSTALLED_HUGE_PMD))
return 0;

/* Use the zero-page for reads */
@@ -4023,6 +4032,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
}

if (pmd_none(*vmf->pmd)) {
+ int alloc_ret;
+
if (PageTransCompound(page)) {
ret = do_set_pmd(vmf, page);
if (ret != VM_FAULT_FALLBACK)
@@ -4030,14 +4041,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
}

if (vmf->prealloc_pte)
- pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
- else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
- return VM_FAULT_OOM;
- }
+ alloc_ret = pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+ else
+ alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);

- /* See comment in handle_pte_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd))
+ if (unlikely(alloc_ret != INSTALLED_PTE))
+ return alloc_ret < 0 ? VM_FAULT_OOM : 0;
+ } else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ /* See comment in handle_pte_fault() */
return 0;
+ }

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index cf25b00f03c8..bdfdfd3b50be 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2731,21 +2731,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
goto abort;

- /*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
- */
- if (pte_alloc(mm, pmdp))
- goto abort;
-
- /* See the comment in pte_alloc_one_map() */
- if (unlikely(pmd_trans_unstable(pmdp)))
+ /* See the comment in do_anonymous_page() */
+ if (unlikely(pte_alloc(mm, pmdp) != INSTALLED_PTE))
goto abort;

if (unlikely(anon_vma_prepare(vma)))
diff --git a/mm/mremap.c b/mm/mremap.c
index c6e9da09dd0a..fc5c56858883 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -551,7 +551,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
continue;
}

- if (pte_alloc(new_vma->vm_mm, new_pmd))
+ if (pte_alloc(new_vma->vm_mm, new_pmd) < 0)
break;
move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
new_pmd, new_addr, need_rmap_locks);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..2cea08e7f076 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -592,15 +592,21 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
err = -EEXIST;
break;
}
- if (unlikely(pmd_none(dst_pmdval)) &&
- unlikely(__pte_alloc(dst_mm, dst_pmd))) {
- err = -ENOMEM;
- break;
- }
- /* If an huge pmd materialized from under us fail */
- if (unlikely(pmd_trans_huge(*dst_pmd))) {
- err = -EFAULT;
- break;
+
+ if (unlikely(pmd_none(dst_pmdval))) {
+ int ret = __pte_alloc(dst_mm, dst_pmd);
+
+ /*
+ * If there is not enough memory or an huge pmd
+ * materialized from under us
+ */
+ if (unlikely(ret < 0)) {
+ err = -ENOMEM;
+ break;
+ } else if (unlikely(ret == INSTALLED_HUGE_PMD)) {
+ err = -EFAULT;
+ break;
+ }
}

BUG_ON(pmd_none(*dst_pmd));
--
2.11.0

2021-11-10 08:44:28

by Qi Zheng

Subject: [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation

When the PTE page table page is allocated and installed into the
pmd entry, it takes an initial reference count to prevent the PTE
page table page from being released by other threads, and the caller
of pte_alloc() (or one of its friends) must drop this reference count
when it is done with the page table.
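
For example, with this patch the pattern in follow_pmd_mask() becomes
(simplified sketch based on the gup.c hunk below):

        if (pte_alloc(mm, pmd) < 0)
                return ERR_PTR(-ENOMEM);

        /* pte_alloc() succeeded, so we hold a reference on the PTE page */
        page = follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
        pte_put(mm, pmd, address);      /* drop it once the walk is done */
        return page;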

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm.h | 7 +++++--
mm/debug_vm_pgtable.c | 1 +
mm/filemap.c | 8 ++++++--
mm/gup.c | 10 +++++++---
mm/memory.c | 51 +++++++++++++++++++++++++++++++++++++++++----------
mm/migrate.c | 9 ++++++---
mm/mlock.c | 1 +
mm/mremap.c | 1 +
mm/userfaultfd.c | 16 +++++++++++++++-
9 files changed, 83 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52f36fde2f11..753a9435e0d0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
#include <linux/err.h>
#include <linux/page-flags.h>
#include <linux/page_ref.h>
+#include <linux/pte_ref.h>
#include <linux/memremap.h>
#include <linux/overflow.h>
#include <linux/sizes.h>
@@ -2313,9 +2314,11 @@ enum pmd_installed_type {

static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd)
{
- if (unlikely(pmd_none(*(pmd))))
+ enum pte_tryget_type ret = pte_try_get(pmd);
+
+ if (ret == TRYGET_FAILED_NONE || ret == TRYGET_FAILED_ZERO)
return __pte_alloc(mm, pmd);
- if (unlikely(is_huge_pmd(*pmd)))
+ else if (ret == TRYGET_FAILED_HUGE_PMD)
return INSTALLED_HUGE_PMD;

return INSTALLED_PTE;
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index b8322c55e65d..52f006654664 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1048,6 +1048,7 @@ static void __init destroy_args(struct pgtable_debug_args *args)

/* Free page table entries */
if (args->start_ptep) {
+ pte_put(args->mm, args->start_pmdp, args->vaddr);
pte_free(args->mm, args->start_ptep);
mm_dec_nr_ptes(args->mm);
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 23363f8ddbbe..1e7e9e4fd759 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,6 +3217,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
}
}

+retry:
if (pmd_none(*vmf->pmd)) {
int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);

@@ -3225,6 +3226,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
/* See comment in handle_pte_fault() */
goto out;
+ } else if (pte_try_get(vmf->pmd) == TRYGET_FAILED_ZERO) {
+ goto retry;
}

return false;
@@ -3301,7 +3304,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
- unsigned long addr;
+ unsigned long addr, start;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
@@ -3317,7 +3320,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
goto out;
}

- addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ start = addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
do {
page = find_subpage(head, xas.xa_index);
@@ -3348,6 +3351,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
put_page(head);
} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ pte_put(vma->vm_mm, vmf->pmd, start);
out:
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
diff --git a/mm/gup.c b/mm/gup.c
index 2def775232a3..e084111103f0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -694,7 +694,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
spin_unlock(ptl);
ret = 0;
split_huge_pmd(vma, pmd, address);
- if (pmd_trans_unstable(pmd))
+ if (pte_try_get(pmd) == TRYGET_FAILED_HUGE_PMD)
ret = -EBUSY;
} else {
spin_unlock(ptl);
@@ -702,8 +702,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0;
}

- return ret ? ERR_PTR(ret) :
- follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+ if (ret)
+ return ERR_PTR(ret);
+
+ page = follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+ pte_put(mm, pmd, address);
+ return page;
}
page = follow_trans_huge_pmd(vma, address, pmd, flags);
spin_unlock(ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 8a39c0e58324..0b9af38cfa11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -441,10 +441,13 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
pgtable_t *pte)
{
int ret = INSTALLED_PTE;
- spinlock_t *ptl = pmd_lock(mm, pmd);
+ spinlock_t *ptl;

+retry:
+ ptl = pmd_lock(mm, pmd);
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
mm_inc_nr_ptes(mm);
+ pte_ref_init(*pte, pmd, 1);
/*
* Ensure all pte setup (eg. pte page lock and page clearing) are
* visible before the pte is made visible to other CPUs by being
@@ -464,6 +467,9 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
} else if (is_huge_pmd(*pmd)) {
/* See comment in handle_pte_fault() */
ret = INSTALLED_HUGE_PMD;
+ } else if (!pte_get_unless_zero(pmd)) {
+ spin_unlock(ptl);
+ goto retry;
}
spin_unlock(ptl);

@@ -1028,6 +1034,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
int rss[NR_MM_COUNTERS];
swp_entry_t entry = (swp_entry_t){0};
struct page *prealloc = NULL;
+ unsigned long start = addr;

again:
progress = 0;
@@ -1108,6 +1115,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte_unmap(orig_src_pte);
add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
+ pte_put(dst_mm, dst_pmd, start);
cond_resched();

if (ret == -EIO) {
@@ -1778,6 +1786,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
goto out;
retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
pte_unmap_unlock(pte, ptl);
+ pte_put(mm, pte_to_pmd(pte), addr);
out:
return retval;
}
@@ -1810,6 +1819,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
unsigned long remaining_pages_total = *num;
unsigned long pages_to_write_in_pmd;
int ret;
+ unsigned long start = addr;
more:
ret = -EFAULT;
pmd = walk_to_pmd(mm, addr);
@@ -1836,7 +1846,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
pte_unmap_unlock(start_pte, pte_lock);
ret = err;
remaining_pages_total -= pte_idx;
- goto out;
+ goto put;
}
addr += PAGE_SIZE;
++curr_page_idx;
@@ -1845,9 +1855,13 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
pages_to_write_in_pmd -= batch_size;
remaining_pages_total -= batch_size;
}
- if (remaining_pages_total)
+ if (remaining_pages_total) {
+ pte_put(mm, pmd, start);
goto more;
+ }
ret = 0;
+put:
+ pte_put(mm, pmd, start);
out:
*num = remaining_pages_total;
return ret;
@@ -2075,6 +2089,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,

out_unlock:
pte_unmap_unlock(pte, ptl);
+ pte_put(mm, pte_to_pmd(pte), addr);
return VM_FAULT_NOPAGE;
}

@@ -2275,6 +2290,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned long pfn, pgprot_t prot)
{
+ unsigned long start = addr;
pte_t *pte, *mapped_pte;
spinlock_t *ptl;
int err = 0;
@@ -2294,6 +2310,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(mapped_pte, ptl);
+ pte_put(mm, pmd, start);
return err;
}

@@ -2503,6 +2520,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte_fn_t fn, void *data, bool create,
pgtbl_mod_mask *mask)
{
+ unsigned long start = addr;
pte_t *pte, *mapped_pte;
int err = 0;
spinlock_t *ptl;
@@ -2536,8 +2554,11 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,

arch_leave_lazy_mmu_mode();

- if (mm != &init_mm)
+ if (mm != &init_mm) {
pte_unmap_unlock(mapped_pte, ptl);
+ if (create)
+ pte_put(mm, pmd, start);
+ }
return err;
}

@@ -3761,7 +3782,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- return handle_userfault(vmf, VM_UFFD_MISSING);
+ ret = handle_userfault(vmf, VM_UFFD_MISSING);
+ goto put;
}
goto setpte;
}
@@ -3804,7 +3826,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
put_page(page);
- return handle_userfault(vmf, VM_UFFD_MISSING);
+ ret = handle_userfault(vmf, VM_UFFD_MISSING);
+ goto put;
}

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3817,14 +3840,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ goto put;
release:
put_page(page);
goto unlock;
oom_free_page:
put_page(page);
oom:
- return VM_FAULT_OOM;
+ ret = VM_FAULT_OOM;
+put:
+ pte_put(vma->vm_mm, vmf->pmd, vmf->address);
+ return ret;
}

/*
@@ -4031,7 +4057,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return ret;
}

- if (pmd_none(*vmf->pmd)) {
+retry:
+ ret = pte_try_get(vmf->pmd);
+ if (ret == TRYGET_FAILED_NONE) {
int alloc_ret;

if (PageTransCompound(page)) {
@@ -4047,9 +4075,11 @@ vm_fault_t finish_fault(struct vm_fault *vmf)

if (unlikely(alloc_ret != INSTALLED_PTE))
return alloc_ret < 0 ? VM_FAULT_OOM : 0;
- } else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ } else if (ret == TRYGET_FAILED_HUGE_PMD) {
/* See comment in handle_pte_fault() */
return 0;
+ } else if (ret == TRYGET_FAILED_ZERO) {
+ goto retry;
}

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -4063,6 +4093,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)

update_mmu_tlb(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ pte_put(vma->vm_mm, vmf->pmd, vmf->address);
return ret;
}

diff --git a/mm/migrate.c b/mm/migrate.c
index bdfdfd3b50be..26f16a4836d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2736,9 +2736,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
goto abort;

if (unlikely(anon_vma_prepare(vma)))
- goto abort;
+ goto put;
if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
- goto abort;
+ goto put;

/*
* The memory barrier inside __SetPageUptodate makes sure that
@@ -2764,7 +2764,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
* device memory.
*/
pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
- goto abort;
+ goto put;
}
} else {
entry = mk_pte(page, vma->vm_page_prot);
@@ -2811,11 +2811,14 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}

pte_unmap_unlock(ptep, ptl);
+ pte_put(mm, pmdp, addr);
*src = MIGRATE_PFN_MIGRATE;
return;

unlock_abort:
pte_unmap_unlock(ptep, ptl);
+put:
+ pte_put(mm, pmdp, addr);
abort:
*src &= ~MIGRATE_PFN_MIGRATE;
}
diff --git a/mm/mlock.c b/mm/mlock.c
index e263d62ae2d0..a4ef20ba9627 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -398,6 +398,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
break;
}
pte_unmap_unlock(pte, ptl);
+ pte_put(vma->vm_mm, pte_to_pmd(pte), start);
return start;
}

diff --git a/mm/mremap.c b/mm/mremap.c
index fc5c56858883..f80c628db25d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -555,6 +555,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
break;
move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
new_pmd, new_addr, need_rmap_locks);
+ pte_put(new_vma->vm_mm, new_pmd, new_addr);
}

mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 2cea08e7f076..37df899a1b9d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -574,6 +574,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,

while (src_addr < src_start + len) {
pmd_t dst_pmdval;
+ enum pte_tryget_type tryget_type;

BUG_ON(dst_addr >= dst_start + len);

@@ -583,6 +584,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
break;
}

+again:
+ /*
+ * Once the PTE page is managed by the refcount, it may be
+ * released by another thread (and freed via RCU), so the rcu
+ * read lock is held here to prevent the PTE page from being
+ * released while we look at the pmd entry.
+ */
+ rcu_read_lock();
dst_pmdval = pmd_read_atomic(dst_pmd);
/*
* If the dst_pmd is mapped as THP don't
@@ -593,7 +602,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
break;
}

- if (unlikely(pmd_none(dst_pmdval))) {
+ tryget_type = pte_try_get(&dst_pmdval);
+ rcu_read_unlock();
+ if (unlikely(tryget_type == TRYGET_FAILED_NONE)) {
int ret = __pte_alloc(dst_mm, dst_pmd);

/*
@@ -607,6 +618,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
err = -EFAULT;
break;
}
+ } else if (unlikely(tryget_type == TRYGET_FAILED_ZERO)) {
+ goto again;
}

BUG_ON(pmd_none(*dst_pmd));
@@ -614,6 +627,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,

err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
src_addr, &page, mcopy_mode, wp_copy);
+ pte_put(dst_mm, dst_pmd, dst_addr);
cond_resched();

if (unlikely(err == -ENOENT)) {
--
2.11.0

2021-11-10 08:44:34

by Qi Zheng

[permalink] [raw]
Subject: [PATCH v3 06/15] mm: introduce refcount for user PTE page table page

1. Preface
==========

In pursuit of performance, applications mostly use high-performance
user-mode memory allocators, such as jemalloc or tcmalloc. These memory
allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical
memory for the following reasons::

    First of all, we should take the write lock of mmap_lock as rarely as
    possible, since the mmap_lock semaphore has long been a contention point
    in the memory management subsystem. mmap()/munmap() take the write lock,
    while madvise(MADV_DONTNEED or MADV_FREE) only takes the read lock, so
    using madvise() instead of munmap() to release physical memory reduces
    contention on the mmap_lock.

    Secondly, after madvise() has released the physical memory, there is no
    need to rebuild the vma or allocate page tables again when the same
    virtual addresses are accessed again, which also saves some time.

The following table shows the maximum amount of user PTE page table memory
that a single user process can allocate on a 32-bit and a 64-bit system.

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(for 32-bit, take 3G user address space, 4K page size as an example;
for 64-bit, take 48-bit address width, 4K page size as an example.)
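
To verify the 64-bit numbers, the arithmetic is roughly the following
(a sketch assuming 4-level paging with 512 entries per page table page)::

    one PTE page maps : 512 * 4 KiB  = 2 MiB of virtual address space
    PTE pages needed  : 2^48 / 2 MiB = 2^27 pages
    PTE memory        : 2^27 * 4 KiB = 512 GiB

    one PMD page maps : 512 * 2 MiB  = 1 GiB of virtual address space
    PMD pages needed  : 2^48 / 1 GiB = 2^18 pages
    PMD memory        : 2^18 * 4 KiB = 1 GiB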

After using madvise(), everything looks good, but as can be seen from the
above table, a single process can create a large number of PTE page tables
on a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases
page table memory. And until the process exits or calls munmap(), the
kernel cannot reclaim these pages even if these PTE page tables no longer
map anything.

Therefore, we decided to introduce a reference count to manage the PTE page
table life cycle, so that empty PTE page table pages in the system can be
released dynamically.

2. The reference count of user PTE page table pages
===================================================

We introduce two members for the struct page of the user PTE page table
page::

    union {
        pgtable_t pmd_huge_pte;        /* protected by page->ptl */
        pmd_t *pmd;                    /* PTE page only */
    };
    union {
        struct mm_struct *pt_mm;       /* x86 pgds only */
        atomic_t pt_frag_refcount;     /* powerpc */
        atomic_t pte_refcount;         /* PTE page only */
    };

The pmd member records the pmd entry that maps the user PTE page table page,
and the pte_refcount member keeps track of how many references are held to
the user PTE page table page.
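
For illustration, initializing these two fields amounts to something like
the following sketch (the real pte_ref_init() is only filled in later in
this series; this assumes pgtable_t is struct page *, as on x86)::

    static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
    {
        pte->pmd = pmd;                        /* remember the pmd entry that maps us */
        atomic_set(&pte->pte_refcount, count); /* initial number of users */
    }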

The following hold a reference on the user PTE page table page::

    Any !pte_none() entry, such as a regular page table entry that maps a
    physical page, a swap entry, a migration entry, etc.

    Any visitor to the PTE page table entries, such as a page table walker.

Any ``!pte_none()`` entry and any visitor can be regarded as a user of its
PTE page table page. When the ``pte_refcount`` drops to 0, no one is using
the PTE page table page any more, and the now-empty PTE page table page can
be released back to the system.
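
As a simplified sketch of this accounting rule (illustrative only, not the
actual patch code), every transition of an entry between none and !none is
paired with a get/put on the PTE page that contains it::

    /* install a mapping into a previously pte_none() slot */
    set_pte_at(mm, addr, pte, entry);
    pte_get(pmd);                       /* the new !pte_none() entry holds a ref */

    /* zap nr entries back to pte_none(), e.g. MADV_DONTNEED */
    pte_clear(mm, addr, pte);
    ...
    pte_put_many(mm, pmd, addr, nr);    /* one put per cleared entry; the page is
                                           freed when pte_refcount drops to 0 */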

3. Helpers
==========

+---------------------+-------------------------------------------------+
| pte_ref_init        | Initialize the pte_refcount and pmd             |
+---------------------+-------------------------------------------------+
| pte_to_pmd          | Get the corresponding pmd                       |
+---------------------+-------------------------------------------------+
| pte_update_pmd      | Update the corresponding pmd                    |
+---------------------+-------------------------------------------------+
| pte_get             | Increment a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_get_many        | Add a value to a pte_refcount                   |
+---------------------+-------------------------------------------------+
| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
+---------------------+-------------------------------------------------+
| pte_try_get         | Try to increment a pte_refcount                 |
+---------------------+-------------------------------------------------+
| pte_tryget_map      | Try to increment a pte_refcount before          |
|                     | pte_offset_map()                                |
+---------------------+-------------------------------------------------+
| pte_tryget_map_lock | Try to increment a pte_refcount before          |
|                     | pte_offset_map_lock()                           |
+---------------------+-------------------------------------------------+
| pte_put             | Decrement a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_put_many        | Subtract a value from a pte_refcount            |
+---------------------+-------------------------------------------------+
| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
+---------------------+-------------------------------------------------+
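
A typical visitor (e.g. a page table walker) uses these helpers in the
following pattern, mirroring the conversions elsewhere in this series:
take a reference before mapping the PTE page and drop it after unmapping.
A sketch, with error handling elided::

    pte_t *pte, *mapped_pte;
    spinlock_t *ptl;
    unsigned long start = addr;

    mapped_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
    if (!pte)
        return;     /* no PTE page, a huge pmd, or the page is going away */
    do {
        /* operate on *pte under the page table lock */
    } while (pte++, addr += PAGE_SIZE, addr != end);
    pte_unmap_unlock(mapped_pte, ptl);
    pte_put(mm, pmd, start);            /* drop the visitor's reference */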

4. About this commit
====================

This commit just introduces some dummy helpers; the actual logic will
be implemented in future commits.
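
For orientation, the eventual (non-dummy) put path is expected to boil
down to something like the sketch below. This is illustrative only:
free_pte_table() is a hypothetical helper standing in for the real
freeing logic added later in the series::

    static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
                                    unsigned long addr, unsigned int nr)
    {
        pgtable_t page = pmd_pgtable(*pmd);

        if (atomic_sub_and_test(nr, &page->pte_refcount))
            free_pte_table(mm, pmd, page, addr);   /* hypothetical: clear the pmd,
                                                      flush the TLB, free the page */
    }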

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm_types.h | 6 +++-
include/linux/pte_ref.h | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/Makefile | 4 +--
mm/pte_ref.c | 55 ++++++++++++++++++++++++++++++
4 files changed, 149 insertions(+), 3 deletions(-)
create mode 100644 include/linux/pte_ref.h
create mode 100644 mm/pte_ref.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb8c6f5f19bc..c599008d54fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -149,11 +149,15 @@ struct page {
};
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
- pgtable_t pmd_huge_pte; /* protected by page->ptl */
+ union {
+ pgtable_t pmd_huge_pte; /* protected by page->ptl */
+ pmd_t *pmd; /* PTE page only */
+ };
unsigned long _pt_pad_2; /* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgds only */
atomic_t pt_frag_refcount; /* powerpc */
+ atomic_t pte_refcount; /* PTE page only */
};
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..b6d8335bdc59
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * Author: Qi Zheng <[email protected]>
+ */
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#include <linux/pgtable.h>
+
+enum pte_tryget_type {
+ TRYGET_SUCCESSED,
+ TRYGET_FAILED_ZERO,
+ TRYGET_FAILED_NONE,
+ TRYGET_FAILED_HUGE_PMD,
+};
+
+bool pte_get_unless_zero(pmd_t *pmd);
+enum pte_tryget_type pte_try_get(pmd_t *pmd);
+void pte_put_vmf(struct vm_fault *vmf);
+
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+ return NULL;
+}
+
+static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+}
+
+static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
+{
+}
+
+/*
+ * pte_get - Increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Similar to the mechanism of page refcount, the user of PTE page table
+ * should hold a refcount to it before accessing.
+ */
+static inline void pte_get(pmd_t *pmd)
+{
+ pte_get_many(pmd, 1);
+}
+
+static inline pte_t *pte_tryget_map(pmd_t *pmd, unsigned long address)
+{
+ if (pte_try_get(pmd))
+ return NULL;
+
+ return pte_offset_map(pmd, address);
+}
+
+static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address, spinlock_t **ptlp)
+{
+ if (pte_try_get(pmd))
+ return NULL;
+
+ return pte_offset_map_lock(mm, pmd, address, ptlp);
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned int nr)
+{
+}
+
+/*
+ * pte_put - Decrement refcount for the PTE page table.
+ * @mm: the mm_struct of the target address space.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ * @addr: the start address of the tlb range to be flushed.
+ *
+ * The PTE page table page will be freed when the last refcount is dropped.
+ */
+static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+ pte_put_many(mm, pmd, addr, 1);
+}
+
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index d6c0042e3aa0..ea679bf75a5f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,8 +38,8 @@ mmu-y := nommu.o
mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \
mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
msync.o page_vma_mapped.o pagewalk.o \
- pgtable-generic.o rmap.o vmalloc.o
-
+ pgtable-generic.o rmap.o vmalloc.o \
+ pte_ref.o

ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..de109905bc8f
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * Author: Qi Zheng <[email protected]>
+ */
+
+#include <linux/pte_ref.h>
+#include <linux/mm.h>
+
+/*
+ * pte_get_unless_zero - Increment refcount for the PTE page table
+ * unless it is zero.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ */
+bool pte_get_unless_zero(pmd_t *pmd)
+{
+ return true;
+}
+
+/*
+ * pte_try_get - Try to increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Return TRYGET_SUCCESSED on success, otherwise the reason for the failure.
+ *
+ * Before operating on the PTE page table, we need to hold a refcount
+ * to protect against concurrent release of the PTE page table.
+ * But we will fail in the following cases:
+ * - The content mapped in @pmd is not a PTE page
+ * - The refcount of the PTE page table is already zero, so it will be freed
+ */
+enum pte_tryget_type pte_try_get(pmd_t *pmd)
+{
+ if (unlikely(pmd_none(*pmd)))
+ return TRYGET_FAILED_NONE;
+ if (unlikely(is_huge_pmd(*pmd)))
+ return TRYGET_FAILED_HUGE_PMD;
+
+ return TRYGET_SUCCESSED;
+}
+
+/*
+ * pte_put_vmf - Decrement refcount for the PTE page table.
+ * @vmf: fault information
+ *
+ * The mmap_lock may be unlocked in advance in some cases
+ * in handle_pte_fault(), and then the pmd entry will no longer
+ * be stable. For example, the PTE page pointed to by the pmd may
+ * have been replaced (e.g. by mremap()), so we must ensure that
+ * pte_put() is performed within the mmap_lock critical section.
+ */
+void pte_put_vmf(struct vm_fault *vmf)
+{
+}
--
2.11.0

2021-11-10 08:45:41

by Qi Zheng

[permalink] [raw]
Subject: [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page

When we split a PMD-mapped THP into a PTE-mapped THP, we should
initialize the refcount of the withdrawn PTE page table page to
HPAGE_PMD_NR. This ensures that we can release the PTE page table
page once it becomes empty (i.e. its refcount drops to 0).
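
For reference, with 4K pages on x86-64 HPAGE_PMD_NR is 512, so once the
real pte_ref_init() is wired up this amounts to (a sketch, assuming
pgtable_t is struct page * as on x86)::

    pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR);
    /* i.e. atomic_set(&pgtable->pte_refcount, 512): one reference per
       PTE entry that the THP split populates in the withdrawn page */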

Signed-off-by: Qi Zheng <[email protected]>
---
mm/pgtable-generic.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..523053e09dfa 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -186,6 +186,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
struct page, lru);
if (pmd_huge_pte(mm, pmdp))
list_del(&pgtable->lru);
+ pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR);
return pgtable;
}
#endif
--
2.11.0

2021-11-10 08:54:48

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

Hi all,

I’m sorry, something went wrong when sending this patch set; I will
resend the whole series later.

Thanks,
Qi

On 11/10/21 4:40 PM, Qi Zheng wrote:
> Hi,
>
> This patch series aims to free user PTE page table pages when all PTE entries
> are empty.
>
> The beginning of this story is that some malloc libraries(e.g. jemalloc or
> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those VAs.
> They will use madvise(MADV_DONTNEED) to free physical memory if they want.
> But the page tables do not be freed by madvise(), so it can produce many
> page tables when the process touches an enormous virtual address space.
>
> The following figures are a memory usage snapshot of one process which actually
> happened on our server:
>
> VIRT: 55t
> RES: 590g
> VmPTE: 110g
>
> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
> theory, the process only need 1.2g PTE page tables to map those physical
> memory. The reason why PTE page tables occupy a lot of memory is that
> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
> doesn't free the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory about 108g(best
> case). And the larger the difference between the size of VIRT and RES, the
> more memory we save.
>
> In this patch series, we add a pte_refcount field to the struct page of page
> table to track how many users of PTE page table. Similar to the mechanism of
> page refcount, the user of PTE page table should hold a refcount to it before
> accessing. The PTE page table page will be freed when the last refcount is
> dropped.
>
> Testing:
>
> The following code snippet can show the effect of optimization:
>
> mmap 50G
> while (1) {
> for (; i < 1024 * 25; i++) {
> touch 2M memory
> madvise MADV_DONTNEED 2M
> }
> }
>
> As we can see, the memory usage of VmPTE is reduced:
>
> before after
> VIRT 50.0 GB 50.0 GB
> RES 3.1 MB 3.6 MB
> VmPTE 102640 kB 248 kB
>
> I also have tested the stability by LTP[1] for several weeks. I have not seen
> any crash so far.
>
> The performance of page fault can be affected because of the allocation/freeing
> of PTE page table pages. The following is the test result by using a micro
> benchmark[2]:
>
> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
>
> threads before (pf/min) after (pf/min)
> 1 32,085,255 31,880,833 (-0.64%)
> 8 101,674,967 100,588,311 (-1.17%)
> 16 113,207,000 112,801,832 (-0.36%)
>
> (The "pfn/min" means how many page faults in one minute.)
>
> The performance of page fault is ~1% slower than before.
>
> And there are no obvious changes in perf hot spots:
>
> before:
> 19.29% [kernel] [k] clear_page_rep
> 16.12% [kernel] [k] do_user_addr_fault
> 9.57% [kernel] [k] _raw_spin_unlock_irqrestore
> 6.16% [kernel] [k] get_page_from_freelist
> 5.03% [kernel] [k] __handle_mm_fault
> 3.53% [kernel] [k] __rcu_read_unlock
> 3.45% [kernel] [k] handle_mm_fault
> 3.38% [kernel] [k] down_read_trylock
> 2.74% [kernel] [k] free_unref_page_list
> 2.17% [kernel] [k] up_read
> 1.93% [kernel] [k] charge_memcg
> 1.73% [kernel] [k] try_charge_memcg
> 1.71% [kernel] [k] __alloc_pages
> 1.69% [kernel] [k] ___perf_sw_event
> 1.44% [kernel] [k] get_mem_cgroup_from_mm
>
> after:
> 18.19% [kernel] [k] clear_page_rep
> 16.28% [kernel] [k] do_user_addr_fault
> 8.39% [kernel] [k] _raw_spin_unlock_irqrestore
> 5.12% [kernel] [k] get_page_from_freelist
> 4.81% [kernel] [k] __handle_mm_fault
> 4.68% [kernel] [k] down_read_trylock
> 3.80% [kernel] [k] handle_mm_fault
> 3.59% [kernel] [k] get_mem_cgroup_from_mm
> 2.49% [kernel] [k] free_unref_page_list
> 2.41% [kernel] [k] up_read
> 2.16% [kernel] [k] charge_memcg
> 1.92% [kernel] [k] __rcu_read_unlock
> 1.88% [kernel] [k] ___perf_sw_event
> 1.70% [kernel] [k] pte_get_unless_zero
>
> This series is based on next-20211108.
>
> Comments and suggestions are welcome.
>
> Thanks,
> Qi.
>
> [1] https://github.com/linux-test-project/ltp
> [2] https://lore.kernel.org/lkml/[email protected]/2-multi-fault-all.c
>
> Changelog in v2 -> v3:
> - Refactored this patch series:
> - [PATCH v3 6/15]: Introduce the new dummy helpers first
> - [PATCH v3 7-12/15]: Convert each subsystem individually
> - [PATCH v3 13/15]: Implement the actual logic to the dummy helpers
> And thanks for the advice from David and Jason.
> - Add a document.
>
> Changelog in v1 -> v2:
> - Change pte_install() to pmd_install().
> - Fix some typo and code style problems.
> - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9],[PATCH v2 6/9]
> and [PATCH v2 7/9].
>
> Qi Zheng (15):
> mm: do code cleanups to filemap_map_pmd()
> mm: introduce is_huge_pmd() helper
> mm: move pte_offset_map_lock() to pgtable.h
> mm: rework the parameter of lock_page_or_retry()
> mm: add pmd_installed_type return for __pte_alloc() and other friends
> mm: introduce refcount for user PTE page table page
> mm/pte_ref: add support for user PTE page table page allocation
> mm/pte_ref: initialize the refcount of the withdrawn PTE page table
> page
> mm/pte_ref: add support for the map/unmap of user PTE page table page
> mm/pte_ref: add support for page fault path
> mm/pte_ref: take a refcount before accessing the PTE page table page
> mm/pte_ref: update the pmd entry in move_normal_pmd()
> mm/pte_ref: free user PTE page table pages
> Documentation: add document for pte_ref
> mm/pte_ref: use mmu_gather to free PTE page table pages
>
> Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
> arch/x86/Kconfig | 2 +-
> fs/proc/task_mmu.c | 24 +++-
> fs/userfaultfd.c | 9 +-
> include/linux/huge_mm.h | 10 +-
> include/linux/mm.h | 170 ++++-------------------------
> include/linux/mm_types.h | 6 +-
> include/linux/pagemap.h | 8 +-
> include/linux/pgtable.h | 152 +++++++++++++++++++++++++-
> include/linux/pte_ref.h | 146 +++++++++++++++++++++++++
> include/linux/rmap.h | 2 +
> kernel/events/uprobes.c | 2 +
> mm/Kconfig | 4 +
> mm/Makefile | 4 +-
> mm/damon/vaddr.c | 12 +-
> mm/debug_vm_pgtable.c | 5 +-
> mm/filemap.c | 45 +++++---
> mm/gup.c | 25 ++++-
> mm/hmm.c | 5 +-
> mm/huge_memory.c | 3 +-
> mm/internal.h | 4 +-
> mm/khugepaged.c | 21 +++-
> mm/ksm.c | 6 +-
> mm/madvise.c | 21 +++-
> mm/memcontrol.c | 12 +-
> mm/memory-failure.c | 11 +-
> mm/memory.c | 254 ++++++++++++++++++++++++++++++++-----------
> mm/mempolicy.c | 6 +-
> mm/migrate.c | 54 ++++-----
> mm/mincore.c | 7 +-
> mm/mlock.c | 1 +
> mm/mmu_gather.c | 40 +++----
> mm/mprotect.c | 11 +-
> mm/mremap.c | 14 ++-
> mm/page_vma_mapped.c | 4 +
> mm/pagewalk.c | 15 ++-
> mm/pgtable-generic.c | 1 +
> mm/pte_ref.c | 141 ++++++++++++++++++++++++
> mm/rmap.c | 10 ++
> mm/swapfile.c | 3 +
> mm/userfaultfd.c | 40 +++++--
> 41 files changed, 1186 insertions(+), 340 deletions(-)
> create mode 100644 Documentation/vm/pte_ref.rst
> create mode 100644 include/linux/pte_ref.h
> create mode 100644 mm/pte_ref.c
>

2021-11-10 12:32:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper

On Wed, Nov 10, 2021 at 04:40:44PM +0800, Qi Zheng wrote:
> Currently we have some times the following judgments repeated in the
> code:
>
> is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)
>
> which is to determine whether the *pmd is a huge pmd, so introduce
> is_huge_pmd() helper to deduplicate them.

Isn't this pmd_leaf() ?

Jason

2021-11-10 13:01:21

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper


On 11/10/21 8:29 PM, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 04:40:44PM +0800, Qi Zheng wrote:
>> Currently we have some times the following judgments repeated in the
>> code:
>>
>> is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)
>>
>> which is to determine whether the *pmd is a huge pmd, so introduce
>> is_huge_pmd() helper to deduplicate them.
>
> Isn't this pmd_leaf() ?

Currently, the implementation of pmd_leaf() does not include a
pmd_devmap() check. But considering the semantics of pmd_leaf(), a
"devmap" pmd should also count as a "leaf" pmd. Maybe we should
modify pmd_leaf() to make it more semantically consistent?

By the way, something went wrong when sending this patchset; I have
re-sent the complete patchset, so please comment over there.

Thanks,
Qi

>
> Jason
>

2021-11-10 13:02:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper

On Wed, Nov 10, 2021 at 08:58:35PM +0800, Qi Zheng wrote:
>
> On 11/10/21 8:29 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 10, 2021 at 04:40:44PM +0800, Qi Zheng wrote:
> > > Currently we have some times the following judgments repeated in the
> > > code:
> > >
> > > is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)
> > >
> > > which is to determine whether the *pmd is a huge pmd, so introduce
> > > is_huge_pmd() helper to deduplicate them.
> >
> > Isn't this pmd_leaf() ?
>
> Currently, the implementation of pmd_leaf() does not include
> pmd_devmap() checks.

Are you sure? I thought x86 did via the tricky bit checks?

> But considering the semantics of pmd_leaf(), the "devmap" pmd should
> also belong to "leaf" pmd. Maybe we should modify pmd_leaf() to make
> it more semantically consistent?

I would prefer that..

Jason