2021-11-10 10:59:03

by Qi Zheng

Subject: [PATCH v3 00/15] Free user PTE page table pages

Hi,

This patch series aims to free user PTE page table pages when all PTE entries
are empty.

The story begins with some malloc libraries (e.g. jemalloc or tcmalloc) that
allocate a large amount of virtual address space with mmap() and do not unmap
it. When they want to release physical memory, they use madvise(MADV_DONTNEED)
instead. But madvise() does not free the page tables, so a process that touches
an enormous virtual address space can end up with a large number of page tables.

The following figures are a memory usage snapshot of one process that we
actually observed on our servers:

VIRT: 55t
RES: 590g
VmPTE: 110g

As we can see, the PTE page tables occupy 110g, while RES is only 590g. In
theory, the process needs only about 1.2g of PTE page tables to map that much
physical memory (each 4K PTE page table page maps 2M of virtual address space,
so 590g / 2M * 4K ~= 1.2g). The reason the PTE page tables occupy so much
memory is that madvise(MADV_DONTNEED) only clears the PTEs and frees the
physical memory, but does not free the PTE page table pages themselves. So we
can free those empty PTE page tables to save memory. In the above case, we can
save about 108g of memory (best case), and the larger the difference between
VIRT and RES, the more memory we save.

In this patch series, we add a pte_refcount field to the struct page of the
PTE page table page to track how many users it has. Similar to the page
refcount mechanism, a user of a PTE page table must hold a refcount on it
before accessing it. The PTE page table page is freed when the last refcount
is dropped.
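
For reference, the intended usage pattern looks roughly like the sketch below.
It uses the pte_try_get()/pte_put() helpers introduced later in this series;
walk_pte_table() itself is only a hypothetical caller, not code taken from any
particular patch:

static void walk_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        /* Pin the PTE page table page before touching its entries. */
        if (pte_try_get(pmd) != TRYGET_SUCCESSED)
                return; /* pmd is none/huge, or the page is being freed */

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        /* ... walk the PTE entries under the PTE lock ... */
        pte_unmap_unlock(pte, ptl);

        /* Drop the reference; this may free the page if the refcount hits 0. */
        pte_put(mm, pmd, addr);
}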

Testing:

The following pseudocode shows the effect of the optimization:

mmap 50G
while (1) {
for (; i < 1024 * 25; i++) {
touch 2M memory
madvise MADV_DONTNEED 2M
}
}
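
A minimal userspace reproducer along these lines could look like the sketch
below (illustrative only; it mirrors the pseudocode above rather than the
exact test program used):

#include <string.h>
#include <sys/mman.h>

#define STEP    (2UL << 20)     /* 2M  */
#define SIZE    (50UL << 30)    /* 50G */

int main(void)
{
        char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        unsigned long i;

        if (buf == MAP_FAILED)
                return 1;

        while (1) {
                for (i = 0; i < 1024 * 25; i++) {       /* 25600 * 2M = 50G */
                        char *p = buf + i * STEP;

                        memset(p, 1, STEP);             /* touch 2M */
                        madvise(p, STEP, MADV_DONTNEED);/* free the 2M again */
                }
        }
        return 0;
}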

As we can see, the memory usage of VmPTE is reduced:

          before       after
VIRT     50.0 GB     50.0 GB
RES       3.1 MB      3.6 MB
VmPTE  102640 kB      248 kB

I have also tested the stability with LTP[1] for several weeks and have not
seen any crashes so far.

Page fault performance can be affected by the allocation/freeing of PTE page
table pages. The following are the results of a micro benchmark[2]:

root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

threads    before (pf/min)     after (pf/min)
      1         32,085,255     31,880,833 (-0.64%)
      8        101,674,967    100,588,311 (-1.17%)
     16        113,207,000    112,801,832 (-0.36%)

(pf/min is the number of page faults per minute.)

Page fault performance is ~1% slower than before.

And there are no obvious changes in perf hot spots:

before:
19.29% [kernel] [k] clear_page_rep
16.12% [kernel] [k] do_user_addr_fault
9.57% [kernel] [k] _raw_spin_unlock_irqrestore
6.16% [kernel] [k] get_page_from_freelist
5.03% [kernel] [k] __handle_mm_fault
3.53% [kernel] [k] __rcu_read_unlock
3.45% [kernel] [k] handle_mm_fault
3.38% [kernel] [k] down_read_trylock
2.74% [kernel] [k] free_unref_page_list
2.17% [kernel] [k] up_read
1.93% [kernel] [k] charge_memcg
1.73% [kernel] [k] try_charge_memcg
1.71% [kernel] [k] __alloc_pages
1.69% [kernel] [k] ___perf_sw_event
1.44% [kernel] [k] get_mem_cgroup_from_mm

after:
18.19% [kernel] [k] clear_page_rep
16.28% [kernel] [k] do_user_addr_fault
8.39% [kernel] [k] _raw_spin_unlock_irqrestore
5.12% [kernel] [k] get_page_from_freelist
4.81% [kernel] [k] __handle_mm_fault
4.68% [kernel] [k] down_read_trylock
3.80% [kernel] [k] handle_mm_fault
3.59% [kernel] [k] get_mem_cgroup_from_mm
2.49% [kernel] [k] free_unref_page_list
2.41% [kernel] [k] up_read
2.16% [kernel] [k] charge_memcg
1.92% [kernel] [k] __rcu_read_unlock
1.88% [kernel] [k] ___perf_sw_event
1.70% [kernel] [k] pte_get_unless_zero

This series is based on next-20211108.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/lkml/[email protected]/2-multi-fault-all.c

Changelog in v2 -> v3:
- Refactored this patch series:
- [PATCH v3 6/15]: Introduce the new dummy helpers first
- [PATCH v3 7-12/15]: Convert each subsystem individually
- [PATCH v3 13/15]: Implement the actual logic to the dummy helpers
Thanks to David and Jason for their advice.
- Add a document.

Changelog in v1 -> v2:
- Change pte_install() to pmd_install().
- Fix some typos and code style problems.
- Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9], [PATCH v2 6/9]
and [PATCH v2 7/9].

Qi Zheng (15):
mm: do code cleanups to filemap_map_pmd()
mm: introduce is_huge_pmd() helper
mm: move pte_offset_map_lock() to pgtable.h
mm: rework the parameter of lock_page_or_retry()
mm: add pmd_installed_type return for __pte_alloc() and other friends
mm: introduce refcount for user PTE page table page
mm/pte_ref: add support for user PTE page table page allocation
mm/pte_ref: initialize the refcount of the withdrawn PTE page table
page
mm/pte_ref: add support for the map/unmap of user PTE page table page
mm/pte_ref: add support for page fault path
mm/pte_ref: take a refcount before accessing the PTE page table page
mm/pte_ref: update the pmd entry in move_normal_pmd()
mm/pte_ref: free user PTE page table pages
Documentation: add document for pte_ref
mm/pte_ref: use mmu_gather to free PTE page table pages

Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
arch/x86/Kconfig | 2 +-
fs/proc/task_mmu.c | 24 +++-
fs/userfaultfd.c | 9 +-
include/linux/huge_mm.h | 10 +-
include/linux/mm.h | 170 ++++-------------------------
include/linux/mm_types.h | 6 +-
include/linux/pagemap.h | 8 +-
include/linux/pgtable.h | 152 +++++++++++++++++++++++++-
include/linux/pte_ref.h | 146 +++++++++++++++++++++++++
include/linux/rmap.h | 2 +
kernel/events/uprobes.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 4 +-
mm/damon/vaddr.c | 12 +-
mm/debug_vm_pgtable.c | 5 +-
mm/filemap.c | 45 +++++---
mm/gup.c | 25 ++++-
mm/hmm.c | 5 +-
mm/huge_memory.c | 3 +-
mm/internal.h | 4 +-
mm/khugepaged.c | 21 +++-
mm/ksm.c | 6 +-
mm/madvise.c | 21 +++-
mm/memcontrol.c | 12 +-
mm/memory-failure.c | 11 +-
mm/memory.c | 254 ++++++++++++++++++++++++++++++++-----------
mm/mempolicy.c | 6 +-
mm/migrate.c | 54 ++++-----
mm/mincore.c | 7 +-
mm/mlock.c | 1 +
mm/mmu_gather.c | 40 +++----
mm/mprotect.c | 11 +-
mm/mremap.c | 14 ++-
mm/page_vma_mapped.c | 4 +
mm/pagewalk.c | 15 ++-
mm/pgtable-generic.c | 1 +
mm/pte_ref.c | 141 ++++++++++++++++++++++++
mm/rmap.c | 10 ++
mm/swapfile.c | 3 +
mm/userfaultfd.c | 40 +++++--
41 files changed, 1186 insertions(+), 340 deletions(-)
create mode 100644 Documentation/vm/pte_ref.rst
create mode 100644 include/linux/pte_ref.h
create mode 100644 mm/pte_ref.c

--
2.11.0


2021-11-10 10:59:04

by Qi Zheng

Subject: [PATCH v3 13/15] mm/pte_ref: free user PTE page table pages

This commit introduces the CONFIG_FREE_USER_PTE option and
implements the actual logic behind the pte_ref dummy helpers.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm.h | 2 ++
include/linux/pgtable.h | 3 +-
include/linux/pte_ref.h | 53 ++++++++++++++++++++++++----
mm/Kconfig | 4 +++
mm/debug_vm_pgtable.c | 2 ++
mm/memory.c | 15 ++++++++
mm/pte_ref.c | 91 +++++++++++++++++++++++++++++++++++++++++++++----
7 files changed, 156 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 753a9435e0d0..18fbf9e0996a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,6 +437,7 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PTE_GET: A reference on the PTE page table page has been taken.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -468,6 +469,7 @@ enum fault_flag {
FAULT_FLAG_REMOTE = 1 << 7,
FAULT_FLAG_INSTRUCTION = 1 << 8,
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
+ FAULT_FLAG_PTE_GET = 1 << 10,
};

/*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c8f045705c1e..6ac51d58f11a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -480,7 +480,6 @@ static inline pte_t ptep_get_lockless(pte_t *ptep)
}
#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long address,
@@ -491,6 +490,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
return pmd;
}
#endif /* __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
unsigned long address,
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index b6d8335bdc59..8a26eaba83ef 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -8,6 +8,7 @@
#define _LINUX_PTE_REF_H

#include <linux/pgtable.h>
+#include <linux/page-flags.h>

enum pte_tryget_type {
TRYGET_SUCCESSED,
@@ -16,12 +17,49 @@ enum pte_tryget_type {
TRYGET_FAILED_HUGE_PMD,
};

-bool pte_get_unless_zero(pmd_t *pmd);
-enum pte_tryget_type pte_try_get(pmd_t *pmd);
void pte_put_vmf(struct vm_fault *vmf);
+enum pte_tryget_type pte_try_get(pmd_t *pmd);
+bool pte_get_unless_zero(pmd_t *pmd);
+
+#ifdef CONFIG_FREE_USER_PTE
+void free_user_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr);

static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
{
+ pte->pmd = pmd;
+ atomic_set(&pte->pte_refcount, count);
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+ return virt_to_page(pte)->pmd;
+}
+
+static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+ pmd_pgtable(old_pmd)->pmd = new_pmd;
+}
+
+static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
+{
+ pgtable_t pte = pmd_pgtable(*pmd);
+
+ VM_BUG_ON(!PageTable(pte));
+ atomic_add(nr, &pte->pte_refcount);
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned int nr)
+{
+ pgtable_t pte = pmd_pgtable(*pmd);
+
+ VM_BUG_ON(!PageTable(pte));
+ if (atomic_sub_and_test(nr, &pte->pte_refcount))
+ free_user_pte_table(mm, pmd, addr & PMD_MASK);
+}
+#else
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
}

static inline pmd_t *pte_to_pmd(pte_t *pte)
@@ -37,6 +75,12 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
{
}

+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned int nr)
+{
+}
+#endif /* CONFIG_FREE_USER_PTE */
+
/*
* pte_get - Increment refcount for the PTE page table.
* @pmd: a pointer to the pmd entry corresponding to the PTE page table.
@@ -66,11 +110,6 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
return pte_offset_map_lock(mm, pmd, address, ptlp);
}

-static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned int nr)
-{
-}
-
/*
* pte_put - Decrement refcount for the PTE page table.
* @mm: the mm_struct of the target address space.
diff --git a/mm/Kconfig b/mm/Kconfig
index 5c5508fafcec..44549d287869 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -898,6 +898,10 @@ config IO_MAPPING
config SECRETMEM
def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED

+config FREE_USER_PTE
+ def_bool y
+ depends on X86_64
+
source "mm/damon/Kconfig"

endmenu
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 52f006654664..757dd84ee254 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1049,8 +1049,10 @@ static void __init destroy_args(struct pgtable_debug_args *args)
/* Free page table entries */
if (args->start_ptep) {
pte_put(args->mm, args->start_pmdp, args->vaddr);
+#ifndef CONFIG_FREE_USER_PTE
pte_free(args->mm, args->start_ptep);
mm_dec_nr_ptes(args->mm);
+#endif
}

if (args->start_pmdp) {
diff --git a/mm/memory.c b/mm/memory.c
index e360ecd37a71..4d1ede78d1b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -219,6 +219,17 @@ static void check_sync_rss_stat(struct task_struct *task)

#endif /* SPLIT_RSS_COUNTING */

+#ifdef CONFIG_FREE_USER_PTE
+static inline void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+ unsigned long addr)
+{
+ /*
+ * We should never reach here since the PTE page tables are
+ * dynamically freed.
+ */
+ BUG();
+}
+#else
/*
* Note: this doesn't free the actual pages themselves. That
* has been handled earlier when unmapping all the memory regions.
@@ -231,6 +242,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
pte_free_tlb(tlb, token, addr);
mm_dec_nr_ptes(tlb->mm);
}
+#endif /* CONFIG_FREE_USER_PTE */

static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
unsigned long addr, unsigned long end,
@@ -4631,6 +4643,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
goto retry;
vmf->orig_pte = *vmf->pte;

+ if (IS_ENABLED(CONFIG_FREE_USER_PTE))
+ vmf->flags |= FAULT_FLAG_PTE_GET;
+
/*
* some architectures can have larger ptes than wordsize,
* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index de109905bc8f..728e61cea25e 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -7,7 +7,10 @@

#include <linux/pte_ref.h>
#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <asm/tlbflush.h>

+#ifdef CONFIG_FREE_USER_PTE
/*
* pte_get_unless_zero - Increment refcount for the PTE page table
* unless it is zero.
@@ -15,7 +18,10 @@
*/
bool pte_get_unless_zero(pmd_t *pmd)
{
- return true;
+ pgtable_t pte = pmd_pgtable(*pmd);
+
+ VM_BUG_ON(!PageTable(pte));
+ return atomic_inc_not_zero(&pte->pte_refcount);
}

/*
@@ -32,12 +38,20 @@ bool pte_get_unless_zero(pmd_t *pmd)
*/
enum pte_tryget_type pte_try_get(pmd_t *pmd)
{
- if (unlikely(pmd_none(*pmd)))
- return TRYGET_FAILED_NONE;
- if (unlikely(is_huge_pmd(*pmd)))
- return TRYGET_FAILED_HUGE_PMD;
+ int retval = TRYGET_SUCCESSED;
+ pmd_t pmdval;

- return TRYGET_SUCCESSED;
+ rcu_read_lock();
+ pmdval = READ_ONCE(*pmd);
+ if (unlikely(pmd_none(pmdval)))
+ retval = TRYGET_FAILED_NONE;
+ else if (unlikely(is_huge_pmd(pmdval)))
+ retval = TRYGET_FAILED_HUGE_PMD;
+ else if (!pte_get_unless_zero(&pmdval))
+ retval = TRYGET_FAILED_ZERO;
+ rcu_read_unlock();
+
+ return retval;
}

/*
@@ -52,4 +66,69 @@ enum pte_tryget_type pte_try_get(pmd_t *pmd)
*/
void pte_put_vmf(struct vm_fault *vmf)
{
+ if (!(vmf->flags & FAULT_FLAG_PTE_GET))
+ return;
+ vmf->flags &= ~FAULT_FLAG_PTE_GET;
+
+ pte_put(vmf->vma->vm_mm, vmf->pmd, vmf->address);
+}
+#else
+bool pte_get_unless_zero(pmd_t *pmd)
+{
+ return true;
+}
+
+enum pte_tryget_type pte_try_get(pmd_t *pmd)
+{
+ if (unlikely(pmd_none(*pmd)))
+ return TRYGET_FAILED_NONE;
+
+ if (unlikely(is_huge_pmd(*pmd)))
+ return TRYGET_FAILED_HUGE_PMD;
+
+ return TRYGET_SUCCESSED;
+}
+
+void pte_put_vmf(struct vm_fault *vmf)
+{
+}
+#endif /* CONFIG_FREE_USER_PTE */
+
+#ifdef CONFIG_DEBUG_VM
+static void pte_free_debug(pmd_t pmd)
+{
+ pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd);
+ int i = 0;
+
+ for (i = 0; i < PTRS_PER_PTE; i++)
+ BUG_ON(!pte_none(*ptep++));
+}
+#else
+static inline void pte_free_debug(pmd_t pmd)
+{
+}
+#endif
+
+static void pte_free_rcu(struct rcu_head *rcu)
+{
+ struct page *page = container_of(rcu, struct page, rcu_head);
+
+ pgtable_pte_page_dtor(page);
+ __free_page(page);
+}
+
+void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ spinlock_t *ptl;
+ pmd_t pmdval;
+
+ ptl = pmd_lock(mm, pmd);
+ pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
+ flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+ spin_unlock(ptl);
+
+ pte_free_debug(pmdval);
+ mm_dec_nr_ptes(mm);
+ call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
}
--
2.11.0

2021-11-10 10:59:11

by Qi Zheng

Subject: [PATCH v3 14/15] Documentation: add document for pte_ref

This commit adds a document for pte_ref under Documentation/vm/.

Signed-off-by: Qi Zheng <[email protected]>
---
Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 212 insertions(+)
create mode 100644 Documentation/vm/pte_ref.rst

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
new file mode 100644
index 000000000000..c5323a263464
--- /dev/null
+++ b/Documentation/vm/pte_ref.rst
@@ -0,0 +1,212 @@
+.. _pte_ref:
+
+============================================================================
+pte_ref: Tracking references to each user PTE page table page
+============================================================================
+
+.. contents:: :local:
+
+1. Preface
+==========
+
+Nowadays, in pursuit of high performance, applications mostly use
+high-performance user-mode memory allocators such as jemalloc or tcmalloc.
+These memory allocators use ``madvise(MADV_DONTNEED or MADV_FREE)`` to release
+physical memory for the following reasons::
+
+ First of all, we should hold as few write locks of mmap_lock as possible,
+ since the mmap_lock semaphore has long been a contention point in the memory
+ management subsystem. mmap()/munmap() hold the write lock, while
+ madvise(MADV_DONTNEED or MADV_FREE) holds the read lock, so using madvise()
+ instead of munmap() to release physical memory reduces contention on the
+ mmap_lock.
+
+ Secondly, after using madvise() to release physical memory, there is no need
+ to build VMAs and allocate page tables again when the same virtual addresses
+ are accessed later, which also saves some time.
+
+The following table shows the largest amount of user PTE page table memory
+that can be allocated by a single user process on a 32-bit and a 64-bit system.
+
++---------------------------+--------+---------+
+| | 32-bit | 64-bit |
++===========================+========+=========+
+| user PTE page table pages | 3 MiB | 512 GiB |
++---------------------------+--------+---------+
+| user PMD page table pages | 3 KiB | 1 GiB |
++---------------------------+--------+---------+
+
+(For 32-bit, assume a 3G user address space and a 4K page size; for 64-bit,
+assume a 48-bit address width and a 4K page size.)
+
+After using ``madvise()``, everything looks good, but as the table above shows,
+a single process can create a large number of PTE page tables on a 64-bit
+system, since neither ``MADV_DONTNEED`` nor ``MADV_FREE`` releases page table
+memory. And until the process exits or calls ``munmap()``, the kernel cannot
+reclaim these pages even if the PTE page tables no longer map anything.
+
+Therefore, we decided to introduce a reference count to manage the PTE page
+table life cycle, so that free PTE page table memory in the system can be
+released dynamically.
+
+2. The reference count of user PTE page table pages
+===================================================
+
+We introduce two members for the ``struct page`` of the user PTE page table
+page::
+
+ union {
+ pgtable_t pmd_huge_pte; /* protected by page->ptl */
+ pmd_t *pmd; /* PTE page only */
+ };
+ union {
+ struct mm_struct *pt_mm; /* x86 pgds only */
+ atomic_t pt_frag_refcount; /* powerpc */
+ atomic_t pte_refcount; /* PTE page only */
+ };
+
+The ``pmd`` member records the pmd entry that maps the user PTE page table
+page, and the ``pte_refcount`` member keeps track of how many references are
+held on the user PTE page table page.
+
+The following hold a reference on the user PTE page table page::
+
+ Any !pte_none() entry, such as a regular page table entry that maps a
+ physical page, a swap entry, a migration entry, etc.
+
+ Any visitor to the PTE page table entries, such as a page table walker.
+
+Any ``!pte_none()`` entry or visitor can be regarded as a user of its PTE page
+table page. When the ``pte_refcount`` drops to 0, no one is using the PTE page
+table page any more, so the now-empty PTE page table page can be released back
+to the system.
+
+3. Competitive relationship
+===========================
+
+Currently, user page tables are only released by ``free_pgtables()`` when the
+process exits or ``unmap_region()`` is called (e.g. the ``munmap()`` path). So
+other threads only need to ensure mutual exclusion with these paths to
+guarantee that the page table is not released. For example::
+
+ thread A thread B
+ page table walker munmap
+ ================= ======
+
+ mmap_read_lock()
+ if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+ pte_offset_map_lock()
+ *walk page table*
+ pte_unmap_unlock()
+ }
+ mmap_read_unlock()
+
+ mmap_write_lock_killable()
+ detach_vmas_to_be_unmapped()
+ unmap_region()
+ --> free_pgtables()
+
+But after we introduce the reference count for the user PTE page table page,
+this existing balance is broken: the page can be released at any time when its
+``pte_refcount`` drops to 0. Therefore, the following case may happen::
+
+ thread A thread B thread C
+ page table walker madvise(MADV_DONTNEED) page fault
+ ================= ====================== ==========
+
+ mmap_read_lock()
+ if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+
+ mmap_read_lock()
+ unmap_page_range()
+ --> zap_pte_range()
+ *the pte_refcount is reduced to 0*
+ --> *free PTE page table page*
+
+ /* broken!! */ mmap_read_lock()
+ pte_offset_map_lock()
+
+As we can see, threads A, B and C all hold the read lock of mmap_lock, so they
+can execute concurrently. When thread B releases the PTE page table page, the
+value in the corresponding pmd entry becomes unstable: it may be none, a huge
+pmd, or map a new PTE page table page again. This can cause system chaos and
+even a panic.
+
+So, as described in the section "The reference count of user PTE page table
+pages", we need to take a reference on the PTE page table page before walking
+the page table; then the system becomes orderly again::
+
+ thread A thread B
+ page table walker madvise(MADV_DONTNEED)
+ ================= ======================
+
+ mmap_read_lock()
+ if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+ pte_try_get()
+ --> pte_get_unless_zero
+ *if successfully, then:*
+
+ mmap_read_lock()
+ unmap_page_range()
+ --> zap_pte_range()
+ *the pte_refcount is reduced to 1*
+
+ pte_offset_map_lock()
+ *walk page table*
+ pte_unmap_unlock()
+ pte_put()
+ --> *the pte_refcount is reduced to 0*
+ --> *free PTE page table page*
+
+There is also a lockless scenario (such as fast GUP). Fortunately, we don't
+need any additional operations to keep the system in order. Take fast GUP as
+an example::
+
+ thread A thread B
+ fast GUP madvise(MADV_DONTNEED)
+ ======== ======================
+
+ get_user_pages_fast_only()
+ --> local_irq_save();
+ *free PTE page table page*
+ --> unhook page
+ /* The CPU running thread A has disabled
+ * local interrupts and cannot respond to
+ * the IPI, so thread B blocks here */
+ TLB invalidate page
+ gup_pgd_range();
+ local_irq_restore();
+ *free page*
+
+4. Helpers
+==========
+
++---------------------+-------------------------------------------------+
+| pte_ref_init | Initialize the pte_refcount and pmd |
++---------------------+-------------------------------------------------+
+| pte_to_pmd | Get the corresponding pmd |
++---------------------+-------------------------------------------------+
+| pte_update_pmd | Update the corresponding pmd |
++---------------------+-------------------------------------------------+
+| pte_get | Increment a pte_refcount |
++---------------------+-------------------------------------------------+
+| pte_get_many | Add a value to a pte_refcount |
++---------------------+-------------------------------------------------+
+| pte_get_unless_zero | Increment a pte_refcount unless it is 0 |
++---------------------+-------------------------------------------------+
+| pte_try_get | Try to increment a pte_refcount |
++---------------------+-------------------------------------------------+
+| pte_tryget_map | Try to increment a pte_refcount before |
+| | pte_offset_map() |
++---------------------+-------------------------------------------------+
+| pte_tryget_map_lock | Try to increment a pte_refcount before |
+| | pte_offset_map_lock() |
++---------------------+-------------------------------------------------+
+| pte_put | Decrement a pte_refcount |
++---------------------+-------------------------------------------------+
+| pte_put_many | Sub a value to a pte_refcount |
++---------------------+-------------------------------------------------+
+| pte_put_vmf | Decrement a pte_refcount in the page fault path |
++---------------------+-------------------------------------------------+
--
2.11.0

2021-11-10 10:59:15

by Qi Zheng

Subject: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper

Currently the following judgment is repeated in several places in the
code:

is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)

which determines whether *pmd is a huge pmd, so introduce an is_huge_pmd()
helper to deduplicate these checks.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/huge_mm.h | 10 +++++++---
mm/huge_memory.c | 3 +--
mm/memory.c | 5 ++---
mm/mprotect.c | 2 +-
mm/mremap.c | 3 +--
5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..b37a89180846 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -199,8 +199,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
- || pmd_devmap(*____pmd)) \
+ if (is_huge_pmd(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
} while (0)
@@ -232,11 +231,16 @@ static inline int is_swap_pmd(pmd_t pmd)
return !pmd_none(pmd) && !pmd_present(pmd);
}

+static inline int is_huge_pmd(pmd_t pmd)
+{
+ return is_swap_pmd(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd);
+}
+
/* mmap_lock must be held on entry */
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_huge_pmd(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..e76ee2e1e423 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1832,8 +1832,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
+ if (likely(is_huge_pmd(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 855486fff526..b00cd60fc368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1146,8 +1146,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
- || pmd_devmap(*src_pmd)) {
+ if (is_huge_pmd(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1441,7 +1440,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_huge_pmd(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
else if (zap_huge_pmd(tlb, vma, pmd, addr))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e552f5e0ccbd..2d5064a4631c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -257,7 +257,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);
}

- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_huge_pmd(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index 002eec83e91e..c6e9da09dd0a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,8 +532,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
- if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
- pmd_devmap(*old_pmd)) {
+ if (is_huge_pmd(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
old_pmd, new_pmd, need_rmap_locks))
--
2.11.0

2021-11-10 10:59:32

by Qi Zheng

[permalink] [raw]
Subject: [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h

pte_offset_map() is in include/linux/pgtable.h, so move its
friend pte_offset_map_lock() to pgtable.h as well.

pte_lockptr() is required by pte_offset_map_lock(), so
also move {pte,pmd,pud}_lockptr() to pgtable.h.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm.h | 149 ------------------------------------------------
include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 149 insertions(+), 149 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..706da081b9f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2284,70 +2284,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
}
#endif /* CONFIG_MMU */

-#if USE_SPLIT_PTE_PTLOCKS
-#if ALLOC_SPLIT_PTLOCKS
-void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
-extern void ptlock_free(struct page *page);
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
- return page->ptl;
-}
-#else /* ALLOC_SPLIT_PTLOCKS */
-static inline void ptlock_cache_init(void)
-{
-}
-
-static inline bool ptlock_alloc(struct page *page)
-{
- return true;
-}
-
-static inline void ptlock_free(struct page *page)
-{
-}
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
- return &page->ptl;
-}
-#endif /* ALLOC_SPLIT_PTLOCKS */
-
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return ptlock_ptr(pmd_page(*pmd));
-}
-
-static inline bool ptlock_init(struct page *page)
-{
- /*
- * prep_new_page() initialize page->private (and therefore page->ptl)
- * with 0. Make sure nobody took it in use in between.
- *
- * It can happen if arch try to use slab for page table allocation:
- * slab code uses page->slab_cache, which share storage with page->ptl.
- */
- VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
- if (!ptlock_alloc(page))
- return false;
- spin_lock_init(ptlock_ptr(page));
- return true;
-}
-
-#else /* !USE_SPLIT_PTE_PTLOCKS */
-/*
- * We use mm->page_table_lock to guard all pagetable pages of the mm.
- */
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return &mm->page_table_lock;
-}
-static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
-static inline void ptlock_free(struct page *page) {}
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
static inline void pgtable_init(void)
{
ptlock_cache_init();
@@ -2370,20 +2306,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}

-#define pte_offset_map_lock(mm, pmd, address, ptlp) \
-({ \
- spinlock_t *__ptl = pte_lockptr(mm, pmd); \
- pte_t *__pte = pte_offset_map(pmd, address); \
- *(ptlp) = __ptl; \
- spin_lock(__ptl); \
- __pte; \
-})
-
-#define pte_unmap_unlock(pte, ptl) do { \
- spin_unlock(ptl); \
- pte_unmap(pte); \
-} while (0)
-
#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))

#define pte_alloc_map(mm, pmd, address) \
@@ -2397,58 +2319,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
NULL: pte_offset_kernel(pmd, address))

-#if USE_SPLIT_PMD_PTLOCKS
-
-static struct page *pmd_to_page(pmd_t *pmd)
-{
- unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
- return virt_to_page((void *)((unsigned long) pmd & mask));
-}
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return ptlock_ptr(pmd_to_page(pmd));
-}
-
-static inline bool pmd_ptlock_init(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- page->pmd_huge_pte = NULL;
-#endif
- return ptlock_init(page);
-}
-
-static inline void pmd_ptlock_free(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
-#endif
- ptlock_free(page);
-}
-
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
-
-#else
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
- return &mm->page_table_lock;
-}
-
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
-
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
-
-#endif
-
-static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
-{
- spinlock_t *ptl = pmd_lockptr(mm, pmd);
- spin_lock(ptl);
- return ptl;
-}
-
static inline bool pgtable_pmd_page_ctor(struct page *page)
{
if (!pmd_ptlock_init(page))
@@ -2465,25 +2335,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}

-/*
- * No scalability reason to split PUD locks yet, but follow the same pattern
- * as the PMD locks to make it easier if we decide to. The VM should not be
- * considered ready to switch to split PUD locks yet; there may be places
- * which need to be converted from page_table_lock.
- */
-static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
-{
- return &mm->page_table_lock;
-}
-
-static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
-{
- spinlock_t *ptl = pud_lockptr(mm, pud);
-
- spin_lock(ptl);
- return ptl;
-}
-
extern void __init pagecache_init(void);
extern void __init free_area_init_memoryless_node(int nid);
extern void free_initmem(void);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..c8f045705c1e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -84,6 +84,141 @@ static inline unsigned long pud_index(unsigned long address)
#define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
#endif

+#if USE_SPLIT_PTE_PTLOCKS
+#if ALLOC_SPLIT_PTLOCKS
+void __init ptlock_cache_init(void);
+extern bool ptlock_alloc(struct page *page);
+extern void ptlock_free(struct page *page);
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+ return page->ptl;
+}
+#else /* ALLOC_SPLIT_PTLOCKS */
+static inline void ptlock_cache_init(void)
+{
+}
+
+static inline bool ptlock_alloc(struct page *page)
+{
+ return true;
+}
+
+static inline void ptlock_free(struct page *page)
+{
+}
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+ return &page->ptl;
+}
+#endif /* ALLOC_SPLIT_PTLOCKS */
+
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return ptlock_ptr(pmd_page(*pmd));
+}
+
+static inline bool ptlock_init(struct page *page)
+{
+ /*
+ * prep_new_page() initialize page->private (and therefore page->ptl)
+ * with 0. Make sure nobody took it in use in between.
+ *
+ * It can happen if arch try to use slab for page table allocation:
+ * slab code uses page->slab_cache, which share storage with page->ptl.
+ */
+ VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+ if (!ptlock_alloc(page))
+ return false;
+ spin_lock_init(ptlock_ptr(page));
+ return true;
+}
+
+#else /* !USE_SPLIT_PTE_PTLOCKS */
+/*
+ * We use mm->page_table_lock to guard all pagetable pages of the mm.
+ */
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &mm->page_table_lock;
+}
+static inline void ptlock_cache_init(void) {}
+static inline bool ptlock_init(struct page *page) { return true; }
+static inline void ptlock_free(struct page *page) {}
+#endif /* USE_SPLIT_PTE_PTLOCKS */
+
+#if USE_SPLIT_PMD_PTLOCKS
+
+static struct page *pmd_to_page(pmd_t *pmd)
+{
+ unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+ return virt_to_page((void *)((unsigned long) pmd & mask));
+}
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return ptlock_ptr(pmd_to_page(pmd));
+}
+
+static inline bool pmd_ptlock_init(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ page->pmd_huge_pte = NULL;
+#endif
+ return ptlock_init(page);
+}
+
+static inline void pmd_ptlock_free(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+#endif
+ ptlock_free(page);
+}
+
+#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+
+#else
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &mm->page_table_lock;
+}
+
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
+
+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+
+#endif
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl = pmd_lockptr(mm, pmd);
+ spin_lock(ptl);
+ return ptl;
+}
+
+/*
+ * No scalability reason to split PUD locks yet, but follow the same pattern
+ * as the PMD locks to make it easier if we decide to. The VM should not be
+ * considered ready to switch to split PUD locks yet; there may be places
+ * which need to be converted from page_table_lock.
+ */
+static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
+{
+ return &mm->page_table_lock;
+}
+
+static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
+{
+ spinlock_t *ptl = pud_lockptr(mm, pud);
+
+ spin_lock(ptl);
+ return ptl;
+}
+
#ifndef pte_offset_kernel
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
@@ -102,6 +237,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
#define pte_unmap(pte) ((void)(pte)) /* NOP */
#endif

+#define pte_offset_map_lock(mm, pmd, address, ptlp) \
+({ \
+ spinlock_t *__ptl = pte_lockptr(mm, pmd); \
+ pte_t *__pte = pte_offset_map(pmd, address); \
+ *(ptlp) = __ptl; \
+ spin_lock(__ptl); \
+ __pte; \
+})
+
+#define pte_unmap_unlock(pte, ptl) do { \
+ spin_unlock(ptl); \
+ pte_unmap(pte); \
+} while (0)
+
/* Find an entry in the second-level page table.. */
#ifndef pmd_offset
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
--
2.11.0

2021-11-10 10:59:33

by Qi Zheng

Subject: [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends

When we call __pte_alloc() or one of its friends, a huge pmd might
be created by a different thread. This is why pmd_trans_unstable()
currently has to be called after __pte_alloc() or one of its friends
returns.

This patch adds a pmd_installed_type return value to __pte_alloc() and
its friends, so that we can detect the huge pmd from the return value
instead of calling pmd_trans_unstable() again.

This patch has no functional change; it is just preparation for the
following patches.
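
For example, the new calling convention looks roughly like the sketch below
(example_fault_path() is a hypothetical caller used only for illustration; it
mirrors the do_anonymous_page() hunk in this patch):

static vm_fault_t example_fault_path(struct vm_fault *vmf)
{
        int alloc_ret = pte_alloc(vmf->vma->vm_mm, vmf->pmd);

        if (alloc_ret < 0)                      /* allocation failed (-ENOMEM) */
                return VM_FAULT_OOM;
        if (unlikely(alloc_ret == INSTALLED_HUGE_PMD))
                return 0;                       /* see comment in handle_pte_fault() */

        /*
         * alloc_ret == INSTALLED_PTE: a PTE page table is in place, so the
         * PTEs can now be mapped without calling pmd_trans_unstable() again.
         */
        return 0;
}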

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/mm.h | 20 +++++++++++++++++---
mm/debug_vm_pgtable.c | 2 +-
mm/filemap.c | 11 +++++++----
mm/gup.c | 2 +-
mm/internal.h | 3 ++-
mm/memory.c | 39 ++++++++++++++++++++++++++-------------
mm/migrate.c | 17 ++---------------
mm/mremap.c | 2 +-
mm/userfaultfd.c | 24 +++++++++++++++---------
9 files changed, 72 insertions(+), 48 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 706da081b9f8..52f36fde2f11 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2306,13 +2306,27 @@ static inline void pgtable_pte_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}

-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+enum pmd_installed_type {
+ INSTALLED_PTE,
+ INSTALLED_HUGE_PMD,
+};
+
+static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+{
+ if (unlikely(pmd_none(*(pmd))))
+ return __pte_alloc(mm, pmd);
+ if (unlikely(is_huge_pmd(*pmd)))
+ return INSTALLED_HUGE_PMD;
+
+ return INSTALLED_PTE;
+}
+#define pte_alloc pte_alloc

#define pte_alloc_map(mm, pmd, address) \
- (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+ (pte_alloc(mm, pmd) < 0 ? NULL : pte_offset_map(pmd, address))

#define pte_alloc_map_lock(mm, pmd, address, ptlp) \
- (pte_alloc(mm, pmd) ? \
+ (pte_alloc(mm, pmd) < 0 ? \
NULL : pte_offset_map_lock(mm, pmd, address, ptlp))

#define pte_alloc_kernel(pmd, address) \
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 228e3954b90c..b8322c55e65d 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1170,7 +1170,7 @@ static int __init init_args(struct pgtable_debug_args *args)
args->start_pmdp = pmd_offset(args->pudp, 0UL);
WARN_ON(!args->start_pmdp);

- if (pte_alloc(args->mm, args->pmdp)) {
+ if (pte_alloc(args->mm, args->pmdp) < 0) {
pr_err("Failed to allocate pte entries\n");
ret = -ENOMEM;
goto error;
diff --git a/mm/filemap.c b/mm/filemap.c
index ff8d19b7ce1d..23363f8ddbbe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,12 +3217,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
}
}

- if (pmd_none(*vmf->pmd))
- pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+ if (pmd_none(*vmf->pmd)) {
+ int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);

- /* See comment in handle_pte_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd))
+ if (unlikely(ret == INSTALLED_HUGE_PMD))
+ goto out;
+ } else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ /* See comment in handle_pte_fault() */
goto out;
+ }

return false;

diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..2def775232a3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -699,7 +699,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
} else {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
- ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
+ ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0;
}

return ret ? ERR_PTR(ret) :
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..474d6e3443f8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -67,7 +67,8 @@ bool __folio_end_writeback(struct folio *folio);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+ pgtable_t *pte);

static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
{
diff --git a/mm/memory.c b/mm/memory.c
index bec6a5d5ee7c..8a39c0e58324 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -437,8 +437,10 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
}
}

-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+ pgtable_t *pte)
{
+ int ret = INSTALLED_PTE;
spinlock_t *ptl = pmd_lock(mm, pmd);

if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
@@ -459,20 +461,26 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
pmd_populate(mm, pmd, *pte);
*pte = NULL;
+ } else if (is_huge_pmd(*pmd)) {
+ /* See comment in handle_pte_fault() */
+ ret = INSTALLED_HUGE_PMD;
}
spin_unlock(ptl);
+
+ return ret;
}

int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
{
+ enum pmd_installed_type ret;
pgtable_t new = pte_alloc_one(mm);
if (!new)
return -ENOMEM;

- pmd_install(mm, pmd, &new);
+ ret = pmd_install(mm, pmd, &new);
if (new)
pte_free(mm, new);
- return 0;
+ return ret;
}

int __pte_alloc_kernel(pmd_t *pmd)
@@ -1813,7 +1821,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,

/* Allocate the PTE if necessary; takes PMD lock once only. */
ret = -ENOMEM;
- if (pte_alloc(mm, pmd))
+ if (pte_alloc(mm, pmd) < 0)
goto out;

while (pages_to_write_in_pmd) {
@@ -3713,6 +3721,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
struct page *page;
vm_fault_t ret = 0;
pte_t entry;
+ int alloc_ret;

/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
@@ -3728,11 +3737,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
*
* Here we only have mmap_read_lock(mm).
*/
- if (pte_alloc(vma->vm_mm, vmf->pmd))
+ alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);
+ if (alloc_ret < 0)
return VM_FAULT_OOM;
-
/* See comment in handle_pte_fault() */
- if (unlikely(pmd_trans_unstable(vmf->pmd)))
+ if (unlikely(alloc_ret == INSTALLED_HUGE_PMD))
return 0;

/* Use the zero-page for reads */
@@ -4023,6 +4032,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
}

if (pmd_none(*vmf->pmd)) {
+ int alloc_ret;
+
if (PageTransCompound(page)) {
ret = do_set_pmd(vmf, page);
if (ret != VM_FAULT_FALLBACK)
@@ -4030,14 +4041,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
}

if (vmf->prealloc_pte)
- pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
- else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
- return VM_FAULT_OOM;
- }
+ alloc_ret = pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+ else
+ alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);

- /* See comment in handle_pte_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd))
+ if (unlikely(alloc_ret != INSTALLED_PTE))
+ return alloc_ret < 0 ? VM_FAULT_OOM : 0;
+ } else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ /* See comment in handle_pte_fault() */
return 0;
+ }

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index cf25b00f03c8..bdfdfd3b50be 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2731,21 +2731,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
goto abort;

- /*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
- */
- if (pte_alloc(mm, pmdp))
- goto abort;
-
- /* See the comment in pte_alloc_one_map() */
- if (unlikely(pmd_trans_unstable(pmdp)))
+ /* See the comment in do_anonymous_page() */
+ if (unlikely(pte_alloc(mm, pmdp) != INSTALLED_PTE))
goto abort;

if (unlikely(anon_vma_prepare(vma)))
diff --git a/mm/mremap.c b/mm/mremap.c
index c6e9da09dd0a..fc5c56858883 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -551,7 +551,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
continue;
}

- if (pte_alloc(new_vma->vm_mm, new_pmd))
+ if (pte_alloc(new_vma->vm_mm, new_pmd) < 0)
break;
move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
new_pmd, new_addr, need_rmap_locks);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..2cea08e7f076 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -592,15 +592,21 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
err = -EEXIST;
break;
}
- if (unlikely(pmd_none(dst_pmdval)) &&
- unlikely(__pte_alloc(dst_mm, dst_pmd))) {
- err = -ENOMEM;
- break;
- }
- /* If an huge pmd materialized from under us fail */
- if (unlikely(pmd_trans_huge(*dst_pmd))) {
- err = -EFAULT;
- break;
+
+ if (unlikely(pmd_none(dst_pmdval))) {
+ int ret = __pte_alloc(dst_mm, dst_pmd);
+
+ /*
+ * If there is not enough memory or an huge pmd
+ * materialized from under us
+ */
+ if (unlikely(ret < 0)) {
+ err = -ENOMEM;
+ break;
+ } else if (unlikely(ret == INSTALLED_HUGE_PMD)) {
+ err = -EFAULT;
+ break;
+ }
}

BUG_ON(pmd_none(*dst_pmd));
--
2.11.0

2021-11-10 11:00:00

by Qi Zheng

Subject: [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page

When we split a PMD-mapped THP into a PTE-mapped THP, we should initialize
the refcount of the withdrawn PTE page table page to HPAGE_PMD_NR, since
after the split each of its HPAGE_PMD_NR !pte_none() entries holds one
reference. This ensures that the PTE page table page can be released once
it becomes empty (the refcount drops to 0).

Signed-off-by: Qi Zheng <[email protected]>
---
mm/pgtable-generic.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..523053e09dfa 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -186,6 +186,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
struct page, lru);
if (pmd_huge_pte(mm, pmdp))
list_del(&pgtable->lru);
+ pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR);
return pgtable;
}
#endif
--
2.11.0

2021-11-10 11:00:49

by Qi Zheng

Subject: [PATCH v3 15/15] mm/pte_ref: use mmu_gather to free PTE page table pages

In unmap_region() and other paths, we can reuse @tlb to free the
PTE page tables, which reduces the number of TLB flushes.

Signed-off-by: Qi Zheng <[email protected]>
---
Documentation/vm/pte_ref.rst | 58 +++++++++++++++++++++++---------------------
arch/x86/Kconfig | 2 +-
include/linux/pte_ref.h | 34 ++++++++++++++++++++------
mm/madvise.c | 4 +--
mm/memory.c | 4 +--
mm/mmu_gather.c | 40 +++++++++++++-----------------
mm/pte_ref.c | 13 +++++++---
7 files changed, 90 insertions(+), 65 deletions(-)

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
index c5323a263464..d304c0bfaae1 100644
--- a/Documentation/vm/pte_ref.rst
+++ b/Documentation/vm/pte_ref.rst
@@ -183,30 +183,34 @@ GUP as an example::
4. Helpers
==========

-+---------------------+-------------------------------------------------+
-| pte_ref_init | Initialize the pte_refcount and pmd |
-+---------------------+-------------------------------------------------+
-| pte_to_pmd | Get the corresponding pmd |
-+---------------------+-------------------------------------------------+
-| pte_update_pmd | Update the corresponding pmd |
-+---------------------+-------------------------------------------------+
-| pte_get | Increment a pte_refcount |
-+---------------------+-------------------------------------------------+
-| pte_get_many | Add a value to a pte_refcount |
-+---------------------+-------------------------------------------------+
-| pte_get_unless_zero | Increment a pte_refcount unless it is 0 |
-+---------------------+-------------------------------------------------+
-| pte_try_get | Try to increment a pte_refcount |
-+---------------------+-------------------------------------------------+
-| pte_tryget_map | Try to increment a pte_refcount before |
-| | pte_offset_map() |
-+---------------------+-------------------------------------------------+
-| pte_tryget_map_lock | Try to increment a pte_refcount before |
-| | pte_offset_map_lock() |
-+---------------------+-------------------------------------------------+
-| pte_put | Decrement a pte_refcount |
-+---------------------+-------------------------------------------------+
-| pte_put_many | Sub a value to a pte_refcount |
-+---------------------+-------------------------------------------------+
-| pte_put_vmf | Decrement a pte_refcount in the page fault path |
-+---------------------+-------------------------------------------------+
++---------------------+------------------------------------------------------+
+| pte_ref_init | Initialize the pte_refcount and pmd |
++---------------------+------------------------------------------------------+
+| pte_to_pmd | Get the corresponding pmd |
++---------------------+------------------------------------------------------+
+| pte_update_pmd | Update the corresponding pmd |
++---------------------+------------------------------------------------------+
+| pte_get | Increment a pte_refcount |
++---------------------+------------------------------------------------------+
+| pte_get_many | Add a value to a pte_refcount |
++---------------------+------------------------------------------------------+
+| pte_get_unless_zero | Increment a pte_refcount unless it is 0 |
++---------------------+------------------------------------------------------+
+| pte_try_get | Try to increment a pte_refcount |
++---------------------+------------------------------------------------------+
+| pte_tryget_map | Try to increment a pte_refcount before |
+| | pte_offset_map() |
++---------------------+------------------------------------------------------+
+| pte_tryget_map_lock | Try to increment a pte_refcount before |
+| | pte_offset_map_lock() |
++---------------------+------------------------------------------------------+
+| __pte_put | Decrement a pte_refcount |
++---------------------+------------------------------------------------------+
+| __pte_put_many | Sub a value to a pte_refcount |
++---------------------+------------------------------------------------------+
+| pte_put | Decrement a pte_refcount(without tlb parameter) |
++---------------------+------------------------------------------------------+
+| pte_put_many | Sub a value to a pte_refcount(without tlb parameter) |
++---------------------+------------------------------------------------------+
+| pte_put_vmf | Decrement a pte_refcount in the page fault path |
++---------------------+------------------------------------------------------+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ca5bfe83ec61..69ea13437947 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -233,7 +233,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
- select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
+ select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT || FREE_USER_PTE
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 8a26eaba83ef..dc3923bb38f6 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -22,7 +22,8 @@ enum pte_tryget_type pte_try_get(pmd_t *pmd);
bool pte_get_unless_zero(pmd_t *pmd);

#ifdef CONFIG_FREE_USER_PTE
-void free_user_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr);
+void free_user_pte_table(struct mmu_gather *tlb, struct mm_struct *mm,
+ pmd_t *pmd, unsigned long addr);

static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
{
@@ -48,14 +49,21 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
atomic_add(nr, &pte->pte_refcount);
}

-static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned int nr)
+static inline void __pte_put_many(struct mmu_gather *tlb, struct mm_struct *mm,
+ pmd_t *pmd, unsigned long addr,
+ unsigned int nr)
{
pgtable_t pte = pmd_pgtable(*pmd);

VM_BUG_ON(!PageTable(pte));
if (atomic_sub_and_test(nr, &pte->pte_refcount))
- free_user_pte_table(mm, pmd, addr & PMD_MASK);
+ free_user_pte_table(tlb, mm, pmd, addr & PMD_MASK);
+}
+
+static inline void __pte_put(struct mmu_gather *tlb, struct mm_struct *mm,
+ pmd_t *pmd, unsigned long addr)
+{
+ __pte_put_many(tlb, mm, pmd, addr, 1);
}
#else
static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
@@ -75,8 +83,14 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
{
}

-static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned int nr)
+static inline void __pte_put_many(struct mmu_gather *tlb, struct mm_struct *mm,
+ pmd_t *pmd, unsigned long addr,
+ unsigned int nr)
+{
+}
+
+static inline void __pte_put(struct mmu_gather *tlb, struct mm_struct *mm,
+ pmd_t *pmd, unsigned long addr)
{
}
#endif /* CONFIG_FREE_USER_PTE */
@@ -110,6 +124,12 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
return pte_offset_map_lock(mm, pmd, address, ptlp);
}

+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned int nr)
+{
+ __pte_put_many(NULL, mm, pmd, addr, nr);
+}
+
/*
* pte_put - Decrement refcount for the PTE page table.
* @mm: the mm_struct of the target address space.
@@ -120,7 +140,7 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
*/
static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
- pte_put_many(mm, pmd, addr, 1);
+ __pte_put(NULL, mm, pmd, addr);
}

#endif
diff --git a/mm/madvise.c b/mm/madvise.c
index 5cf2832abb98..b51254305bb2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -477,7 +477,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,

arch_leave_lazy_mmu_mode();
pte_unmap_unlock(orig_pte, ptl);
- pte_put(vma->vm_mm, pmd, start);
+ __pte_put(tlb, vma->vm_mm, pmd, start);
if (pageout)
reclaim_pages(&page_list);
cond_resched();
@@ -710,7 +710,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(orig_pte, ptl);
if (nr_put)
- pte_put_many(mm, pmd, start, nr_put);
+ __pte_put_many(tlb, mm, pmd, start, nr_put);
cond_resched();
next:
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 4d1ede78d1b0..1bdae3b0f877 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1469,7 +1469,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}

if (nr_put)
- pte_put_many(mm, pmd, start, nr_put);
+ __pte_put_many(tlb, mm, pmd, start, nr_put);

return addr;
}
@@ -1515,7 +1515,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
if (pte_try_get(pmd))
goto next;
next = zap_pte_range(tlb, vma, pmd, addr, next, details);
- pte_put(tlb->mm, pmd, addr);
+ __pte_put(tlb, tlb->mm, pmd, addr);
next:
cond_resched();
} while (pmd++, addr = next, addr != end);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 1b9837419bf9..1bd9fa889421 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -134,42 +134,42 @@ static void __tlb_remove_table_free(struct mmu_table_batch *batch)
*
*/

-static void tlb_remove_table_smp_sync(void *arg)
+static void tlb_remove_table_rcu(struct rcu_head *head)
{
- /* Simply deliver the interrupt */
+ __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
}

-static void tlb_remove_table_sync_one(void)
+static void tlb_remove_table_free(struct mmu_table_batch *batch)
{
- /*
- * This isn't an RCU grace period and hence the page-tables cannot be
- * assumed to be actually RCU-freed.
- *
- * It is however sufficient for software page-table walkers that rely on
- * IRQ disabling.
- */
- smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+ call_rcu(&batch->rcu, tlb_remove_table_rcu);
}

-static void tlb_remove_table_rcu(struct rcu_head *head)
+static void tlb_remove_table_one_rcu(struct rcu_head *head)
{
- __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
+ struct page *page = container_of(head, struct page, rcu_head);
+
+ __tlb_remove_table(page);
}

-static void tlb_remove_table_free(struct mmu_table_batch *batch)
+static void tlb_remove_table_one(void *table)
{
- call_rcu(&batch->rcu, tlb_remove_table_rcu);
+ pgtable_t page = (pgtable_t)table;
+
+ call_rcu(&page->rcu_head, tlb_remove_table_one_rcu);
}

#else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */

-static void tlb_remove_table_sync_one(void) { }
-
static void tlb_remove_table_free(struct mmu_table_batch *batch)
{
__tlb_remove_table_free(batch);
}

+static void tlb_remove_table_one(void *table)
+{
+ __tlb_remove_table(table);
+}
+
#endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */

/*
@@ -187,12 +187,6 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
}
}

-static void tlb_remove_table_one(void *table)
-{
- tlb_remove_table_sync_one();
- __tlb_remove_table(table);
-}
-
static void tlb_table_flush(struct mmu_gather *tlb)
{
struct mmu_table_batch **batch = &tlb->batch;
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 728e61cea25e..f9650ad23c7c 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -8,6 +8,8 @@
#include <linux/pte_ref.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
#include <asm/tlbflush.h>

#ifdef CONFIG_FREE_USER_PTE
@@ -117,7 +119,8 @@ static void pte_free_rcu(struct rcu_head *rcu)
__free_page(page);
}

-void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+void free_user_pte_table(struct mmu_gather *tlb, struct mm_struct *mm,
+ pmd_t *pmd, unsigned long addr)
{
struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
spinlock_t *ptl;
@@ -125,10 +128,14 @@ void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)

ptl = pmd_lock(mm, pmd);
pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
- flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+ if (!tlb)
+ flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+ else
+ pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
spin_unlock(ptl);

pte_free_debug(pmdval);
mm_dec_nr_ptes(mm);
- call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
+ if (!tlb)
+ call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
}
--
2.11.0
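
A minimal usage sketch (not part of the patch, not compile-tested) of how a
process page table walker is expected to pair the helpers declared above; it
assumes pte_tryget_map_lock() returns NULL when the PTE page has already been
freed or is being freed:

#include <linux/mm.h>
#include <linux/pte_ref.h>

static void walk_one_pte_table(struct mm_struct *mm, pmd_t *pmd,
                               unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        /* Takes a reference on the PTE page, then maps and locks it. */
        pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return;

        /* ... operate on the PTEs while holding the PTL ... */

        pte_unmap_unlock(pte, ptl);
        /* Drop the reference taken by pte_tryget_map_lock(). */
        pte_put(mm, pmd, addr);
}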

2021-11-10 12:58:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:

> In this patch series, we add a pte_refcount field to the struct page of page
> table to track how many users of PTE page table. Similar to the mechanism of
> page refcount, the user of PTE page table should hold a refcount to it before
> accessing. The PTE page table page will be freed when the last refcount is
> dropped.

So, this approach basically adds two atomics on every PTE map

If I have it right the reason that zap cannot clean the PTEs today is
because zap cannot obtain the mmap lock due to a lock ordering issue
with the inode lock vs mmap lock.

If it could obtain the mmap lock then it could do the zap using the
write side as unmapping a vma does.

Rather than adding a new "lock" to every PTE I wonder if it would be
more efficient to break up the mmap lock and introduce a specific
rwsem for the page table itself, in addition to the PTL. Currently the
mmap lock is protecting both the vma list and the page table.

I think that would allow the lock ordering issue to be resolved and
zap could obtain a page table rwsem.

Compared to two atomics per PTE this would just be two atomics per
page table walk operation; it is conceptually a lot simpler, and would
allow freeing all the page table levels, not just PTEs.

?

Jason

2021-11-10 13:27:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On 10.11.21 13:56, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>
>> In this patch series, we add a pte_refcount field to the struct page of page
>> table to track how many users of PTE page table. Similar to the mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The PTE page table page will be freed when the last refcount is
>> dropped.
>
> So, this approach basically adds two atomics on every PTE map
>
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

There are different ways to zap: madvise(DONTNEED) vs
fallocate(PUNCH_HOLE). It depends on "from where" we're actually
coming: a process page table walker or the rmap.

The way locking currently works doesn't allow removing a page table
just by holding the mmap lock, not even in write mode. You'll also need
to hold the respective rmap locks -- which implies that reclaiming page
tables crossing VMAs is "problematic". Take a look at khugepaged which
has to play quite some tricks to remove a page table.

And there are other ways we can create empty page tables via the rmap,
like reclaim/writeback, although they are mostly a secondary concern.

>
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.
>
> Rather than adding a new "lock" to ever PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

There is the rmap side of things as well. At least the rmap won't
reclaim alloc/free page tables, but it will walk page tables while
holding the respective rmap lock.

>
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
>
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

Another alternative is to not do it in the kernel automatically, but
instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
called by user space explicitly once it's reasonable. While this will
work for the obvious madvise(DONTNEED) users -- like memory allocators
-- that zap memory, it's a bit more complicated once shared memory is
involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
work for many use cases that want to optimize memory consumption for
sparse memory mappings.

Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
are most certainly the most important piece.
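
(For reference, the arithmetic behind those numbers, assuming 4 KiB pages and
8-byte entries, i.e. 512 entries per table: PTE tables cover 2 MiB each, so
1 TiB needs 1 TiB / 2 MiB = 524288 tables * 4 KiB = 2 GiB; PMD tables cover
1 GiB each, so 1 TiB needs 1024 tables * 4 KiB = 4 MiB.)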

--
Thanks,

David / dhildenb

2021-11-10 13:56:46

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages



On 11/10/21 8:56 PM, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>
>> In this patch series, we add a pte_refcount field to the struct page of page
>> table to track how many users of PTE page table. Similar to the mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The PTE page table page will be freed when the last refcount is
>> dropped.
>
> So, this approach basically adds two atomics on every PTE map
>
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

Currently, both MADV_DONTNEED and MADV_FREE obtain the read side of
mmap_lock instead of write side, which is the reason that
jemalloc/tcmalloc prefer to use madvise() to release physical memory.

>
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.

Even if it obtains the write side of mmap_lock, how do we make sure that
all the page table entries are empty? By traversing all 512 entries every time?
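
For context, the 512-entry scan being discussed would look roughly like the
sketch below (not from this series, not compile-tested; PTRS_PER_PTE is 512
on x86-64 with 4 KiB pages):

#include <linux/mm.h>

static bool pte_table_is_empty(struct mm_struct *mm, pmd_t *pmd,
                               unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;
        bool empty = true;
        int i;

        /* Map and lock the whole PTE table that covers this PMD range. */
        pte = pte_offset_map_lock(mm, pmd, addr & PMD_MASK, &ptl);
        for (i = 0; i < PTRS_PER_PTE; i++) {
                if (!pte_none(pte[i])) {
                        empty = false;
                        break;
                }
        }
        pte_unmap_unlock(pte, ptl);

        return empty;
}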

>
> Rather than adding a new "lock" to ever PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

Now each level of page table has its own spin lock. Can you explain the
working mechanism of this special rwsem more clearly?

If we can reduce the protection range of mmap_lock, that is indeed a great
thing, but I think it is very difficult, and it still would not solve the
problem of how to check that all entries in the page table page are
empty.

>
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
>
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

The reason why only the PTE pages are released now is that they are the
biggest consumer. This reference count could actually be used for other
levels of page tables as well.

>
> ?
>
> Jason
>

2021-11-10 14:03:38

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages



On 11/10/21 9:25 PM, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>
>>> In this patch series, we add a pte_refcount field to the struct page of page
>>> table to track how many users of PTE page table. Similar to the mechanism of
>>> page refcount, the user of PTE page table should hold a refcount to it before
>>> accessing. The PTE page table page will be freed when the last refcount is
>>> dropped.
>>
>> So, this approach basically adds two atomics on every PTE map
>>
>> If I have it right the reason that zap cannot clean the PTEs today is
>> because zap cannot obtain the mmap lock due to a lock ordering issue
>> with the inode lock vs mmap lock.
>
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> comming: a process page table walker or the rmap.
>
> The way locking currently works doesn't allow to remove a page table
> just by holding the mmap lock, not even in write mode. You'll also need
> to hold the respective rmap locks -- which implies that reclaiming apge
> tables crossing VMAs is "problematic". Take a look at khugepaged which
> has to play quite some tricks to remove a page table.
>
> And there are other ways we can create empty page tables via the rmap,
> like reclaim/writeback, although they are rather a secondary concern mostly.
>
>>
>> If it could obtain the mmap lock then it could do the zap using the
>> write side as unmapping a vma does.
>>
>> Rather than adding a new "lock" to ever PTE I wonder if it would be
>> more efficient to break up the mmap lock and introduce a specific
>> rwsem for the page table itself, in addition to the PTL. Currently the
>> mmap lock is protecting both the vma list and the page table.
>
> There is the rmap side of things as well. At least the rmap won't
> reclaim alloc/free page tables, but it will walk page tables while
> holding the respective rmap lock.
>
>>
>> I think that would allow the lock ordering issue to be resolved and
>> zap could obtain a page table rwsem.
>>
>> Compared to two atomics per PTE this would just be two atomic per
>> page table walk operation, it is conceptually a lot simpler, and would
>> allow freeing all the page table levels, not just PTEs.
>
> Another alternative is to not do it in the kernel automatically, but
> instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
> called by user space explicitly once it's reasonable. While this will
> work for the obvious madvise(DONTNEED) users -- like memory allocators
> -- that zap memory, it's a bit more complicated once shared memory is
> involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
> work for many use cases that want to optimize memory consumption for
> sparse memory mappings.
>
> Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
> will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
> are most certainly the most important part piece.
>

Totally agree!

Thanks,
Qi

2021-11-10 14:41:01

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH v3 14/15] Documentation: add document for pte_ref

Qi Zheng <[email protected]> writes:

> This commit adds document for pte_ref under `Documentation/vm/`.

Thanks for documenting this work!

> Signed-off-by: Qi Zheng <[email protected]>
> ---
> Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 212 insertions(+)
> create mode 100644 Documentation/vm/pte_ref.rst

When you add a new RST file, you also need to add it to the associated
index.rst file or it won't be included in the docs build. Instead,
you'll get the "not included in any toctree" warning that you surely saw
when you tested the docs build with this file :)

> diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
> new file mode 100644
> index 000000000000..c5323a263464
> --- /dev/null
> +++ b/Documentation/vm/pte_ref.rst
> @@ -0,0 +1,212 @@
> +.. _pte_ref:

Do you need this label anywhere? If not, I'd leave it out.

Thanks,

jon

2021-11-10 14:42:42

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
> > On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> >
> >> In this patch series, we add a pte_refcount field to the struct page of page
> >> table to track how many users of PTE page table. Similar to the mechanism of
> >> page refcount, the user of PTE page table should hold a refcount to it before
> >> accessing. The PTE page table page will be freed when the last refcount is
> >> dropped.
> >
> > So, this approach basically adds two atomics on every PTE map
> >
> > If I have it right the reason that zap cannot clean the PTEs today is
> > because zap cannot obtain the mmap lock due to a lock ordering issue
> > with the inode lock vs mmap lock.
>
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> comming: a process page table walker or the rmap.

AFAIK rmap is the same issue, it can't lock the mmap_sem

> The way locking currently works doesn't allow to remove a page table
> just by holding the mmap lock, not even in write mode.

I'm not sure I understand this. If the goal is to free the PTE tables
then the main concern is use-after-free on page table walkers (which
this series is addressing). Ignoring bugs, we have only three ways to
read the page table:

- Fully locked. Under the PTLs (gup slow is an example)
- Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
- hw-locked. Barriered with TLB flush (gup fast is an example)

#1 should be completely safe as the PTLs will protect everything
#2 is safe so long as the write side is held during any layout changes
#3 interacts with the TLB flush, and is also safe with zap

rmap itself is a #1 page table walker, i.e. it gets the PTLs under
page_vma_mapped_walk().

The sin we have committed here is that both the mmap lock and the PTLs
are being used to protect the page table itself with a very
complicated dual semantic.

Splitting the sleeping mmap lock into 'covers vma' and 'covers page
tables' lets us solve the lock ordering and semi-locked can become
more fully locked by the new lock, instead of by abusing mmap sem.

I'd suggest to make this new lock a special rwsem which allows either
concurrent read access OR concurrent PTL access, but not both. This
way we don't degrade performance of the split PTLs, *and* when
something needs to change the page table structure it has a way to
properly exclude all the #2 lockless readers.

So every touch to the page table starts by obtaining this new lock,
depending on the access mode to be used (PTL vs lockless read).

We can keep the existing THP logic where a leaf PMD can be transformed
to a non-leaf PMD in the semi-locked case, but the case where a
non-leaf PMD is transformed to a leaf PMD has to take the lock.

Jason

2021-11-10 15:40:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On 10.11.21 15:38, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
>> On 10.11.21 13:56, Jason Gunthorpe wrote:
>>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>>
>>>> In this patch series, we add a pte_refcount field to the struct page of page
>>>> table to track how many users of PTE page table. Similar to the mechanism of
>>>> page refcount, the user of PTE page table should hold a refcount to it before
>>>> accessing. The PTE page table page will be freed when the last refcount is
>>>> dropped.
>>>
>>> So, this approach basically adds two atomics on every PTE map
>>>
>>> If I have it right the reason that zap cannot clean the PTEs today is
>>> because zap cannot obtain the mmap lock due to a lock ordering issue
>>> with the inode lock vs mmap lock.
>>
>> There are different ways to zap: madvise(DONTNEED) vs
>> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
>> comming: a process page table walker or the rmap.
>
> AFAIK rmap is the same issue, it can't lock the mmap_sem
>
>> The way locking currently works doesn't allow to remove a page table
>> just by holding the mmap lock, not even in write mode.
>
> I'm not sure I understand this. If the goal is to free the PTE tables
> then the main concern is use-after free on page table walkers (which
> this series is addressing). Ignoring bugs, we have only three ways to
> read the page table:

Yes, use-after-free and reuse-while-freeing are the two challenges AFAIK.

>
> - Fully locked. Under the PTLs (gup slow is an example)
> - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
> - hw-locked. Barriered with TLB flush (gup fast is an example)

Three additions as far as I can tell:

1. Fully locked currently needs the read mmap lock OR the rmap lock in
read. PTLs on their own are not sufficient AFAICT.
2. #1 and #2 can currently only walk within VMA ranges.
3. We can theoretically walk page tables outside VMA ranges with the
write mmap lock, because page tables get removed with the mmap lock in
read mode and heavy-weight operations (VMA layout, khugepaged) are
performed under the write mmap lock.

The rmap locks protect from modifications where we want to exclude rmap
walkers similarly to how we grab the mmap lock in write, where the PTLs
are not sufficient.

See mm/mremap.c:move_ptes() as an example which performs VMA layout +
page table modifications. See khugepaged, which doesn't perform VMA layout
modifications but page table modifications.

>
> #1 should be completely safe as the PTLs will protect eveything
> #2 is safe so long as the write side is held during any layout changes
> #3 interacts with the TLB flush, and is also safe with zap
>
> rmap itself is a #1 page table walker, ie it gets the PTLs under
> page_vma_mapped_walk().

When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Because the PMD-PTLs are usually not taken in case we know there is a
page table (nothing would currently change it without heavy locking).
And if they are taken, they are only held while allocating/checking a
PMDE, not while actually *using* the page table that's mapped in that entry.

For example, walk_page_range() requires the mmap lock in read and grabs
the PTE-PTLs.

>
> The sin we have comitted here is that both the mmap lock and the PTLs
> are being used to protect the page table itself with a very
> complicated dual semantic.
>
> Splitting the sleeping mmap lock into 'covers vma' and 'covers page
> tables' lets us solve the lock ordering and semi-locked can become
> more fully locked by the new lock, instead of by abusing mmap sem.

It would still be fairly coarse-grained locking; I am not sure if that
is a step in the right direction. If you want to modify *some* page
table in your process you have to exclude each and every page table walker.
Or did I misinterpret what you were saying?

>
> I'd suggest to make this new lock a special rwsem which allows either
> concurrent read access OR concurrent PTL access, but not both. This

I looked into such a lock recently in a similar context and something like
that does not exist yet (and fairness will be challenging). You either
have a single writer or multiple readers. I'd be interested if someone
knows of something like that.


--
Thanks,

David / dhildenb

2021-11-10 16:39:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:

> > - Fully locked. Under the PTLs (gup slow is an example)
> > - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
> > - hw-locked. Barriered with TLB flush (gup fast is an example)
>
> Three additions as far as I can tell:
>
> 1. Fully locked currently needs the read mmap lock OR the rmap lock in
> read. PTLs on their own are not sufficient AFAIKT.

I think the reality is we don't have any fully locked walkers.. Even
gup slow is semi-lockless until it reaches lower levels, then it takes
the PTLs. (I forgot about that detail!)

The mmap lock is being used to protect the higher levels of the page
table. It is part of its complicated dual purpose.

> 2. #1 and #2 can currently only walk within VMA ranges.

AFAICT this is an artifact of re-using the mmap lock to protect the
page table and not being able to obtain the mmap lock in all the
places the page table structure is manipulated.

> 3. We can theoretically walk page tables outside VMA ranges with the
> write mmap lock, because page tables get removed with the mmap lock in
> read mode and heavy-weight operations (VMA layout, khugepaged) are
> performed under the write mmap lock.

Yes, again, an artifact of the current locking.

> The rmap locks protect from modifications where we want to exclude rmap
> walkers similarly to how we grab the mmap lock in write, where the PTLs
> are not sufficient.

It is the same kind of dual-purpose locking as the mmap sem :(

> > #1 should be completely safe as the PTLs will protect eveything
> > #2 is safe so long as the write side is held during any layout changes
> > #3 interacts with the TLB flush, and is also safe with zap
> >
> > rmap itself is a #1 page table walker, ie it gets the PTLs under
> > page_vma_mapped_walk().
>
> When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Ah, here I was thinking about a lock that can protect all the
levels. Today we are abusing the mmap lock to act as the pud_lock, for
instance.

> Because the PMD-PTLs re usually not taken in case we know there is a
> page table (nothing would currently change it without heavy
> locking).

This only works with the lockless walkers, and relies on the read mmap
sem/etc to also mean 'a PTE table cannot become a leaf PMD'

> For example, walk_page_range() requires the mmap lock in read and grabs
> the PTE-PTLs.

Yes, that is a semi-locked reader.

> It would still be a fairly coarse-grained locking, I am not sure if that
> is a step into the right direction. If you want to modify *some* page
> table in your process you have exclude each and every page table walker.
> Or did I mis-interpret what you were saying?

That is one possible design, it favours fast walking and penalizes
mutation. We could also stick a lock in the PMD (instead of a
refcount) and still logically be using a lock instead of a refcount
scheme. Remember modify here is "want to change a table pointer into a
leaf pointer" so it isn't an every day activity..

There is some advantage with this thinking because it harmonizes well
with the other stuff that wants to convert tables into leafs, but has
to deal with complicated locking.

On the other hand, refcounts are a degenerate kind of rwsem and only
help with freeing pages. It also puts more atomics in normal fast
paths since we are refcounting each PTE, not read locking the PMD.

Perhaps the ideal thing would be to stick a rwsem in the PMD. read
means a table cannot become a leaf. I don't know if there is space
for another atomic in the PMD level, and we'd have to use a hitching
post/hashed waitq scheme too since there surely isn't room for a waitq
too..

I wouldn't be so quick to say one is better than the other, but at
least let's have thought about a locking solution before merging
refcounts :)

Jason

2021-11-10 16:50:07

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> > I'd suggest to make this new lock a special rwsem which allows either
> > concurrent read access OR concurrent PTL access, but not both. This
>
> I looked into such a lock recently in similar context and something like
> that does not exist yet (and fairness will be challenging). You either
> have a single reader or multiple writer. I'd be interested if someone
> knows of something like that.

We've talked about having such a lock before for filesystems which want
to permit either many direct-IO accesses or many buffered-IO accesses, but
want to exclude a mixture of direct-IO and buffered-IO. The name we came
up with for such a lock was the red-blue lock. Either Team Red has the
lock, or Team Blue has the lock (or it's free). Indicate free with value
zero, Team Red with positive numbers and Team Blue with negative numbers.
If we need to indicate an opposing team is waiting for the semaphore,
we can use a high bit (1 << 30) to indicate no new team members can
acquire the lock. Not sure whether anybody ever coded it up.

2021-11-10 16:54:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On 10.11.21 17:49, Matthew Wilcox wrote:
> On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
>>> I'd suggest to make this new lock a special rwsem which allows either
>>> concurrent read access OR concurrent PTL access, but not both. This
>>
>> I looked into such a lock recently in similar context and something like
>> that does not exist yet (and fairness will be challenging). You either
>> have a single reader or multiple writer. I'd be interested if someone
>> knows of something like that.
>
> We've talked about having such a lock before for filesystems which want
> to permit either many direct-IO accesses or many buffered-IO accesses, but
> want to exclude a mixture of direct-IO and buffered-IO. The name we came
> up with for such a lock was the red-blue lock. Either Team Red has the
> lock, or Team Blue has the lock (or it's free). Indicate free with velue
> zero, Team Red with positive numbers and Team Blue with negative numbers.
> If we need to indicate an opposing team is waiting for the semaphore,
> we can use a high bit (1 << 30) to indicate no new team members can
> acquire the lock. Not sure whether anybody ever coded it up.

Interesting, thanks for sharing!

My excessive google search didn't reveal anything back then (~3 months
ago) -- only questions on popular coding websites asking essentially for
the same thing without any helpful replies. But maybe I used the wrong
keywords (e.g., "multiple reader, multiple writer", I certainly didn't
search for "red-blue lock", but I do like the name :) ).

Fairness might still be the biggest issue, but I am certainly no locking
expert.

--
Thanks,

David / dhildenb


2021-11-10 16:56:10

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On Wed, Nov 10, 2021 at 05:53:57PM +0100, David Hildenbrand wrote:
> On 10.11.21 17:49, Matthew Wilcox wrote:
> > On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> >>> I'd suggest to make this new lock a special rwsem which allows either
> >>> concurrent read access OR concurrent PTL access, but not both. This
> >>
> >> I looked into such a lock recently in similar context and something like
> >> that does not exist yet (and fairness will be challenging). You either
> >> have a single reader or multiple writer. I'd be interested if someone
> >> knows of something like that.
> >
> > We've talked about having such a lock before for filesystems which want
> > to permit either many direct-IO accesses or many buffered-IO accesses, but
> > want to exclude a mixture of direct-IO and buffered-IO. The name we came
> > up with for such a lock was the red-blue lock. Either Team Red has the
> > lock, or Team Blue has the lock (or it's free). Indicate free with velue
> > zero, Team Red with positive numbers and Team Blue with negative numbers.
> > If we need to indicate an opposing team is waiting for the semaphore,
> > we can use a high bit (1 << 30) to indicate no new team members can
> > acquire the lock. Not sure whether anybody ever coded it up.
>
> Interesting, thanks for sharing!
>
> My excessive google search didn't reveal anything back then (~3 months
> ago) -- only questions on popular coding websites asking essentially for
> the same thing without any helpful replies. But maybe I used the wrong
> keywords (e.g., "multiple reader, multiple writer", I certainly didn't
> search for "red-blue lock", but I do like the name :) ).
>
> Fairness might still be the biggest issue, but I am certainly no locking
> expert.

Fairness could use the same basic logic as the writer-preferred-over-reader
handling in the rwsem.

The atomic implementation would be with atomic_dec_unless_positive()
and atomic_inc_unless_negative(); without fairness it looks
straightforward.
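
To make that concrete, an unfair trylock-only sketch (illustration only, not
from the thread; it uses the encoding Matthew described -- 0 is free, positive
counts are red holders, negative counts are blue holders -- and ignores
waiters/fairness entirely):

#include <linux/atomic.h>

struct rb_lock {
        atomic_t cnt;   /* > 0: red holders, < 0: blue holders, 0: free */
};

/* Red side: only succeeds while no blue holder is present. */
static inline bool red_trylock(struct rb_lock *l)
{
        return atomic_inc_unless_negative(&l->cnt);
}

static inline void red_unlock(struct rb_lock *l)
{
        atomic_dec(&l->cnt);
}

/* Blue side: only succeeds while no red holder is present. */
static inline bool blue_trylock(struct rb_lock *l)
{
        return atomic_dec_unless_positive(&l->cnt);
}

static inline void blue_unlock(struct rb_lock *l)
{
        atomic_inc(&l->cnt);
}

A fair version would additionally need something like the waiter bit Matthew
mentions, plus a way to park and wake waiters.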

Jason

2021-11-10 17:37:56

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

>> It would still be a fairly coarse-grained locking, I am not sure if that
>> is a step into the right direction. If you want to modify *some* page
>> table in your process you have exclude each and every page table walker.
>> Or did I mis-interpret what you were saying?
>
> That is one possible design, it favours fast walking and penalizes
> mutation. We could also stick a lock in the PMD (instead of a
> refcount) and still logically be using a lock instead of a refcount
> scheme. Remember modify here is "want to change a table pointer into a
> leaf pointer" so it isn't an every day activity..

It will be if we reclaim an empty PTE page table somewhat frequently, as
soon as it turns empty. This not only happens when zapping, but also
during writeback/swapping. So while writing back / swapping you might be
left with empty page tables to reclaim.

Of course, this is the current approach. Another approach that doesn't
require additional refcounts is scanning page tables for empty ones and
reclaiming them. This scanning can either be triggered manually from
user space or automatically from the kernel.

>
> There is some advantage with this thinking because it harmonizes well
> with the other stuff that wants to convert tables into leafs, but has
> to deal with complicated locking.
>
> On the other hand, refcounts are a degenerate kind of rwsem and only
> help with freeing pages. It also puts more atomics in normal fast
> paths since we are refcounting each PTE, not read locking the PMD.
>
> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
> means a table cannot be come a leaf. I don't know if there is space
> for another atomic in the PMD level, and we'd have to use a hitching
> post/hashed waitq scheme too since there surely isn't room for a waitq
> too..
>
> I wouldn't be so quick to say one is better than the other, but at
> least let's have thought about a locking solution before merging
> refcounts :)

Yes, absolutely. I can see the beauty in the current approach, because
it just reclaims "automatically" once possible -- page table empty and
nobody is walking it. The downside is that it doesn't always make sense
to reclaim an empty page table immediately once it turns empty.

Also, it adds complexity for something that is only a problem in some
corner cases -- sparse memory mappings, especially relevant for some
memory allocators after freeing a lot of memory or running VMs with
memory ballooning after inflating the balloon. Some of these use cases
might be good with just triggering page table reclaim manually from user
space.

--
Thanks,

David / dhildenb


2021-11-10 17:49:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On Wed, Nov 10, 2021 at 06:37:46PM +0100, David Hildenbrand wrote:

> It will be if we somewhat frequent when reclaim an empty PTE page table
> as soon as it turns empty.

What do you think is the frequency of unlocked page table walks that
this would have to block on?

> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.

Right, this is why it would be nice if the complexity could address
more than one problem, like the existing complex locking around the
thp stuff..

Jason

2021-11-11 03:58:25

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages



On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>> is a step into the right direction. If you want to modify *some* page
>>> table in your process you have exclude each and every page table walker.
>>> Or did I mis-interpret what you were saying?
>>
>> That is one possible design, it favours fast walking and penalizes
>> mutation. We could also stick a lock in the PMD (instead of a
>> refcount) and still logically be using a lock instead of a refcount
>> scheme. Remember modify here is "want to change a table pointer into a
>> leaf pointer" so it isn't an every day activity..
>
> It will be if we somewhat frequent when reclaim an empty PTE page table
> as soon as it turns empty. This not only happens when zapping, but also
> during writeback/swapping. So while writing back / swapping you might be
> left with empty page tables to reclaim.
>
> Of course, this is the current approach. Another approach that doesn't
> require additional refcounts is scanning page tables for empty ones and
> reclaiming them. This scanning can either be triggered manually from
> user space or automatically from the kernel.

Whether we introduce a special rwsem or scan for empty page
tables, there are two problems as follows:

#1. When to trigger the scanning or releasing?
#2. Every time we release a 4K page table page, 512 page table
entries need to be scanned.

For #1, if the scanning is triggered manually from user space, the
kernel is relatively passive, and the user does not fully know the best
timing to scan. If the scanning is triggered automatically from the
kernel, that is great. But the timing is not easy to determine: is it
scanned and reclaimed on every zap or try_to_unmap?

For #2, refcount has advantages.

>
>>
>> There is some advantage with this thinking because it harmonizes well
>> with the other stuff that wants to convert tables into leafs, but has
>> to deal with complicated locking.
>>
>> On the other hand, refcounts are a degenerate kind of rwsem and only
>> help with freeing pages. It also puts more atomics in normal fast
>> paths since we are refcounting each PTE, not read locking the PMD.
>>
>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>> means a table cannot be come a leaf. I don't know if there is space
>> for another atomic in the PMD level, and we'd have to use a hitching
>> post/hashed waitq scheme too since there surely isn't room for a waitq
>> too..
>>
>> I wouldn't be so quick to say one is better than the other, but at
>> least let's have thought about a locking solution before merging
>> refcounts :)
>
> Yes, absolutely. I can see the beauty in the current approach, because
> it just reclaims "automatically" once possible -- page table empty and
> nobody is walking it. The downside is that it doesn't always make sense
> to reclaim an empty page table immediately once it turns empty.
>
> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.
>

Yes, this is indeed a problem. Perhaps some flags can be introduced so
that the release of page table pages can be delayed in some cases,
similar to the lazyfree mechanism in MADV_FREE?

Thanks,
Qi

2021-11-11 05:40:25

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 14/15] Documentation: add document for pte_ref



On 11/10/21 10:39 PM, Jonathan Corbet wrote:
> Qi Zheng <[email protected]> writes:
>
>> This commit adds document for pte_ref under `Documentation/vm/`.
>
> Thanks for documenting this work!
>
>> Signed-off-by: Qi Zheng <[email protected]>
>> ---
>> Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 212 insertions(+)
>> create mode 100644 Documentation/vm/pte_ref.rst
>
> When you add a new RST file, you also need to add it to the associated
> index.rst file or it won't be included in the docs build. Instead,
> you'll get the "not included in any toctree" warning that you surely saw
> when you tested the docs build with this file :)

OK, I will add it to the associated index.rst in the next version.

>
>> diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
>> new file mode 100644
>> index 000000000000..c5323a263464
>> --- /dev/null
>> +++ b/Documentation/vm/pte_ref.rst
>> @@ -0,0 +1,212 @@
>> +.. _pte_ref:
>
> Do you need this label anywhere? If not, I'd leave it out.

I will remove this label in the next version.

Thanks,
Qi

>
> Thanks,
>
> jon
>

2021-11-11 09:22:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On 11.11.21 04:58, Qi Zheng wrote:
>
>
> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>> is a step into the right direction. If you want to modify *some* page
>>>> table in your process you have exclude each and every page table walker.
>>>> Or did I mis-interpret what you were saying?
>>>
>>> That is one possible design, it favours fast walking and penalizes
>>> mutation. We could also stick a lock in the PMD (instead of a
>>> refcount) and still logically be using a lock instead of a refcount
>>> scheme. Remember modify here is "want to change a table pointer into a
>>> leaf pointer" so it isn't an every day activity..
>>
>> It will be if we somewhat frequent when reclaim an empty PTE page table
>> as soon as it turns empty. This not only happens when zapping, but also
>> during writeback/swapping. So while writing back / swapping you might be
>> left with empty page tables to reclaim.
>>
>> Of course, this is the current approach. Another approach that doesn't
>> require additional refcounts is scanning page tables for empty ones and
>> reclaiming them. This scanning can either be triggered manually from
>> user space or automatically from the kernel.
>
> Whether it is introducing a special rwsem or scanning an empty page
> table, there are two problems as follows:
>
> #1. When to trigger the scanning or releasing?

For example when reclaiming memory, when scanning page tables in
khugepaged, or triggered by user space (note that this is the approach I
originally looked into). But it certainly requires more locking thought
to avoid stopping essentially any page table walker.

> #2. Every time to release a 4K page table page, 512 page table
> entries need to be scanned.

It would happen only when we actually trigger reclaim of page tables
(again, someone has to trigger it), so it's barely an issue.

For example, khugepaged already scans the page tables either way.

>
> For #1, if the scanning is triggered manually from user space, the
> kernel is relatively passive, and the user does not fully know the best
> timing to scan. If the scanning is triggered automatically from the
> kernel, that is great. But the timing is not easy to confirm, is it
> scanned and reclaimed every time zap or try_to_unmap?
>
> For #2, refcount has advantages.
>
>>
>>>
>>> There is some advantage with this thinking because it harmonizes well
>>> with the other stuff that wants to convert tables into leafs, but has
>>> to deal with complicated locking.
>>>
>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>> help with freeing pages. It also puts more atomics in normal fast
>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>
>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>> means a table cannot be come a leaf. I don't know if there is space
>>> for another atomic in the PMD level, and we'd have to use a hitching
>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>> too..
>>>
>>> I wouldn't be so quick to say one is better than the other, but at
>>> least let's have thought about a locking solution before merging
>>> refcounts :)
>>
>> Yes, absolutely. I can see the beauty in the current approach, because
>> it just reclaims "automatically" once possible -- page table empty and
>> nobody is walking it. The downside is that it doesn't always make sense
>> to reclaim an empty page table immediately once it turns empty.
>>
>> Also, it adds complexity for something that is only a problem in some
>> corner cases -- sparse memory mappings, especially relevant for some
>> memory allocators after freeing a lot of memory or running VMs with
>> memory ballooning after inflating the balloon. Some of these use cases
>> might be good with just triggering page table reclaim manually from user
>> space.
>>
>
> Yes, this is indeed a problem. Perhaps some flags can be introduced so
> that the release of page table pages can be delayed in some cases.
> Similar to the lazyfree mechanism in MADV_FREE?

The issue AFAIU is that once your refcount hits 0 (no more references,
no more entries), the longer you delay the reclaim, the longer others
have to wait to populate a fresh page table because the "page table
to be reclaimed" is still stuck around. You'd have to keep the refcount
increased for a while, and only drop it after a while. But when? And
how? IMHO it's not trivial, but maybe there is an easy way to achieve it.


--
Thanks,

David / dhildenb


2021-11-11 11:08:48

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages



On 11/11/21 5:22 PM, David Hildenbrand wrote:
> On 11.11.21 04:58, Qi Zheng wrote:
>>
>>
>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>> is a step into the right direction. If you want to modify *some* page
>>>>> table in your process you have exclude each and every page table walker.
>>>>> Or did I mis-interpret what you were saying?
>>>>
>>>> That is one possible design, it favours fast walking and penalizes
>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>> refcount) and still logically be using a lock instead of a refcount
>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>> leaf pointer" so it isn't an every day activity..
>>>
>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>> as soon as it turns empty. This not only happens when zapping, but also
>>> during writeback/swapping. So while writing back / swapping you might be
>>> left with empty page tables to reclaim.
>>>
>>> Of course, this is the current approach. Another approach that doesn't
>>> require additional refcounts is scanning page tables for empty ones and
>>> reclaiming them. This scanning can either be triggered manually from
>>> user space or automatically from the kernel.
>>
>> Whether it is introducing a special rwsem or scanning an empty page
>> table, there are two problems as follows:
>>
>> #1. When to trigger the scanning or releasing?
>
> For example when reclaiming memory, when scanning page tables in
> khugepaged, or triggered by user space (note that this is the approach I
> originally looked into). But it certainly requires more locking thought
> to avoid stopping essentially any page table walker.
>
>> #2. Every time to release a 4K page table page, 512 page table
>> entries need to be scanned.
>
> It would happen only when actually trigger reclaim of page tables
> (again, someone has to trigger it), so it's barely an issue.
>
> For example, khugepaged already scans the page tables either way.
>
>>
>> For #1, if the scanning is triggered manually from user space, the
>> kernel is relatively passive, and the user does not fully know the best
>> timing to scan. If the scanning is triggered automatically from the
>> kernel, that is great. But the timing is not easy to confirm, is it
>> scanned and reclaimed every time zap or try_to_unmap?
>>
>> For #2, refcount has advantages.
>>
>>>
>>>>
>>>> There is some advantage with this thinking because it harmonizes well
>>>> with the other stuff that wants to convert tables into leafs, but has
>>>> to deal with complicated locking.
>>>>
>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>> help with freeing pages. It also puts more atomics in normal fast
>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>
>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>> means a table cannot be come a leaf. I don't know if there is space
>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>> too..
>>>>
>>>> I wouldn't be so quick to say one is better than the other, but at
>>>> least let's have thought about a locking solution before merging
>>>> refcounts :)
>>>
>>> Yes, absolutely. I can see the beauty in the current approach, because
>>> it just reclaims "automatically" once possible -- page table empty and
>>> nobody is walking it. The downside is that it doesn't always make sense
>>> to reclaim an empty page table immediately once it turns empty.
>>>
>>> Also, it adds complexity for something that is only a problem in some
>>> corner cases -- sparse memory mappings, especially relevant for some
>>> memory allocators after freeing a lot of memory or running VMs with
>>> memory ballooning after inflating the balloon. Some of these use cases
>>> might be good with just triggering page table reclaim manually from user
>>> space.
>>>
>>
>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>> that the release of page table pages can be delayed in some cases.
>> Similar to the lazyfree mechanism in MADV_FREE?
>
> The issue AFAIU is that once your refcount hits 0 (no more references,
> no more entries), the longer you wait with reclaim, the longer others
> have to wait for populating a fresh page table because the "page table
> to be reclaimed" is still stuck around. You'd have to keep the refcount
> increased for a while, and only drop it after a while. But when? And
> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>

For running VMs with memory ballooning after inflating the balloon, is
this a hot path? Even if it is, it already has to deal with the release and
reallocation of physical pages. The overhead after introducing
pte_refcount is that we also need to release and re-allocate the page table
page. But 2MB of physical pages corresponds to only 4KiB of PTE page table
pages. So maybe the overhead is not big.

In fact, the performance test shown in the cover letter covers this case:

test program:
https://lore.kernel.org/lkml/[email protected]/2-multi-fault-all.c

Thanks,
Qi

>

2021-11-11 11:19:28

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On 11.11.21 12:08, Qi Zheng wrote:
>
>
> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>> On 11.11.21 04:58, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>> table in your process you have exclude each and every page table walker.
>>>>>> Or did I mis-interpret what you were saying?
>>>>>
>>>>> That is one possible design, it favours fast walking and penalizes
>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>> leaf pointer" so it isn't an every day activity..
>>>>
>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>> during writeback/swapping. So while writing back / swapping you might be
>>>> left with empty page tables to reclaim.
>>>>
>>>> Of course, this is the current approach. Another approach that doesn't
>>>> require additional refcounts is scanning page tables for empty ones and
>>>> reclaiming them. This scanning can either be triggered manually from
>>>> user space or automatically from the kernel.
>>>
>>> Whether it is introducing a special rwsem or scanning an empty page
>>> table, there are two problems as follows:
>>>
>>> #1. When to trigger the scanning or releasing?
>>
>> For example when reclaiming memory, when scanning page tables in
>> khugepaged, or triggered by user space (note that this is the approach I
>> originally looked into). But it certainly requires more locking thought
>> to avoid stopping essentially any page table walker.
>>
>>> #2. Every time to release a 4K page table page, 512 page table
>>> entries need to be scanned.
>>
>> It would happen only when actually trigger reclaim of page tables
>> (again, someone has to trigger it), so it's barely an issue.
>>
>> For example, khugepaged already scans the page tables either way.
>>
>>>
>>> For #1, if the scanning is triggered manually from user space, the
>>> kernel is relatively passive, and the user does not fully know the best
>>> timing to scan. If the scanning is triggered automatically from the
>>> kernel, that is great. But the timing is not easy to confirm, is it
>>> scanned and reclaimed every time zap or try_to_unmap?
>>>
>>> For #2, refcount has advantages.
>>>
>>>>
>>>>>
>>>>> There is some advantage with this thinking because it harmonizes well
>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>> to deal with complicated locking.
>>>>>
>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>
>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>> too..
>>>>>
>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>> least let's have thought about a locking solution before merging
>>>>> refcounts :)
>>>>
>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>> it just reclaims "automatically" once possible -- page table empty and
>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>> to reclaim an empty page table immediately once it turns empty.
>>>>
>>>> Also, it adds complexity for something that is only a problem in some
>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>> memory allocators after freeing a lot of memory or running VMs with
>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>> might be good with just triggering page table reclaim manually from user
>>>> space.
>>>>
>>>
>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>> that the release of page table pages can be delayed in some cases.
>>> Similar to the lazyfree mechanism in MADV_FREE?
>>
>> The issue AFAIU is that once your refcount hits 0 (no more references,
>> no more entries), the longer you wait with reclaim, the longer others
>> have to wait for populating a fresh page table because the "page table
>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>> increased for a while, and only drop it after a while. But when? And
>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>
>
> For running VMs with memory ballooning after inflating the balloon, is
> this a hot behavior? Even if it is, it is already facing the release and
> reallocation of physical pages. The overhead after introducing
> pte_refcount is that we need to release and re-allocate page table page.
> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
> So maybe the overhead is not big.

The cases that come to my mind are

a) Swapping on shared memory with concurrent access
b) Reclaim on file-backed memory with concurrent access
c) Free page reporting as implemented by virtio-balloon

In all of these cases, you can have someone immediately re-access the
page table and re-populate it.

For something mostly static (balloon inflation, memory allocator), it's
not that big of a deal I guess.

--
Thanks,

David / dhildenb


2021-11-11 12:00:51

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages



On 11/11/21 7:19 PM, David Hildenbrand wrote:
> On 11.11.21 12:08, Qi Zheng wrote:
>>
>>
>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>
>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>
>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>> left with empty page tables to reclaim.
>>>>>
>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>> user space or automatically from the kernel.
>>>>
>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>> table, there are two problems as follows:
>>>>
>>>> #1. When to trigger the scanning or releasing?
>>>
>>> For example when reclaiming memory, when scanning page tables in
>>> khugepaged, or triggered by user space (note that this is the approach I
>>> originally looked into). But it certainly requires more locking thought
>>> to avoid stopping essentially any page table walker.
>>>
>>>> #2. Every time to release a 4K page table page, 512 page table
>>>> entries need to be scanned.
>>>
>>> It would happen only when actually trigger reclaim of page tables
>>> (again, someone has to trigger it), so it's barely an issue.
>>>
>>> For example, khugepaged already scans the page tables either way.
>>>
>>>>
>>>> For #1, if the scanning is triggered manually from user space, the
>>>> kernel is relatively passive, and the user does not fully know the best
>>>> timing to scan. If the scanning is triggered automatically from the
>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>
>>>> For #2, refcount has advantages.
>>>>
>>>>>
>>>>>>
>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>> to deal with complicated locking.
>>>>>>
>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>
>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>> too..
>>>>>>
>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>> least let's have thought about a locking solution before merging
>>>>>> refcounts :)
>>>>>
>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>
>>>>> Also, it adds complexity for something that is only a problem in some
>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>> might be good with just triggering page table reclaim manually from user
>>>>> space.
>>>>>
>>>>
>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>> that the release of page table pages can be delayed in some cases.
>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>
>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>> no more entries), the longer you wait with reclaim, the longer others
>>> have to wait for populating a fresh page table because the "page table
>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>> increased for a while, and only drop it after a while. But when? And
>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>
>>
>> For running VMs with memory ballooning after inflating the balloon, is
>> this a hot behavior? Even if it is, it is already facing the release and
>> reallocation of physical pages. The overhead after introducing
>> pte_refcount is that we need to release and re-allocate page table page.
>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>> So maybe the overhead is not big.
>
> The cases that come to my mind are
>
> a) Swapping on shared memory with concurrent access
> b) Reclaim on file-backed memory with concurrent access
> c) Free page reporting as implemented by virtio-balloon
>
> In all of these cases, you can have someone immediately re-access the
> page table and re-populate it.

In the performance test shown in the cover letter, we repeatedly performed
touch and madvise(MADV_DONTNEED) operations, which simulates the cases
you describe above.

We did find a small performance regression, but I think it is
acceptable, and no new perf hotspots appeared.

>
> For something mostly static (balloon inflation, memory allocator), it's
> not that big of a deal I guess.
>

2021-11-11 12:20:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On 11.11.21 13:00, Qi Zheng wrote:
>
>
> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>> On 11.11.21 12:08, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>
>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>
>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>> left with empty page tables to reclaim.
>>>>>>
>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>> user space or automatically from the kernel.
>>>>>
>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>> table, there are two problems as follows:
>>>>>
>>>>> #1. When to trigger the scanning or releasing?
>>>>
>>>> For example when reclaiming memory, when scanning page tables in
>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>> originally looked into). But it certainly requires more locking thought
>>>> to avoid stopping essentially any page table walker.
>>>>
>>>>> #2. Every time to release a 4K page table page, 512 page table
>>>>> entries need to be scanned.
>>>>
>>>> It would happen only when actually trigger reclaim of page tables
>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>
>>>> For example, khugepaged already scans the page tables either way.
>>>>
>>>>>
>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>
>>>>> For #2, refcount has advantages.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>> to deal with complicated locking.
>>>>>>>
>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>
>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>> too..
>>>>>>>
>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>> least let's have thought about a locking solution before merging
>>>>>>> refcounts :)
>>>>>>
>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>
>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>> space.
>>>>>>
>>>>>
>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>> that the release of page table pages can be delayed in some cases.
>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>
>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>> no more entries), the longer you wait with reclaim, the longer others
>>>> have to wait for populating a fresh page table because the "page table
>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>> increased for a while, and only drop it after a while. But when? And
>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>
>>>
>>> For running VMs with memory ballooning after inflating the balloon, is
>>> this a hot behavior? Even if it is, it is already facing the release and
>>> reallocation of physical pages. The overhead after introducing
>>> pte_refcount is that we need to release and re-allocate page table page.
>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>> So maybe the overhead is not big.
>>
>> The cases that come to my mind are
>>
>> a) Swapping on shared memory with concurrent access
>> b) Reclaim on file-backed memory with concurrent access
>> c) Free page reporting as implemented by virtio-balloon
>>
>> In all of these cases, you can have someone immediately re-access the
>> page table and re-populate it.
>
> In the performance test shown on the cover, we repeatedly performed
> touch and madvise(MADV_DONTNEED) actions, which simulated the case
> you said above.
>
> We did find a small amount of performance regression, but I think it is
> acceptable, and no new perf hotspots have been added.

That test always accesses 2MiB and does it from a single thread. Things
might (IMHO will) look different when only accessing individual pages
and doing the access from one or multiple separate threads. That's
essentially what a), b) and c) do; they don't follow the pattern you
measured. What you measured rather matches a typical memory allocator.


--
Thanks,

David / dhildenb


2021-11-11 12:32:51

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages



On 11/11/21 8:20 PM, David Hildenbrand wrote:
> On 11.11.21 13:00, Qi Zheng wrote:
>>
>>
>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>
>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>
>>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>> left with empty page tables to reclaim.
>>>>>>>
>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>> user space or automatically from the kernel.
>>>>>>
>>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>>> table, there are two problems as follows:
>>>>>>
>>>>>> #1. When to trigger the scanning or releasing?
>>>>>
>>>>> For example when reclaiming memory, when scanning page tables in
>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>> originally looked into). But it certainly requires more locking thought
>>>>> to avoid stopping essentially any page table walker.
>>>>>
>>>>>> #2. Every time to release a 4K page table page, 512 page table
>>>>>> entries need to be scanned.
>>>>>
>>>>> It would happen only when actually trigger reclaim of page tables
>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>
>>>>> For example, khugepaged already scans the page tables either way.
>>>>>
>>>>>>
>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>>
>>>>>> For #2, refcount has advantages.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>> to deal with complicated locking.
>>>>>>>>
>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>
>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>> too..
>>>>>>>>
>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>> refcounts :)
>>>>>>>
>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>
>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>> space.
>>>>>>>
>>>>>>
>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>
>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>> have to wait for populating a fresh page table because the "page table
>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>> increased for a while, and only drop it after a while. But when? And
>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>
>>>>
>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>> this a hot behavior? Even if it is, it is already facing the release and
>>>> reallocation of physical pages. The overhead after introducing
>>>> pte_refcount is that we need to release and re-allocate page table page.
>>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>>> So maybe the overhead is not big.
>>>
>>> The cases that come to my mind are
>>>
>>> a) Swapping on shared memory with concurrent access
>>> b) Reclaim on file-backed memory with concurrent access
>>> c) Free page reporting as implemented by virtio-balloon
>>>
>>> In all of these cases, you can have someone immediately re-access the
>>> page table and re-populate it.
>>
>> In the performance test shown on the cover, we repeatedly performed
>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>> you said above.
>>
>> We did find a small amount of performance regression, but I think it is
>> acceptable, and no new perf hotspots have been added.
>
> That test always accesses 2MiB and does it from a single thread. Things
> might (IMHO will) look different when only accessing individual pages
> and doing the access from one/multiple separate threads (that's what

No, it includes multi-threading:

while (1) {
        char *c;
        char *start = mmap_area[cpu];
        char *end = mmap_area[cpu] + FAULT_LENGTH;
        pthread_barrier_wait(&barrier);
        //printf("fault into %p-%p\n",start, end);

        for (c = start; c < end; c += PAGE_SIZE)
                *c = 0;

        pthread_barrier_wait(&barrier);
        for (i = 0; cpu==0 && i < num; i++)
                madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
        pthread_barrier_wait(&barrier);
}

The thread on cpu0 uses madvise(MADV_DONTNEED) to release the physical
memory touched by the threads on the other CPUs.

> a),b) and c) essentially do, they don't do it in the pattern you
> measured. what you measured matches rather a typical memory allocator).
>
>

2021-11-11 12:51:43

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages

On 11.11.21 13:32, Qi Zheng wrote:
>
>
> On 11/11/21 8:20 PM, David Hildenbrand wrote:
>> On 11.11.21 13:00, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>>
>>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>>
>>>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>>> left with empty page tables to reclaim.
>>>>>>>>
>>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>>> user space or automatically from the kernel.
>>>>>>>
>>>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>>>> table, there are two problems as follows:
>>>>>>>
>>>>>>> #1. When to trigger the scanning or releasing?
>>>>>>
>>>>>> For example when reclaiming memory, when scanning page tables in
>>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>>> originally looked into). But it certainly requires more locking thought
>>>>>> to avoid stopping essentially any page table walker.
>>>>>>
>>>>>>> #2. Every time to release a 4K page table page, 512 page table
>>>>>>> entries need to be scanned.
>>>>>>
>>>>>> It would happen only when actually trigger reclaim of page tables
>>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>>
>>>>>> For example, khugepaged already scans the page tables either way.
>>>>>>
>>>>>>>
>>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>>>
>>>>>>> For #2, refcount has advantages.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>>> to deal with complicated locking.
>>>>>>>>>
>>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>>
>>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>>> too..
>>>>>>>>>
>>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>>> refcounts :)
>>>>>>>>
>>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>>
>>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>>> space.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>>
>>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>>> have to wait for populating a fresh page table because the "page table
>>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>>> increased for a while, and only drop it after a while. But when? And
>>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>>
>>>>>
>>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>>> this a hot behavior? Even if it is, it is already facing the release and
>>>>> reallocation of physical pages. The overhead after introducing
>>>>> pte_refcount is that we need to release and re-allocate page table page.
>>>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>>>> So maybe the overhead is not big.
>>>>
>>>> The cases that come to my mind are
>>>>
>>>> a) Swapping on shared memory with concurrent access
>>>> b) Reclaim on file-backed memory with concurrent access
>>>> c) Free page reporting as implemented by virtio-balloon
>>>>
>>>> In all of these cases, you can have someone immediately re-access the
>>>> page table and re-populate it.
>>>
>>> In the performance test shown on the cover, we repeatedly performed
>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>> you said above.
>>>
>>> We did find a small amount of performance regression, but I think it is
>>> acceptable, and no new perf hotspots have been added.
>>
>> That test always accesses 2MiB and does it from a single thread. Things
>> might (IMHO will) look different when only accessing individual pages
>> and doing the access from one/multiple separate threads (that's what
>
> No, it includes multi-threading:
>

Oh sorry, I totally skipped [2].

> while (1) {
> char *c;
> char *start = mmap_area[cpu];
> char *end = mmap_area[cpu] + FAULT_LENGTH;
> pthread_barrier_wait(&barrier);
> //printf("fault into %p-%p\n",start, end);
>
> for (c = start; c < end; c += PAGE_SIZE)
> *c = 0;
>
> pthread_barrier_wait(&barrier);
> for (i = 0; cpu==0 && i < num; i++)
> madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
> pthread_barrier_wait(&barrier);
> }
>
> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
> memory of threads on other cpu.
>

I'll have a more detailed look at the benchmark. At a quick glance, it
looks like each thread is also accessing a full 2MiB range, one page at
a time, and one thread is zapping the whole 2MiB range. A single CPU
only accesses memory within one 2MiB range IIRC.

Having multiple threads just access individual pages within a single
2MiB region, and having one thread zap that memory (e.g., to simulate
swapout), could be another benchmark.

We have to make sure to run with THP disabled (e.g., using
madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
eventually), because otherwise you might just be populating+zapping THPs
if they would otherwise be allowed in the environment.
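
To make that concrete, a minimal sketch of such a benchmark could look like
the following. This is only an illustration, not the benchmark from [2]:
REGION_SIZE, NR_THREADS and the barrier scheme are assumptions, the touch
and zap phases alternate rather than racing, and error handling is omitted
(build with -pthread).

/*
 * Sketch only: several threads touch individual pages of one shared
 * 2MiB region (so they all share a single PTE page table), while one
 * thread zaps the whole region with MADV_DONTNEED. The mapping is
 * marked MADV_NOHUGEPAGE so THP cannot hide the PTE-level behaviour.
 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE     (2UL << 20)
#define NR_THREADS      4

static char *region;
static pthread_barrier_t barrier;

static void *toucher(void *arg)
{
        long idx = (long)arg;
        unsigned long page = sysconf(_SC_PAGESIZE);
        unsigned long off;

        for (;;) {
                pthread_barrier_wait(&barrier);
                /* touch every NR_THREADS-th page, interleaved between threads */
                for (off = idx * page; off < REGION_SIZE; off += NR_THREADS * page)
                        region[off] = 0;
                pthread_barrier_wait(&barrier);
        }
        return NULL;
}

static void *zapper(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_barrier_wait(&barrier);
                pthread_barrier_wait(&barrier);
                /* "swapout": zap the whole region, then the touchers refault it */
                madvise(region, REGION_SIZE, MADV_DONTNEED);
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NR_THREADS + 1];
        long i;

        region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        madvise(region, REGION_SIZE, MADV_NOHUGEPAGE);

        pthread_barrier_init(&barrier, NULL, NR_THREADS + 1);
        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&threads[i], NULL, toucher, (void *)i);
        pthread_create(&threads[NR_THREADS], NULL, zapper, NULL);

        pthread_join(threads[0], NULL); /* loops forever */
        return 0;
}

Letting the zapper run concurrently with the touchers, or zapping only part
of the region, would get even closer to the reclaim cases a), b) and c) above.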

--
Thanks,

David / dhildenb


2021-11-11 13:01:52

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages



On 11/11/21 8:51 PM, David Hildenbrand wrote:

>>>>
>>>> In the performance test shown on the cover, we repeatedly performed
>>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>>> you said above.
>>>>
>>>> We did find a small amount of performance regression, but I think it is
>>>> acceptable, and no new perf hotspots have been added.
>>>
>>> That test always accesses 2MiB and does it from a single thread. Things
>>> might (IMHO will) look different when only accessing individual pages
>>> and doing the access from one/multiple separate threads (that's what
>>
>> No, it includes multi-threading:
>>
>
> Oh sorry, I totally skipped [2].
>
>> while (1) {
>> char *c;
>> char *start = mmap_area[cpu];
>> char *end = mmap_area[cpu] + FAULT_LENGTH;
>> pthread_barrier_wait(&barrier);
>> //printf("fault into %p-%p\n",start, end);
>>
>> for (c = start; c < end; c += PAGE_SIZE)
>> *c = 0;
>>
>> pthread_barrier_wait(&barrier);
>> for (i = 0; cpu==0 && i < num; i++)
>> madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
>> pthread_barrier_wait(&barrier);
>> }
>>
>> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
>> memory of threads on other cpu.
>>
>
> I'll have a more detailed look at the benchmark. On a quick glimpse,

Thank you for your time :)

> looks like the threads are also accessing a full 2MiB range, one page at
> a time, and one thread is zapping the whole 2MiB range. A single CPU
> only accesses memory within one 2MiB range IIRC.
>
> Having multiple threads just access individual pages within a single 2
> MiB region, and having one thread zap that memory (e.g., simulate
> swapout) could be another benchmark.

LGTM, I will simulate more scenarios for testing.

>
> We have to make sure to run with THP disabled (e.g., using
> madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
> eventually), because otherwise you might just be populating+zapping THPs
> if they would otherwise be allowed in the environment.

Yes, I turned off THP during testing:

root@~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

>

--
Thanks,
Qi

2021-11-11 13:47:44

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper

Hi Qi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on tip/perf/core tip/x86/core linus/master v5.15 next-20211111]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
base: https://github.com/hnaz/linux-mm master
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/ce86336fbabb116520ad01162faf5c8d4a1ce124
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
git checkout ce86336fbabb116520ad01162faf5c8d4a1ce124
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=ia64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/memory.c: In function 'copy_pmd_range':
>> mm/memory.c:1149:21: error: implicit declaration of function 'is_huge_pmd'; did you mean 'is_hugepd'? [-Werror=implicit-function-declaration]
1149 | if (is_huge_pmd(*src_pmd)) {
| ^~~~~~~~~~~
| is_hugepd
In file included from <command-line>:
In function 'zap_pmd_range',
inlined from 'zap_pud_range' at mm/memory.c:1499:10,
inlined from 'zap_p4d_range' at mm/memory.c:1520:10,
inlined from 'unmap_page_range' at mm/memory.c:1541:10:
include/linux/compiler_types.h:335:45: error: call to '__compiletime_assert_304' declared with attribute error: BUILD_BUG failed
335 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:316:25: note: in definition of macro '__compiletime_assert'
316 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:335:9: note: in expansion of macro '_compiletime_assert'
335 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
| ^~~~~~~~~~~~~~~~
include/linux/huge_mm.h:328:27: note: in expansion of macro 'BUILD_BUG'
328 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
| ^~~~~~~~~
mm/memory.c:1444:44: note: in expansion of macro 'HPAGE_PMD_SIZE'
1444 | if (next - addr != HPAGE_PMD_SIZE)
| ^~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
mm/mprotect.c: In function 'change_pmd_range':
>> mm/mprotect.c:260:21: error: implicit declaration of function 'is_huge_pmd'; did you mean 'is_hugepd'? [-Werror=implicit-function-declaration]
260 | if (is_huge_pmd(*pmd)) {
| ^~~~~~~~~~~
| is_hugepd
In file included from <command-line>:
In function 'change_pmd_range',
inlined from 'change_pud_range' at mm/mprotect.c:307:12,
inlined from 'change_p4d_range' at mm/mprotect.c:327:12,
inlined from 'change_protection_range' at mm/mprotect.c:352:12:
include/linux/compiler_types.h:335:45: error: call to '__compiletime_assert_298' declared with attribute error: BUILD_BUG failed
335 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:316:25: note: in definition of macro '__compiletime_assert'
316 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:335:9: note: in expansion of macro '_compiletime_assert'
335 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
| ^~~~~~~~~~~~~~~~
include/linux/huge_mm.h:328:27: note: in expansion of macro 'BUILD_BUG'
328 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
| ^~~~~~~~~
mm/mprotect.c:261:44: note: in expansion of macro 'HPAGE_PMD_SIZE'
261 | if (next - addr != HPAGE_PMD_SIZE) {
| ^~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
mm/mremap.c: In function 'move_page_tables':
>> mm/mremap.c:535:21: error: implicit declaration of function 'is_huge_pmd'; did you mean 'is_hugepd'? [-Werror=implicit-function-declaration]
535 | if (is_huge_pmd(*old_pmd)) {
| ^~~~~~~~~~~
| is_hugepd
In file included from <command-line>:
include/linux/compiler_types.h:335:45: error: call to '__compiletime_assert_304' declared with attribute error: BUILD_BUG failed
335 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:316:25: note: in definition of macro '__compiletime_assert'
316 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:335:9: note: in expansion of macro '_compiletime_assert'
335 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
| ^~~~~~~~~~~~~~~~
include/linux/huge_mm.h:328:27: note: in expansion of macro 'BUILD_BUG'
328 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
| ^~~~~~~~~
mm/mremap.c:536:39: note: in expansion of macro 'HPAGE_PMD_SIZE'
536 | if (extent == HPAGE_PMD_SIZE &&
| ^~~~~~~~~~~~~~
cc1: some warnings being treated as errors


vim +1149 mm/memory.c

  1132	
  1133	static inline int
  1134	copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
  1135	               pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
  1136	               unsigned long end)
  1137	{
  1138	        struct mm_struct *dst_mm = dst_vma->vm_mm;
  1139	        struct mm_struct *src_mm = src_vma->vm_mm;
  1140	        pmd_t *src_pmd, *dst_pmd;
  1141	        unsigned long next;
  1142	
  1143	        dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
  1144	        if (!dst_pmd)
  1145	                return -ENOMEM;
  1146	        src_pmd = pmd_offset(src_pud, addr);
  1147	        do {
  1148	                next = pmd_addr_end(addr, end);
> 1149	                if (is_huge_pmd(*src_pmd)) {
  1150	                        int err;
  1151	                        VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
  1152	                        err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
  1153	                                            addr, dst_vma, src_vma);
  1154	                        if (err == -ENOMEM)
  1155	                                return -ENOMEM;
  1156	                        if (!err)
  1157	                                continue;
  1158	                        /* fall through */
  1159	                }
  1160	                if (pmd_none_or_clear_bad(src_pmd))
  1161	                        continue;
  1162	                if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
  1163	                                   addr, next))
  1164	                        return -ENOMEM;
  1165	        } while (dst_pmd++, src_pmd++, addr = next, addr != end);
  1166	        return 0;
  1167	}
  1168	
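
For context: the ia64 defconfig builds without CONFIG_TRANSPARENT_HUGEPAGE,
so the new is_huge_pmd() helper is apparently not visible there, and once the
compiler can no longer constant-fold that check it also stops eliminating the
HPAGE_PMD_SIZE users, which is where the BUILD_BUG() failures come from. A
plausible shape for the helper (a hedged sketch of what a !THP-safe
definition could look like, not necessarily the fix used in the series)
would be:

/*
 * Sketch only; the exact definition in the series may differ. The
 * assumption is that is_huge_pmd() wraps the usual swap/huge/devmap
 * PMD checks and needs a constant-false variant when THP is disabled,
 * so that callers guarded by it are still dead-code eliminated.
 */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline bool is_huge_pmd(pmd_t pmd)
{
        return is_swap_pmd(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd);
}
#else
static inline bool is_huge_pmd(pmd_t pmd)
{
        return false;
}
#endif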

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (8.89 kB)
.config.gz (19.59 kB)

2021-11-14 14:43:59

by kernel test robot

[permalink] [raw]
Subject: [mm/pte_ref] afcc9fb874: kernel_BUG_at_include/linux/pte_ref.h



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: afcc9fb8741f26773a381ac1e159e0172344b7d5 ("[PATCH v3 13/15] mm/pte_ref: free user PTE page table pages")
url: https://github.com/0day-ci/linux/commits/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
base: https://github.com/hnaz/linux-mm master
patch link: https://lore.kernel.org/linux-doc/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+------------------------------------------+------------+------------+
| | e249f0fa9a | afcc9fb874 |
+------------------------------------------+------------+------------+
| boot_successes | 16 | 0 |
| boot_failures | 0 | 14 |
| kernel_BUG_at_include/linux/pte_ref.h | 0 | 14 |
| invalid_opcode:#[##] | 0 | 14 |
| RIP:destroy_args | 0 | 14 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 14 |
+------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 7.245922][ T1] kernel BUG at include/linux/pte_ref.h:56!
[ 7.269161][ T1] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 7.271019][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.0-rc7-mm1-00448-gafcc9fb8741f #1
[ 7.273761][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 7.276418][ T1] RIP: 0010:destroy_args (include/linux/pte_ref.h:56 include/linux/pte_ref.h:123 mm/debug_vm_pgtable.c:1051)
[ 7.277992][ T1] Code: 6b 58 4c 8b 2b 49 8b 3c 24 e8 c6 38 b4 fe 48 c1 e0 06 48 03 05 aa eb 4c ff 8b 50 30 81 e2 00 02 00 f0 81 fa 00 00 00 f0 74 02 <0f> 0b f0 83 68 20 01 75 15 48 89 ea 4c 89 e6 4c 89 ef 48 81 e2 00
All code
========
0: 6b 58 4c 8b imul $0xffffff8b,0x4c(%rax),%ebx
4: 2b 49 8b sub -0x75(%rcx),%ecx
7: 3c 24 cmp $0x24,%al
9: e8 c6 38 b4 fe callq 0xfffffffffeb438d4
e: 48 c1 e0 06 shl $0x6,%rax
12: 48 03 05 aa eb 4c ff add -0xb31456(%rip),%rax # 0xffffffffff4cebc3
19: 8b 50 30 mov 0x30(%rax),%edx
1c: 81 e2 00 02 00 f0 and $0xf0000200,%edx
22: 81 fa 00 00 00 f0 cmp $0xf0000000,%edx
28: 74 02 je 0x2c
2a:* 0f 0b ud2 <-- trapping instruction
2c: f0 83 68 20 01 lock subl $0x1,0x20(%rax)
31: 75 15 jne 0x48
33: 48 89 ea mov %rbp,%rdx
36: 4c 89 e6 mov %r12,%rsi
39: 4c 89 ef mov %r13,%rdi
3c: 48 rex.W
3d: 81 .byte 0x81
3e: e2 00 loop 0x40

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: f0 83 68 20 01 lock subl $0x1,0x20(%rax)
7: 75 15 jne 0x1e
9: 48 89 ea mov %rbp,%rdx
c: 4c 89 e6 mov %r12,%rsi
f: 4c 89 ef mov %r13,%rdi
12: 48 rex.W
13: 81 .byte 0x81
14: e2 00 loop 0x16
[ 7.283473][ T1] RSP: 0000:ffffc90000013da0 EFLAGS: 00010206
[ 7.285295][ T1] RAX: ffffea0000000000 RBX: ffffc90000013dc8 RCX: 0000000000000000
[ 7.287675][ T1] RDX: 00000000f0000200 RSI: ffffffff823848b5 RDI: 0000000000000000
[ 7.290056][ T1] RBP: 000024b4af3bd000 R08: 0000000000000001 R09: 0000000000000040
[ 7.292449][ T1] R10: ffff88842fc2fb60 R11: ffffc90000013d00 R12: ffff88812da63000
[ 7.294926][ T1] R13: ffff88810ca08c00 R14: 0000000140000067 R15: 0000000000000027
[ 7.297349][ T1] FS: 0000000000000000(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
[ 7.300020][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.301949][ T1] CR2: 0000000000000000 CR3: 0000000002612000 CR4: 00000000000006f0
[ 7.304153][ T1] Call Trace:
[ 7.306975][ T1] <TASK>
[ 7.307966][ T1] debug_vm_pgtable (mm/debug_vm_pgtable.c:1334)
[ 7.309435][ T1] ? init_args (mm/debug_vm_pgtable.c:1241)
[ 7.310773][ T1] do_one_initcall (init/main.c:1303)
[ 7.312212][ T1] kernel_init_freeable (init/main.c:1377 init/main.c:1394 init/main.c:1413 init/main.c:1618)
[ 7.313728][ T1] ? rest_init (init/main.c:1499)
[ 7.315002][ T1] kernel_init (init/main.c:1509)
[ 7.316368][ T1] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 7.317692][ T1] </TASK>
[ 7.318697][ T1] Modules linked in:
[ 7.320060][ T1] ---[ end trace 1f2bbe378e842286 ]---
[ 7.321766][ T1] RIP: 0010:destroy_args (include/linux/pte_ref.h:56 include/linux/pte_ref.h:123 mm/debug_vm_pgtable.c:1051)
[ 7.323325][ T1] Code: 6b 58 4c 8b 2b 49 8b 3c 24 e8 c6 38 b4 fe 48 c1 e0 06 48 03 05 aa eb 4c ff 8b 50 30 81 e2 00 02 00 f0 81 fa 00 00 00 f0 74 02 <0f> 0b f0 83 68 20 01 75 15 48 89 ea 4c 89 e6 4c 89 ef 48 81 e2 00
All code
========
0: 6b 58 4c 8b imul $0xffffff8b,0x4c(%rax),%ebx
4: 2b 49 8b sub -0x75(%rcx),%ecx
7: 3c 24 cmp $0x24,%al
9: e8 c6 38 b4 fe callq 0xfffffffffeb438d4
e: 48 c1 e0 06 shl $0x6,%rax
12: 48 03 05 aa eb 4c ff add -0xb31456(%rip),%rax # 0xffffffffff4cebc3
19: 8b 50 30 mov 0x30(%rax),%edx
1c: 81 e2 00 02 00 f0 and $0xf0000200,%edx
22: 81 fa 00 00 00 f0 cmp $0xf0000000,%edx
28: 74 02 je 0x2c
2a:* 0f 0b ud2 <-- trapping instruction
2c: f0 83 68 20 01 lock subl $0x1,0x20(%rax)
31: 75 15 jne 0x48
33: 48 89 ea mov %rbp,%rdx
36: 4c 89 e6 mov %r12,%rsi
39: 4c 89 ef mov %r13,%rdi
3c: 48 rex.W
3d: 81 .byte 0x81
3e: e2 00 loop 0x40

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: f0 83 68 20 01 lock subl $0x1,0x20(%rax)
7: 75 15 jne 0x1e
9: 48 89 ea mov %rbp,%rdx
c: 4c 89 e6 mov %r12,%rsi
f: 4c 89 ef mov %r13,%rdi
12: 48 rex.W
13: 81 .byte 0x81
14: e2 00 loop 0x16


To reproduce:

# build kernel
cd linux
cp config-5.15.0-rc7-mm1-00448-gafcc9fb8741f .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (7.22 kB)
config-5.15.0-rc7-mm1-00448-gafcc9fb8741f (119.14 kB)
job-script (4.57 kB)
dmesg.xz (11.36 kB)