2023-04-26 17:03:13

by Khalid Aziz

Subject: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

Memory pages shared between processes require a page table entry
(PTE) for each process. Each of these PTEs consumes some memory
and, as long as the number of mappings being maintained is small
enough, the space consumed by page tables is not objectionable.
When very few memory pages are shared between processes, the
number of page table entries (PTEs) to maintain is mostly
constrained by the number of pages of memory on the system. As the
number of shared pages and the number of times pages are shared
goes up, the amount of memory consumed by page tables starts to
become significant. This issue does not apply to threads. Any
number of threads can share the same pages inside a process while
sharing the same PTEs. Extending this same model to sharing pages
across processes can eliminate this issue for cross-process
sharing as well.

Some field deployments commonly see memory pages shared across
1000s of processes. On x86_64, each page requires a PTE that is
only 8 bytes long, which is very small compared to the 4K page
size. When 2000 processes map the same page in their address
space, each one of them requires 8 bytes for its PTE and together
that adds up to 16K of memory just to hold the PTEs for one 4K
page. On a database server with a 300GB SGA, a system crash was
seen with an out-of-memory condition when 1500+ clients tried to
share this SGA even though the system had 512GB of memory. On this
server, the worst case scenario of all 1500 processes mapping
every page of the SGA would have required 878GB+ for just the
PTEs. If these PTEs could be shared, the amount of memory saved
would be very significant.

This patch series adds a new flag to the mmap() call -
MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by
a process to hint to the kernel that it wishes to share page table
entries for this file mapping mmap region with other processes.
Any other process that mmaps the same file with the MAP_SHARED_PT
flag can then share the same page table entries. Besides
specifying the MAP_SHARED_PT flag, the processes must map the file
at a PMD-aligned address, with a size that is a multiple of PMD
size, and at the same virtual addresses. This last requirement of
identical virtual addresses can possibly be relaxed if that is the
consensus.
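
As an illustrative sketch (not part of the patches), a process
opting in might look roughly like the code below. The 2MB PMD
size, the file name and the fixed mapping address are assumptions
made only for this example; MAP_SHARED_PT is the value defined in
patch 2/4:

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>

#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT	0x200000	/* from patch 2/4 */
#endif
#define PMD_SZ	(2UL * 1024 * 1024)	/* assumed PMD size (x86_64, 4K pages) */

int main(void)
{
	int fd = open("/dev/shm/sga", O_RDWR);	/* hypothetical file */

	if (fd < 0)
		return 1;
	/*
	 * Address and length must be PMD aligned, and every sharing
	 * process must map the file at the same virtual address.
	 */
	char *p = mmap((void *)0x600000000000UL, 4 * PMD_SZ,
		       PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;	/* PTE is created in the shared page tables */
	return 0;
}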

When mmap() is called with the MAP_SHARED_PT flag, a new host mm
struct is created to hold the shared page tables. The host mm
struct is not attached to any process. The start and size of the
host mm are set to the start and size of the mmap region, and a
VMA covering this range is also added to the host mm struct.
Existing page table entries from the process that creates the
mapping are copied over to the host mm struct. All processes
mapping this shared region are considered guest processes. When a
guest process mmaps the shared region, a vm flag VM_SHARED_PT is
added to the VMAs in the guest process. Upon a page fault, the VMA
is checked for the presence of the VM_SHARED_PT flag. If the flag
is found, the corresponding PMD is updated with the PMD from the
host mm struct so that it points to the page tables in the host mm
struct. The vm_mm pointer of the VMA is also updated to point to
the host mm struct for the duration of fault handling to ensure
fault handling happens in the context of the host mm struct. When
a new PTE is created, it is created in the host mm struct's page
tables and the PMD in the guest mm points to the same PTEs.

This is a basic working implementation. It will need to go through
more testing and refinements. Some notes and questions:

- The PMD alignment and size requirement is currently hard coded
in. Is there a need or desire to make this more flexible and work
with other alignments/sizes? PMD size allows for adapting this
infrastructure to form the basis for hugetlbfs page table sharing
as well. More work will be needed to make that happen.

- Is there a reason to allow a userspace app to query this size and
alignment requirement for MAP_SHARED_PT in some way?

- Shared PTEs mean that an mprotect() call made by one process
affects all processes sharing the same mapping, and that behavior
will need to be documented clearly (see the sketch after these
notes). The effect of an mprotect call being different for
processes using shared page tables is the primary reason to
require an explicit opt-in from userspace processes to share page
tables. With transparent sharing derived from MAP_SHARED alone,
the changed effect of mprotect could break a significant number of
userspace apps. One could work around that by unsharing whenever
mprotect changes modes on a shared mapping, but that introduces
complexity, and the capability to execute a single mprotect to
change modes across 1000s of processes sharing a mapped database
is a feature explicitly asked for by database folks. This
capability has a significant performance advantage compared to
sending messages to every process using the shared mapping to call
mprotect and change modes in each process, or using traps on
permission mismatches in each process.

- This implementation does not allow unmapping page table shared
mappings partially. Should that be supported in future?
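
To make the mprotect() point above concrete, here is an
illustrative sketch (not from the patches) of the semantic
difference one would expect. The file name, mapping address and
PMD size are assumptions, as in the earlier example:

#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT	0x200000	/* from patch 2/4 */
#endif
#define PMD_SZ	(2UL * 1024 * 1024)	/* assumed PMD size */

static char *map_region(void)
{
	int fd = open("/dev/shm/ptshare-demo", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, PMD_SZ) < 0)
		return MAP_FAILED;
	return mmap((void *)0x600000000000UL, PMD_SZ,
		    PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
}

int main(void)
{
	char *p = map_region();

	if (p == MAP_FAILED)
		return 1;
	p[0] = 0;		/* populate a PTE in the shared tables */

	if (fork() == 0) {
		/* second sharer, mapping the same file at the same address */
		char *q = map_region();

		if (q != MAP_FAILED)
			mprotect(q, PMD_SZ, PROT_READ);
		_exit(0);
	}
	wait(NULL);

	/*
	 * With private page tables (plain MAP_SHARED), the child's
	 * mprotect() affects only the child. With MAP_SHARED_PT, one
	 * process's mprotect() is expected to affect all sharers, so
	 * the parent's access rights here change as well.
	 */
	p[0] = 1;
	return 0;
}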

Some concerns in this RFC:

- When the page tables for a process are freed upon process exit,
pmd_free_tlb() gets called at one point to free all PMDs allocated
by the process. For a shared page table, shared PMDs can not be
released when a guest process exits. These shared PMDs are
released when the host mm struct is released, upon the end of the
last reference to the page table shared region hosted by this mm.
For now, to stop PMDs from being released, this RFC introduces the
following change in mm/memory.c, which works but does not feel
like the right approach. Any suggestions for a better long term
approach will be much appreciated:

@@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
pud_t *pud,

pmd = pmd_offset(pud, start);
pud_clear(pud);
- pmd_free_tlb(tlb, pmd, start);
- mm_dec_nr_pmds(tlb->mm);
+ if (shared_pte) {
+ tlb_flush_pud_range(tlb, start, PAGE_SIZE);
+ tlb->freed_tables = 1;
+ } else {
+ pmd_free_tlb(tlb, pmd, start);
+ mm_dec_nr_pmds(tlb->mm);
+ }
}

static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,

- This implementation requires an additional VM flag. Since all
lower 32 bits are currently in use, the new VM flag must come from
the upper 32 bits, which restricts this feature to 64-bit
processors.

- This feature is implemented for file mappings only. Is there a
need to support it for anonymous memory as well?

- Accounting for MAP_SHARED_PT mapped filepages in a process and
pagetable bytes is not quite accurate yet in this RFC and will be
fixed in the non-RFC version of patches.

I appreciate any feedback on these patches and ideas for
improvements before moving these patches out of RFC stage.


Changes from RFC v1:
- Broke the patches up into smaller patches
- Fixed a few bugs related to freeing PTEs and PMDs incorrectly
- Cleaned up the code a bit


Khalid Aziz (4):
mm/ptshare: Add vm flag for shared PTE
mm/ptshare: Add flag MAP_SHARED_PT to mmap()
mm/ptshare: Create new mm struct for page table sharing
mm/ptshare: Add page fault handling for page table shared regions

include/linux/fs.h | 2 +
include/linux/mm.h | 8 +
include/trace/events/mmflags.h | 3 +-
include/uapi/asm-generic/mman-common.h | 1 +
mm/Makefile | 2 +-
mm/internal.h | 21 ++
mm/memory.c | 105 ++++++++--
mm/mmap.c | 88 +++++++++
mm/ptshare.c | 263 +++++++++++++++++++++++++
9 files changed, 476 insertions(+), 17 deletions(-)
create mode 100644 mm/ptshare.c

--
2.37.2


2023-04-26 17:04:53

by Khalid Aziz

Subject: [PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions

Add support for creating a new set of shared page tables in a new
mm_struct upon mmap of a region that can potentially share page
tables. Add page fault handling for this now-shared region. Modify
the free_pgtables path to make sure page tables in the shared
regions are kept intact when a process using the shared page table
region exits and there are other mappers of the shared region.
Clean up the mm_struct holding the shared page tables when the
last process sharing the region exits.

Signed-off-by: Khalid Aziz <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/internal.h | 2 +
mm/memory.c | 105 ++++++++++++++++++++++++++++++------
mm/ptshare.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 232 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3efb8738e26f..924065f721fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1061,4 +1061,6 @@ struct ptshare_data {
int ptshare_new_mm(struct file *file, struct vm_area_struct *vma);
void ptshare_del_mm(struct vm_area_struct *vm);
int ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma);
+extern vm_fault_t find_shared_vma(struct vm_area_struct **vmap,
+ unsigned long *addrp, unsigned int flags);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 01a23ad48a04..c67318ffd001 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -172,17 +172,28 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
* has been handled earlier when unmapping all the memory regions.
*/
static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
- unsigned long addr)
+ unsigned long addr, bool shared_pte)
{
pgtable_t token = pmd_pgtable(*pmd);
pmd_clear(pmd);
+ /*
+ * if this address range shares page tables with other processes,
+ * do not release pte pages. Those pages will be released when
+ * host mm that hosts these pte pages is released
+ */
+ if (shared_pte) {
+ tlb_flush_pmd_range(tlb, addr, PAGE_SIZE);
+ tlb->freed_tables = 1;
+ return;
+ }
pte_free_tlb(tlb, token, addr);
mm_dec_nr_ptes(tlb->mm);
}

static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
unsigned long addr, unsigned long end,
- unsigned long floor, unsigned long ceiling)
+ unsigned long floor, unsigned long ceiling,
+ bool shared_pte)
{
pmd_t *pmd;
unsigned long next;
@@ -194,7 +205,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
- free_pte_range(tlb, pmd, addr);
+ free_pte_range(tlb, pmd, addr, shared_pte);
} while (pmd++, addr = next, addr != end);

start &= PUD_MASK;
@@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,

pmd = pmd_offset(pud, start);
pud_clear(pud);
- pmd_free_tlb(tlb, pmd, start);
- mm_dec_nr_pmds(tlb->mm);
+ if (shared_pte) {
+ tlb_flush_pud_range(tlb, start, PAGE_SIZE);
+ tlb->freed_tables = 1;
+ } else {
+ pmd_free_tlb(tlb, pmd, start);
+ mm_dec_nr_pmds(tlb->mm);
+ }
}

static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
unsigned long addr, unsigned long end,
- unsigned long floor, unsigned long ceiling)
+ unsigned long floor, unsigned long ceiling,
+ bool shared_pte)
{
pud_t *pud;
unsigned long next;
@@ -228,7 +245,8 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- free_pmd_range(tlb, pud, addr, next, floor, ceiling);
+ free_pmd_range(tlb, pud, addr, next, floor, ceiling,
+ shared_pte);
} while (pud++, addr = next, addr != end);

start &= P4D_MASK;
@@ -250,7 +268,8 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,

static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
unsigned long addr, unsigned long end,
- unsigned long floor, unsigned long ceiling)
+ unsigned long floor, unsigned long ceiling,
+ bool shared_pte)
{
p4d_t *p4d;
unsigned long next;
@@ -262,7 +281,8 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d))
continue;
- free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+ free_pud_range(tlb, p4d, addr, next, floor, ceiling,
+ shared_pte);
} while (p4d++, addr = next, addr != end);

start &= PGDIR_MASK;
@@ -284,9 +304,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
/*
* This function frees user-level page tables of a process.
*/
-void free_pgd_range(struct mmu_gather *tlb,
+static void _free_pgd_range(struct mmu_gather *tlb,
unsigned long addr, unsigned long end,
- unsigned long floor, unsigned long ceiling)
+ unsigned long floor, unsigned long ceiling,
+ bool shared_pte)
{
pgd_t *pgd;
unsigned long next;
@@ -342,10 +363,18 @@ void free_pgd_range(struct mmu_gather *tlb,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
+ free_p4d_range(tlb, pgd, addr, next, floor, ceiling,
+ shared_pte);
} while (pgd++, addr = next, addr != end);
}

+void free_pgd_range(struct mmu_gather *tlb,
+ unsigned long addr, unsigned long end,
+ unsigned long floor, unsigned long ceiling)
+{
+ _free_pgd_range(tlb, addr, end, floor, ceiling, false);
+}
+
void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *vma, unsigned long floor,
unsigned long ceiling)
@@ -375,16 +404,20 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
} else {
/*
* Optimization: gather nearby vmas into one call down
+ * but make sure vmas not sharing page tables do
+ * not get combined with vmas sharing page tables
*/
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
- && !is_vm_hugetlb_page(next)) {
+ && !is_vm_hugetlb_page(next)
+ && (vma_is_shared(next) == vma_is_shared(vma))) {
vma = next;
next = mas_find(&mas, ceiling - 1);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
- free_pgd_range(tlb, addr, vma->vm_end,
- floor, next ? next->vm_start : ceiling);
+ _free_pgd_range(tlb, addr, vma->vm_end,
+ floor, next ? next->vm_start : ceiling,
+ vma_is_shared(vma));
}
vma = next;
} while (vma);
@@ -5181,6 +5214,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, struct pt_regs *regs)
{
vm_fault_t ret;
+ bool shared = false;
+ struct mm_struct *orig_mm;

__set_current_state(TASK_RUNNING);

@@ -5191,6 +5226,16 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (ret)
return ret;

+ orig_mm = vma->vm_mm;
+ if (unlikely(vma_is_shared(vma))) {
+ ret = find_shared_vma(&vma, &address, flags);
+ if (ret)
+ return ret;
+ if (!vma)
+ return VM_FAULT_SIGSEGV;
+ shared = true;
+ }
+
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
flags & FAULT_FLAG_REMOTE))
@@ -5212,6 +5257,36 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,

lru_gen_exit_fault();

+ /*
+ * Release the read lock on shared VMA's parent mm unless
+ * __handle_mm_fault released the lock already.
+ * __handle_mm_fault sets VM_FAULT_RETRY in return value if
+ * it released mmap lock. If lock was released, that implies
+ * the lock would have been released on task's original mm if
+ * this were not a shared PTE vma. To keep lock state consistent,
+ * make sure to release the lock on task's original mm
+ */
+ if (shared) {
+ int release_mmlock = 1;
+
+ if (!(ret & VM_FAULT_RETRY)) {
+ mmap_read_unlock(vma->vm_mm);
+ release_mmlock = 0;
+ } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
+ (flags & FAULT_FLAG_RETRY_NOWAIT)) {
+ mmap_read_unlock(vma->vm_mm);
+ release_mmlock = 0;
+ }
+
+ /*
+ * Reset guest vma pointers that were set up in
+ * find_shared_vma() to process this fault.
+ */
+ vma->vm_mm = orig_mm;
+ if (release_mmlock)
+ mmap_read_unlock(orig_mm);
+ }
+
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
/*
diff --git a/mm/ptshare.c b/mm/ptshare.c
index f6784268958c..e0991a877355 100644
--- a/mm/ptshare.c
+++ b/mm/ptshare.c
@@ -13,6 +13,136 @@
#include <asm/pgalloc.h>
#include "internal.h"

+/* Return the pmd for addr in mm, allocating missing intermediate
+ * page table levels below an already-present pgd entry. */
+static pmd_t
+*get_pmd(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset(mm, addr);
+ if (pgd_none(*pgd))
+ return NULL;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d)) {
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+ }
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud)) {
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud)
+ return NULL;
+ }
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd)) {
+ pmd = pmd_alloc(mm, pud, addr);
+ if (!pmd)
+ return NULL;
+ }
+
+ return pmd;
+}
+
+/*
+ * Find the shared pmd entries in host mm struct and install them into
+ * guest page tables.
+ */
+static int
+ptshare_copy_pmd(struct mm_struct *host_mm, struct mm_struct *guest_mm,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ pgd_t *guest_pgd;
+ p4d_t *guest_p4d;
+ pud_t *guest_pud;
+ pmd_t *host_pmd;
+ spinlock_t *host_ptl, *guest_ptl;
+
+ guest_pgd = pgd_offset(guest_mm, addr);
+ guest_p4d = p4d_offset(guest_pgd, addr);
+ if (p4d_none(*guest_p4d)) {
+ guest_p4d = p4d_alloc(guest_mm, guest_pgd, addr);
+ if (!guest_p4d)
+ return 1;
+ }
+
+ guest_pud = pud_offset(guest_p4d, addr);
+ if (pud_none(*guest_pud)) {
+ host_pmd = get_pmd(host_mm, addr);
+ if (!host_pmd)
+ return 1;
+
+ get_page(virt_to_page(host_pmd));
+ host_ptl = pmd_lockptr(host_mm, host_pmd);
+ guest_ptl = pud_lockptr(guest_mm, guest_pud);
+ spin_lock(host_ptl);
+ spin_lock(guest_ptl);
+ pud_populate(guest_mm, guest_pud,
+ (pmd_t *)((unsigned long)host_pmd & PAGE_MASK));
+ put_page(virt_to_page(host_pmd));
+ spin_unlock(guest_ptl);
+ spin_unlock(host_ptl);
+ }
+
+ return 0;
+}
+
+/*
+ * Find the shared page tables in hosting mm struct and install those in
+ * the guest mm struct
+ */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp,
+ unsigned int flags)
+{
+ struct ptshare_data *info;
+ struct mm_struct *host_mm;
+ struct vm_area_struct *host_vma, *guest_vma = *vmap;
+ unsigned long host_addr;
+ pmd_t *guest_pmd, *host_pmd;
+
+ if ((!guest_vma->vm_file) || (!guest_vma->vm_file->f_mapping))
+ return 0;
+ info = guest_vma->vm_file->f_mapping->ptshare_data;
+ if (!info) {
+ pr_warn("VM_SHARED_PT vma with NULL ptshare_data");
+ dump_stack_print_info(KERN_WARNING);
+ return 0;
+ }
+ host_mm = info->mm;
+
+ mmap_read_lock(host_mm);
+ host_addr = *addrp - guest_vma->vm_start + host_mm->mmap_base;
+ host_pmd = get_pmd(host_mm, host_addr);
+ guest_pmd = get_pmd(guest_vma->vm_mm, *addrp);
+ if (!pmd_same(*guest_pmd, *host_pmd)) {
+ set_pmd(guest_pmd, *host_pmd);
+ mmap_read_unlock(host_mm);
+ return VM_FAULT_NOPAGE;
+ }
+
+ *addrp = host_addr;
+ host_vma = find_vma(host_mm, host_addr);
+ if (!host_vma)
+ return VM_FAULT_SIGSEGV;
+
+ /*
+ * Point vm_mm for the faulting vma to the mm struct holding shared
+ * page tables so the fault handling will happen in the right
+ * shared context
+ */
+ guest_vma->vm_mm = host_mm;
+
+ return 0;
+}
+
/*
* Create a new mm struct that will hold the shared PTEs. Pointer to
* this new mm is stored in the data structure ptshare_data which also
@@ -38,6 +168,7 @@ ptshare_new_mm(struct file *file, struct vm_area_struct *vma)
new_mm->task_size = len;
if (!new_mm->task_size)
new_mm->task_size--;
+ new_mm->owner = NULL;

info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info) {
@@ -63,7 +194,7 @@ ptshare_new_mm(struct file *file, struct vm_area_struct *vma)
* insert vma into mm holding shared page tables
*/
int
-ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma)
+ptshare_insert_vma(struct mm_struct *host_mm, struct vm_area_struct *vma)
{
struct vm_area_struct *new_vma;
int err = 0;
@@ -80,12 +211,18 @@ ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma)
*/
vm_flags_clear(new_vma, (VM_SHARED | VM_SHARED_PT));
vm_flags_set(new_vma, VM_NOHUGEPAGE);
- new_vma->vm_mm = mm;
+ new_vma->vm_mm = host_mm;

- err = insert_vm_struct(mm, new_vma);
+ err = insert_vm_struct(host_mm, new_vma);
if (err)
return -ENOMEM;

+ /*
+ * Copy the PMD entries from host mm to guest so they use the
+ * same PTEs
+ */
+ err = ptshare_copy_pmd(host_mm, vma->vm_mm, vma, vma->vm_start);
+
return err;
}

--
2.37.2

2023-04-26 17:05:09

by Khalid Aziz

Subject: [PATCH RFC v2 2/4] mm/ptshare: Add flag MAP_SHARED_PT to mmap()

When a process mmaps a file with the MAP_SHARED flag, it is
possible that other processes mapping the same file with the
MAP_SHARED flag and the same permissions could share the page
table entries as well, instead of creating duplicate entries. This
patch introduces a new flag, MAP_SHARED_PT, for mmap(), which a
process can use to hint that it can share page tables with other
processes using the same mapping.

Signed-off-by: Khalid Aziz <[email protected]>
---
include/uapi/asm-generic/mman-common.h | 1 +
mm/mmap.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..4d23456b5915 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@
#define MAP_HUGETLB 0x040000 /* create a huge page mapping */
#define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */
#define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_SHARED_PT 0x200000 /* Shared page table mappings */

#define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/mm/mmap.c b/mm/mmap.c
index d5475fbf5729..8b46d465f8d4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1197,6 +1197,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
vm_flags_t vm_flags;
int pkey = 0;
+ int ptshare = 0;

validate_mm(mm);
*populate = 0;
@@ -1234,6 +1235,21 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;

+ /*
+ * If MAP_SHARED_PT is set, MAP_SHARED or MAP_SHARED_VALIDATE must
+ * be set as well
+ */
+ if (flags & MAP_SHARED_PT) {
+#if VM_SHARED_PT
+ if (flags & (MAP_SHARED | MAP_SHARED_VALIDATE))
+ ptshare = 1;
+ else
+ return -EINVAL;
+#else
+ return -EINVAL;
+#endif
+ }
+
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
--
2.37.2

2023-04-26 21:31:26

by Mike Kravetz

Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

On 04/26/23 10:49, Khalid Aziz wrote:
> Memory pages shared between processes require a page table entry
> (PTE) for each process. Each of these PTE consumes some of the
> memory and as long as number of mappings being maintained is small
> enough, this space consumed by page tables is not objectionable.
> When very few memory pages are shared between processes, the number
> of page table entries (PTEs) to maintain is mostly constrained by
> the number of pages of memory on the system. As the number of
> shared pages and the number of times pages are shared goes up,
> amount of memory consumed by page tables starts to become
> significant. This issue does not apply to threads. Any number of
> threads can share the same pages inside a process while sharing the
> same PTEs. Extending this same model to sharing pages across
> processes can eliminate this issue for sharing across processes as
> well.
>
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.

When hugetlb pmd sharing was introduced the motivating factor was performance
as much as memory savings. See commit,
39dde65c9940 [PATCH] shared page table for hugetlb page

I have not verified the claims in that commit message, but it does sound
reasonable. My assumption is that the same would apply here.

>
> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
> This flag can be specified along with MAP_SHARED by a process to
> hint to kernel that it wishes to share page table entries for this
> file mapping mmap region with other processes. Any other process
> that mmaps the same file with MAP_SHARED_PT flag can then share the
> same page table entries. Besides specifying MAP_SHARED_PT flag, the
> processes must map the files at a PMD aligned address with a size
> that is a multiple of PMD size and at the same virtual addresses.
> This last requirement of same virtual addresses can possibly be
> relaxed if that is the consensus.

I would think the same virtual address requirement is problematic in the
case of ASLR. hugetlb pmd sharing does not have the 'same virtual
addresses' requirement. My guess is that requiring the same virtual
address does not make code simpler.

> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct
> is created to hold the shared page tables. Host mm struct is not
> attached to a process. Start and size of host mm are set to the
> start and size of the mmap region and a VMA covering this range is
> also added to host mm struct. Existing page table entries from the
> process that creates the mapping are copied over to the host mm
> struct.

It seems there is one host mm struct per shared mapping. Is that
correct? And, the host mm is the size of that single mapping?
Suppose the following:
- process A maps shared file offset 0 to 2xPMD_SIZE
- process B maps shared file offset 0 to 2xPMD_SIZE
It then makes sense A and B would use the same host mm. Now,
- process C maps shared file offset 0 to 4xPMD_SIZE
Does C share anything with A and B? Are there multiple host mm structs?

Just wondering at a high level how this would work without looking at
the code.

> All processes mapping this shared region are considered
> guest processes. When a guest process mmap's the shared region, a vm
> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page
> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the
> flag is found, its corresponding PMD is updated with the PMD from
> host mm struct so the PMD will point to the page tables in host mm
> struct. vm_mm pointer of the VMA is also updated to point to host mm
> struct for the duration of fault handling to ensure fault handling
> happens in the context of host mm struct. When a new PTE is
> created, it is created in the host mm struct page tables and the PMD
> in guest mm points to the same PTEs.
>
> This is a basic working implementation. It will need to go through
> more testing and refinements. Some notes and questions:
>
> - PMD size alignment and size requirement is currently hard coded
> in. Is there a need or desire to make this more flexible and work
> with other alignments/sizes? PMD size allows for adapting this
> infrastructure to form the basis for hugetlbfs page table sharing
> as well. More work will be needed to make that happen.
>
> - Is there a reason to allow a userspace app to query this size and
> alignment requirement for MAP_SHARED_PT in some way?
>
> - Shared PTEs means mprotect() call made by one process affects all
> processes sharing the same mapping and that behavior will need to
> be documented clearly. Effect of mprotect call being different for
> processes using shared page tables is the primary reason to
> require an explicit opt-in from userspace processes to share page
> tables. With a transparent sharing derived from MAP_SHARED alone,
> changed effect of mprotect can break significant number of
> userspace apps. One could work around that by unsharing whenever
> mprotect changes modes on shared mapping but that introduces
> complexity and the capability to execute a single mprotect to
> change modes across 1000's of processes sharing a mapped database
> is a feature explicitly asked for by database folks. This
> capability has significant performance advantage when compared to
> mechanism of sending messages to every process using shared
> mapping to call mprotect and change modes in each process, or
> using traps on permissions mismatch in each process.

I would guess this is more than just mprotect, and anything that impacts
page tables. Correct? For example MADV_DONTNEED, MADV_HUGEPAGE,
MADV_NOHUGEPAGE.

> - This implementation does not allow unmapping page table shared
> mappings partially. Should that be supported in future?
>
> Some concerns in this RFC:
>
> - When page tables for a process are freed upon process exit,
> pmd_free_tlb() gets called at one point to free all PMDs allocated
> by the process. For a shared page table, shared PMDs can not be
> released when a guest process exits. These shared PMDs are
> released when host mm struct is released upon end of last
> reference to page table shared region hosted by this mm. For now
> to stop PMDs being released, this RFC introduces following change
> in mm/memory.c which works but does not feel like the right
> approach. Any suggestions for a better long term approach will be
> very appreciated:
>
> @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
> pud_t *pud,
>
> pmd = pmd_offset(pud, start);
> pud_clear(pud);
> - pmd_free_tlb(tlb, pmd, start);
> - mm_dec_nr_pmds(tlb->mm);
> + if (shared_pte) {
> + tlb_flush_pud_range(tlb, start, PAGE_SIZE);
> + tlb->freed_tables = 1;
> + } else {
> + pmd_free_tlb(tlb, pmd, start);
> + mm_dec_nr_pmds(tlb->mm);
> + }
> }
>
> static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>
> - This implementation requires an additional VM flag. Since all lower
> 32 bits are currently in use, the new VM flag must come from upper
> 32 bits which restricts this feature to 64-bit processors.
>
> - This feature is implemented for file mappings only. Is there a
> need to support it for anonymous memory as well?

I have not looked at the implementation, are 'file mappings' only
mappings using the page cache? Or, do you just need a file descriptor?
Would a file descriptor created via memfd_create work for anonymous
memory?

--
Mike Kravetz

>
> - Accounting for MAP_SHARED_PT mapped filepages in a process and
> pagetable bytes is not quite accurate yet in this RFC and will be
> fixed in the non-RFC version of patches.
>
> I appreciate any feedback on these patches and ideas for
> improvements before moving these patches out of RFC stage.
>
>
> Changes from RFC v1:
> - Broken the patches up into smaller patches
> - Fixed a few bugs related to freeing PTEs and PMDs incorrectly
> - Cleaned up the code a bit
>
>
> Khalid Aziz (4):
> mm/ptshare: Add vm flag for shared PTE
> mm/ptshare: Add flag MAP_SHARED_PT to mmap()
> mm/ptshare: Create new mm struct for page table sharing
> mm/ptshare: Add page fault handling for page table shared regions
>
> include/linux/fs.h | 2 +
> include/linux/mm.h | 8 +
> include/trace/events/mmflags.h | 3 +-
> include/uapi/asm-generic/mman-common.h | 1 +
> mm/Makefile | 2 +-
> mm/internal.h | 21 ++
> mm/memory.c | 105 ++++++++--
> mm/mmap.c | 88 +++++++++
> mm/ptshare.c | 263 +++++++++++++++++++++++++
> 9 files changed, 476 insertions(+), 17 deletions(-)
> create mode 100644 mm/ptshare.c
>
> --
> 2.37.2
>

2023-04-27 16:49:06

by Khalid Aziz

Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

On 4/26/23 15:27, Mike Kravetz wrote:
> On 04/26/23 10:49, Khalid Aziz wrote:
>> Memory pages shared between processes require a page table entry
>> (PTE) for each process. Each of these PTE consumes some of the
>> memory and as long as number of mappings being maintained is small
>> enough, this space consumed by page tables is not objectionable.
>> When very few memory pages are shared between processes, the number
>> of page table entries (PTEs) to maintain is mostly constrained by
>> the number of pages of memory on the system. As the number of
>> shared pages and the number of times pages are shared goes up,
>> amount of memory consumed by page tables starts to become
>> significant. This issue does not apply to threads. Any number of
>> threads can share the same pages inside a process while sharing the
>> same PTEs. Extending this same model to sharing pages across
>> processes can eliminate this issue for sharing across processes as
>> well.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system crash was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
>
> When hugetlb pmd sharing was introduced the motivating factor was performance
> as much as memory savings. See commit,
> 39dde65c9940 [PATCH] shared page table for hugetlb page
>
> I have not verified the claims in that commit message, but it does sound
> reasonable. My assumption is that the same would apply here.

Hi Mike,

Thanks for the feedback. I looked up the commit log for 39dde65c9940. You are right, the same does apply here.

>
>>
>> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
>> This flag can be specified along with MAP_SHARED by a process to
>> hint to kernel that it wishes to share page table entries for this
>> file mapping mmap region with other processes. Any other process
>> that mmaps the same file with MAP_SHARED_PT flag can then share the
>> same page table entries. Besides specifying MAP_SHARED_PT flag, the
>> processes must map the files at a PMD aligned address with a size
>> that is a multiple of PMD size and at the same virtual addresses.
>> This last requirement of same virtual addresses can possibly be
>> relaxed if that is the consensus.
>
> I would think the same virtual address requirement is problematic in the
> case of ASLR. hugetlb pmd sharing does not have the 'same virtual
> addresses' requirement. My guess is that requiring the same virtual
> address does not make code simpler.

The same virtual address requirement made the previous mshare implementation quite a bit simpler, but the reimplementation
in this RFC does not rely as much on the virtual addresses being the same. This does open up the possibility of removing
this requirement. I just haven't looked through the potential complications enough.

>
>> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct
>> is created to hold the shared page tables. Host mm struct is not
>> attached to a process. Start and size of host mm are set to the
>> start and size of the mmap region and a VMA covering this range is
>> also added to host mm struct. Existing page table entries from the
>> process that creates the mapping are copied over to the host mm
>> struct.
>
> It seems there is one host mm struct per shared mapping. Is that
> correct? And, the host mm is the size of that single mapping?

There is one host mm per shared file. This was done to allow multiple VMAs to exist in this mm, since a file can
potentially have multiple shared regions. In this RFC, I am creating only one shared region per file for now.

> Suppose the following:
> - process A maps shared file offset 0 to 2xPMD_SIZE
> - process B maps shared file offset 0 to 2xPMD_SIZE
> It then makes sense A and B would use the same host mm. Now,

Yes, that is correct.

> - process C maps shared file offset 0 to 4xPMD_SIZE
> Does C share anything with A and B? Are there multiple host mm structs?

In the RFC implementation, C does not share anything with A and B. The code is written to allow for expansion to support
this sharing, as well as sharing multiple non-contiguous regions of the file. My goal is to debug and finalize the
sharing infrastructure for the simple case and then start expanding it. In this specific case, the implementation would
expand the existing shared VMAs to cover offset 0 to 4xPMD.

>
> Just wondering at a high level how this would work without looking at
> the code.
>
>> All processes mapping this shared region are considered
>> guest processes. When a guest process mmap's the shared region, a vm
>> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page
>> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the
>> flag is found, its corresponding PMD is updated with the PMD from
>> host mm struct so the PMD will point to the page tables in host mm
>> struct. vm_mm pointer of the VMA is also updated to point to host mm
>> struct for the duration of fault handling to ensure fault handling
>> happens in the context of host mm struct. When a new PTE is
>> created, it is created in the host mm struct page tables and the PMD
>> in guest mm points to the same PTEs.
>>
>> This is a basic working implementation. It will need to go through
>> more testing and refinements. Some notes and questions:
>>
>> - PMD size alignment and size requirement is currently hard coded
>> in. Is there a need or desire to make this more flexible and work
>> with other alignments/sizes? PMD size allows for adapting this
>> infrastructure to form the basis for hugetlbfs page table sharing
>> as well. More work will be needed to make that happen.
>>
>> - Is there a reason to allow a userspace app to query this size and
>> alignment requirement for MAP_SHARED_PT in some way?
>>
>> - Shared PTEs means mprotect() call made by one process affects all
>> processes sharing the same mapping and that behavior will need to
>> be documented clearly. Effect of mprotect call being different for
>> processes using shared page tables is the primary reason to
>> require an explicit opt-in from userspace processes to share page
>> tables. With a transparent sharing derived from MAP_SHARED alone,
>> changed effect of mprotect can break significant number of
>> userspace apps. One could work around that by unsharing whenever
>> mprotect changes modes on shared mapping but that introduces
>> complexity and the capability to execute a single mprotect to
>> change modes across 1000's of processes sharing a mapped database
>> is a feature explicitly asked for by database folks. This
>> capability has significant performance advantage when compared to
>> mechanism of sending messages to every process using shared
>> mapping to call mprotect and change modes in each process, or
>> using traps on permissions mismatch in each process.
>
> I would guess this is more than just mprotect, and anything that impacts
> page tables. Correct? For example MADV_DONTNEED, MADV_HUGEPAGE,
> MADV_NOHUGEPAGE.

Yes. For instance, the RFC implementation allows one process to set MADV_DONTNEED and remove PTEs, which removes them for
all other processes sharing the mapping. That behavior does not quite sound right. A more robust implementation would
possibly remove PTEs only if all sharing processes have set MADV_DONTNEED. Other flags might need different treatment;
for instance, MADV_WILLNEED should trigger read-ahead even if set by just one process.


>
>> - This implementation does not allow unmapping page table shared
>> mappings partially. Should that be supported in future?
>>
>> Some concerns in this RFC:
>>
>> - When page tables for a process are freed upon process exit,
>> pmd_free_tlb() gets called at one point to free all PMDs allocated
>> by the process. For a shared page table, shared PMDs can not be
>> released when a guest process exits. These shared PMDs are
>> released when host mm struct is released upon end of last
>> reference to page table shared region hosted by this mm. For now
>> to stop PMDs being released, this RFC introduces following change
>> in mm/memory.c which works but does not feel like the right
>> approach. Any suggestions for a better long term approach will be
>> very appreciated:
>>
>> @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
>> pud_t *pud,
>>
>> pmd = pmd_offset(pud, start);
>> pud_clear(pud);
>> - pmd_free_tlb(tlb, pmd, start);
>> - mm_dec_nr_pmds(tlb->mm);
>> + if (shared_pte) {
>> + tlb_flush_pud_range(tlb, start, PAGE_SIZE);
>> + tlb->freed_tables = 1;
>> + } else {
>> + pmd_free_tlb(tlb, pmd, start);
>> + mm_dec_nr_pmds(tlb->mm);
>> + }
>> }
>>
>> static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>>
>> - This implementation requires an additional VM flag. Since all lower
>> 32 bits are currently in use, the new VM flag must come from upper
>> 32 bits which restricts this feature to 64-bit processors.
>>
>> - This feature is implemented for file mappings only. Is there a
>> need to support it for anonymous memory as well?
>
> I have not looked at the implementation, are 'file mappings' only
> mappings using the page cache? Or, do you just need a file descriptor?
> Would a file descriptor created via memfd_create work for anonymous
> memory?
>

I look for just a file pointer, so an fd created by memfd would work for anonymous memory.
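
As an illustrative sketch (not from this series), anonymous memory could then be shared roughly as below, assuming the
PMD alignment/size and same-address requirements from the cover letter still apply; the fixed address is an assumption
for the example:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT	0x200000	/* from patch 2/4 */
#endif
#define PMD_SZ	(2UL * 1024 * 1024)	/* assumed PMD size */

int main(void)
{
	/* anonymous memory behind an fd, so ptshare still sees a file pointer */
	int fd = memfd_create("shared-anon", 0);

	if (fd < 0 || ftruncate(fd, 4 * PMD_SZ) < 0)
		return 1;
	void *p = mmap((void *)0x600000000000UL, 4 * PMD_SZ,
		       PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
	return p == MAP_FAILED;
}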

Thanks,
Khalid

2023-04-29 15:02:44

by kernel test robot

Subject: Re: [PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions



Hello,

kernel test robot noticed "WARNING:bad_unlock_balance_detected" on:

commit: a2eef9e49f572b7b2dfa23fc32567e83da9573d5 ("[PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions")
url: https://github.com/intel-lab-lkp/linux/commits/Khalid-Aziz/mm-ptshare-Add-vm-flag-for-shared-PTE/20230427-005143
base: https://git.kernel.org/cgit/linux/kernel/git/arnd/asm-generic.git master
patch link: https://lore.kernel.org/all/9edffd2a12a049a42d9a2c216e3f999aae7e65a4.1682453344.git.khalid.aziz@oracle.com/
patch subject: [PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions

in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:

sc_nr_hugepages: 2
group: mm

test-description: The kernel contains a set of "self tests" under the tools/testing/selftests/ directory. These are intended to be small unit tests to exercise individual code paths in the kernel.
test-url: https://www.kernel.org/doc/Documentation/kselftest.txt


compiler: gcc-11
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (Kaby Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


[ 183.878671][ T3695] WARNING: bad unlock balance detected!
[ 183.884040][ T3695] 6.3.0-rc1-00015-ga2eef9e49f57 #1 Not tainted
[ 183.890014][ T3695] -------------------------------------
[ 183.895382][ T3695] userfaultfd/3695 is trying to release lock (&mm->mmap_lock) at:
[ 183.903000][ T3695] handle_mm_fault (mm/memory.c:5276)
[ 183.909324][ T3695] but there are no more locks to release!
[ 183.914866][ T3695]
[ 183.914866][ T3695] other info that might help us debug this:
[ 183.922738][ T3695] no locks held by userfaultfd/3695.
[ 183.927847][ T3695]
[ 183.927847][ T3695] stack backtrace:
[ 183.933560][ T3695] CPU: 7 PID: 3695 Comm: userfaultfd Not tainted 6.3.0-rc1-00015-ga2eef9e49f57 #1
[ 183.942558][ T3695] Hardware name: Dell Inc. OptiPlex 7050/062KRH, BIOS 1.2.0 12/22/2016
[ 183.950604][ T3695] Call Trace:
[ 183.953730][ T3695] <TASK>
[ 183.956510][ T3695] dump_stack_lvl (lib/dump_stack.c:108)
[ 183.960845][ T3695] __lock_release (kernel/locking/lockdep.c:5346)
[ 183.965354][ T3695] ? lock_downgrade (kernel/locking/lockdep.c:5321)
[ 183.970032][ T3695] ? dump_stack_print_info (lib/dump_stack.c:70)
[ 183.975231][ T3695] ? handle_mm_fault (mm/memory.c:5276)
[ 183.979998][ T3695] lock_release (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5691)
[ 183.984248][ T3695] up_read (kernel/locking/rwsem.c:1616)
[ 183.987978][ T3695] handle_mm_fault (mm/memory.c:5276)
[ 183.992571][ T3695] ? lock_is_held_type (kernel/locking/lockdep.c:5410 kernel/locking/lockdep.c:5712)
[ 183.997424][ T3695] ? __handle_mm_fault (mm/memory.c:5201)
[ 184.002364][ T3695] ? find_vma (mm/mmap.c:1854)
[ 184.006459][ T3695] ? can_vma_merge_before+0x330/0x330
[ 184.012283][ T3695] do_user_addr_fault (arch/x86/mm/fault.c:1407)
[ 184.017146][ T3695] exc_page_fault (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 arch/x86/mm/fault.c:1506 arch/x86/mm/fault.c:1554)
[ 184.021496][ T3695] asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
[ 184.026186][ T3695] RIP: 0033:0x403b2e
[ 184.029923][ T3695] Code: 89 75 e0 48 89 55 d8 48 c7 45 f8 00 00 00 00 eb 2c 48 8b 55 e8 48 8b 45 f8 48 01 d0 0f b6 10 48 8b 4d e0 48 8b 45 f8 48 01 c8 <0f> b6 00 38 c2 74 07 b8 01 00 00 00 eb 14 48 83 45 f8 01 48 8b 45
All code
========
0: 89 75 e0 mov %esi,-0x20(%rbp)
3: 48 89 55 d8 mov %rdx,-0x28(%rbp)
7: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
e: 00
f: eb 2c jmp 0x3d
11: 48 8b 55 e8 mov -0x18(%rbp),%rdx
15: 48 8b 45 f8 mov -0x8(%rbp),%rax
19: 48 01 d0 add %rdx,%rax
1c: 0f b6 10 movzbl (%rax),%edx
1f: 48 8b 4d e0 mov -0x20(%rbp),%rcx
23: 48 8b 45 f8 mov -0x8(%rbp),%rax
27: 48 01 c8 add %rcx,%rax
2a:* 0f b6 00 movzbl (%rax),%eax <-- trapping instruction
2d: 38 c2 cmp %al,%dl
2f: 74 07 je 0x38
31: b8 01 00 00 00 mov $0x1,%eax
36: eb 14 jmp 0x4c
38: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
3d: 48 rex.W
3e: 8b .byte 0x8b
3f: 45 rex.RB

Code starting with the faulting instruction
===========================================
0: 0f b6 00 movzbl (%rax),%eax
3: 38 c2 cmp %al,%dl
5: 74 07 je 0xe
7: b8 01 00 00 00 mov $0x1,%eax
c: eb 14 jmp 0x22
e: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
13: 48 rex.W
14: 8b .byte 0x8b
15: 45 rex.RB
[ 184.049280][ T3695] RSP: 002b:00007ffe05336170 EFLAGS: 00010206
[ 184.055169][ T3695] RAX: 00007f57bde00000 RBX: 00007ffe05336340 RCX: 00007f57bde00000
[ 184.062955][ T3695] RDX: 00000000000000ff RSI: 00007f57bde00000 RDI: 00007f57ce000000
[ 184.070742][ T3695] RBP: 00007ffe05336170 R08: 00000000ffffffff R09: 0000000000000000
[ 184.078528][ T3695] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
[ 184.086315][ T3695] R13: 00007ffe05336560 R14: 000000000040be00 R15: 00007f57e6668020
[ 184.094107][ T3695] </TASK>
[ 184.096984][ T3695] ------------[ cut here ]------------
[ 184.102287][ T3695] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff888160af9d28, owner = 0x1, curr 0xffff8881b2dc8040, list empty
[ 184.116154][ T3695] WARNING: CPU: 7 PID: 3695 at kernel/locking/rwsem.c:1348 __up_read (kernel/locking/rwsem.c:1348 (discriminator 15))
[ 184.125073][ T3695] Modules linked in: openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl_msr intel_rapl_common btrfs x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq zstd_compress kvm_intel libcrc32c kvm i915 sd_mod t10_pi irqbypass crct10dif_pclmul crc64_rocksoft_generic crc32_pclmul drm_buddy crc64_rocksoft crc32c_intel intel_gtt crc64 ghash_clmulni_intel drm_display_helper sha512_ssse3 sg rapl drm_kms_helper ipmi_devintf ipmi_msghandler mei_wdt wmi_bmof intel_cstate syscopyarea ahci libahci i2c_i801 i2c_designware_platform sysfillrect mei_me intel_uncore i2c_smbus idma64 i2c_designware_core sysimgblt libata ttm mei video wmi intel_pmc_core acpi_pad binfmt_misc fuse drm ip_tables
[ 184.190695][ T3695] CPU: 7 PID: 3695 Comm: userfaultfd Not tainted 6.3.0-rc1-00015-ga2eef9e49f57 #1
[ 184.199697][ T3695] Hardware name: Dell Inc. OptiPlex 7050/062KRH, BIOS 1.2.0 12/22/2016
[ 184.207763][ T3695] RIP: 0010:__up_read (kernel/locking/rwsem.c:1348 (discriminator 15))
[ 184.212637][ T3695] Code: 3c 02 00 0f 85 61 03 00 00 53 48 8b 55 00 4d 89 e9 4d 89 f8 4c 89 f1 48 c7 c6 40 70 c9 83 48 c7 c7 a0 6e c9 83 e8 11 15 e9 ff <0f> 0b 5a e9 0e ff ff ff 48 89 44 24 38 e9 db fd ff ff be 08 00 00
All code
========
0: 3c 02 cmp $0x2,%al
2: 00 0f add %cl,(%rdi)
4: 85 61 03 test %esp,0x3(%rcx)
7: 00 00 add %al,(%rax)
9: 53 push %rbx
a: 48 8b 55 00 mov 0x0(%rbp),%rdx
e: 4d 89 e9 mov %r13,%r9
11: 4d 89 f8 mov %r15,%r8
14: 4c 89 f1 mov %r14,%rcx
17: 48 c7 c6 40 70 c9 83 mov $0xffffffff83c97040,%rsi
1e: 48 c7 c7 a0 6e c9 83 mov $0xffffffff83c96ea0,%rdi
25: e8 11 15 e9 ff callq 0xffffffffffe9153b
2a:* 0f 0b ud2 <-- trapping instruction
2c: 5a pop %rdx
2d: e9 0e ff ff ff jmpq 0xffffffffffffff40
32: 48 89 44 24 38 mov %rax,0x38(%rsp)
37: e9 db fd ff ff jmpq 0xfffffffffffffe17
3c: be .byte 0xbe
3d: 08 00 or %al,(%rax)
...

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 5a pop %rdx
3: e9 0e ff ff ff jmpq 0xffffffffffffff16
8: 48 89 44 24 38 mov %rax,0x38(%rsp)
d: e9 db fd ff ff jmpq 0xfffffffffffffded
12: be .byte 0xbe
13: 08 00 or %al,(%rax)
...
[ 184.232027][ T3695] RSP: 0000:ffffc9000b6ffd48 EFLAGS: 00010286
[ 184.237919][ T3695] RAX: 0000000000000000 RBX: ffffffff83c96de0 RCX: fffff520016dff6e
[ 184.245734][ T3695] RDX: 0000000000000004 RSI: 0000000000000008 RDI: ffff88878c7ac38c
[ 184.253611][ T3695] RBP: ffff888160af9d28 R08: ffffffff821fca01 R09: ffffc9000b6ffb6f
[ 184.261422][ T3695] R10: fffff520016dff6d R11: 0000000000000001 R12: 1ffff920016dffad
[ 184.269254][ T3695] R13: ffff8881b2dc8040 R14: ffff888160af9d28 R15: 0000000000000001
[ 184.277051][ T3695] FS: 00007f57e6448740(0000) GS:ffff88878c780000(0000) knlGS:0000000000000000
[ 184.285796][ T3695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 184.292208][ T3695] CR2: 00007f57bde00000 CR3: 0000000100e30004 CR4: 00000000003706e0
[ 184.300016][ T3695] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 184.307809][ T3695] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 184.315616][ T3695] Call Trace:
[ 184.318762][ T3695] <TASK>
[ 184.321563][ T3695] ? up_write (kernel/locking/rwsem.c:1339)
[ 184.325575][ T3695] ? lock_release (kernel/locking/lockdep.c:5693)
[ 184.330105][ T3695] handle_mm_fault (mm/memory.c:5276)
[ 184.334721][ T3695] ? lock_is_held_type (kernel/locking/lockdep.c:5410 kernel/locking/lockdep.c:5712)
[ 184.339595][ T3695] ? __handle_mm_fault (mm/memory.c:5201)
[ 184.344556][ T3695] ? find_vma (mm/mmap.c:1854)
[ 184.348651][ T3695] ? can_vma_merge_before+0x330/0x330
[ 184.354479][ T3695] do_user_addr_fault (arch/x86/mm/fault.c:1407)
[ 184.359351][ T3695] exc_page_fault (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 arch/x86/mm/fault.c:1506 arch/x86/mm/fault.c:1554)
[ 184.363707][ T3695] asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
[ 184.368408][ T3695] RIP: 0033:0x403b2e
[ 184.372145][ T3695] Code: 89 75 e0 48 89 55 d8 48 c7 45 f8 00 00 00 00 eb 2c 48 8b 55 e8 48 8b 45 f8 48 01 d0 0f b6 10 48 8b 4d e0 48 8b 45 f8 48 01 c8 <0f> b6 00 38 c2 74 07 b8 01 00 00 00 eb 14 48 83 45 f8 01 48 8b 45
All code
========
0: 89 75 e0 mov %esi,-0x20(%rbp)
3: 48 89 55 d8 mov %rdx,-0x28(%rbp)
7: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
e: 00
f: eb 2c jmp 0x3d
11: 48 8b 55 e8 mov -0x18(%rbp),%rdx
15: 48 8b 45 f8 mov -0x8(%rbp),%rax
19: 48 01 d0 add %rdx,%rax
1c: 0f b6 10 movzbl (%rax),%edx
1f: 48 8b 4d e0 mov -0x20(%rbp),%rcx
23: 48 8b 45 f8 mov -0x8(%rbp),%rax
27: 48 01 c8 add %rcx,%rax
2a:* 0f b6 00 movzbl (%rax),%eax <-- trapping instruction
2d: 38 c2 cmp %al,%dl
2f: 74 07 je 0x38
31: b8 01 00 00 00 mov $0x1,%eax
36: eb 14 jmp 0x4c
38: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
3d: 48 rex.W
3e: 8b .byte 0x8b
3f: 45 rex.RB

Code starting with the faulting instruction
===========================================
0: 0f b6 00 movzbl (%rax),%eax
3: 38 c2 cmp %al,%dl
5: 74 07 je 0xe
7: b8 01 00 00 00 mov $0x1,%eax
c: eb 14 jmp 0x22
e: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
13: 48 rex.W
14: 8b .byte 0x8b
15: 45 rex.RB
[ 184.391546][ T3695] RSP: 002b:00007ffe05336170 EFLAGS: 00010206
[ 184.397460][ T3695] RAX: 00007f57bde00000 RBX: 00007ffe05336340 RCX: 00007f57bde00000
[ 184.405251][ T3695] RDX: 00000000000000ff RSI: 00007f57bde00000 RDI: 00007f57ce000000
[ 184.413057][ T3695] RBP: 00007ffe05336170 R08: 00000000ffffffff R09: 0000000000000000
[ 184.420863][ T3695] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
[ 184.428670][ T3695] R13: 00007ffe05336560 R14: 000000000040be00 R15: 00007f57e6668020
[ 184.436482][ T3695] </TASK>
[ 184.439365][ T3695] irq event stamp: 368859
[ 184.443530][ T3695] hardirqs last enabled at (368859): finish_task_switch+0x21c/0x910
[ 184.453841][ T3695] hardirqs last disabled at (368858): __schedule (kernel/sched/core.c:6521 (discriminator 1))
[ 184.462946][ T3695] softirqs last enabled at (368830): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:415 kernel/softirq.c:600)
[ 184.472118][ T3695] softirqs last disabled at (368825): __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650)
[ 184.481495][ T3695] ---[ end trace 0000000000000000 ]---
[ 184.486819][ T3695] VM_SHARED_PT vma with NULL ptshare_data
[ 184.486823][ T3695] CPU: 3 PID: 3695 Comm: userfaultfd Tainted: G W 6.3.0-rc1-00015-ga2eef9e49f57 #1
[ 184.502865][ T3695] Hardware name: Dell Inc. OptiPlex 7050/062KRH, BIOS 1.2.0 12/22/2016


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



Attachments:
(No filename) (14.03 kB)
config-6.3.0-rc1-00015-ga2eef9e49f57 (164.02 kB)
job-script (6.10 kB)
dmesg.xz (52.23 kB)
kernel-selftests (36.65 kB)
job.yaml (5.15 kB)
reproduce (283.00 B)

2023-06-12 16:39:11

by Peter Xu

Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

On Wed, Apr 26, 2023 at 10:49:47AM -0600, Khalid Aziz wrote:
> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.

Since hugetlb has this, it'll be very helpful if you can provide a
comparison between this approach and hugetlb's - especially on the
differences - and reasonings about those.

Merging anything like this should definitely also pave the way for hugetlb's
future, so it even seems to be a "requirement" of such a patchset, even
though implicitly.. considering the "hotness" that hugetlb recently has
on refactoring demand (if not a rewrite).

Some high level questions:

- Why mmap() flag?

For this one, I agree it should be opt-in - sharing pgtable definitely
means sharing of a lot of privileges operating on current mm, so one
should be aware and be prepared to be messed up.

IIUC hugetlb doesn't do this, but instead when something "racy" happens
it just unshares by default. To me opt-in makes more sense if starting
from zero, because I don't think a process wants to leak its mm to
others by default.

I think you mentioned allowing the pgtable to be shared even for mprotect()
from one MM, with all MMs then seeing the change; but if so then DONTNEED
should really do the same - when one MM DONTNEEDs it, it should go away
for all. It doesn't make a lot of sense to me to treat it differently
with a DONTNEED refcount anywhere..

- Can guest MM opt-out?

Should we allow a guest MM to opt out when it wants to? That sounds like
a good thing to have to me; especially for a file, it sounds as simple
as zapping the pgtable. But then the mmap flag will stop working IIUC,
so this goes back to that question (e.g. what about madvise or prctl?).

- Why mm_struct to represent shared pgtable?

IIUC hugetlb uses the pgtable page itself plus some refcounting (the
refcounting is racy with gup-fast, as Jann used to point out, but
that's another story..). My question is, do you think that won't work?
Are there reasons why not? Why is mm_struct the structure you chose
for representing a shared pgtable? Why per-file?

Thanks,

--
Peter Xu


2023-06-30 12:07:05

by Rongwei Wang

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

Hi Khalid

I see this patch was sent out in April, and I want to ask about the
status of this RFC now (IMHO, it seems the code still has some places to
fix). This feature is useful for saving a lot of memory on pgtables, so we
also want to use this optimization in our databases if upstream accepts it.

BTW, in the past few weeks I made some adjustments based on your code to
simplify it and meet the needs of our databases, e.g. multiple VMAs
sharing the same shadow mm, madvise, and memory compaction. If you are
interested, I can provide the detailed code.


Thanks,

-wrw

On 2023/4/27 00:49, Khalid Aziz wrote:
> [full quote of the cover letter snipped]

2023-07-31 05:27:08

by Rongwei Wang

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

Hi Matthew

May I ask you another question about mshare under this RFC? I remember
you said you would redesign mshare to be per-VMA rather than per-mapping
(apologies if I remember wrongly) in the last MM alignment session. I
also referred to you when re-coding this part in our internal version
(based on this RFC). It seems that per-VMA can simplify the structure of
pgtable sharing and doesn't have to care about the different permissions
of a file mapping; these are the advantages (maybe) that I can imagine.
But IMHO, that doesn't seem a strong enough reason to switch from
per-mapping to per-VMA.

And I can't imagine the other considerations upstream has. Can you share
the reason for redesigning it in a per-VMA way - is it for integration
with hugetlbfs pgtable sharing or anonymous page sharing?

Thanks for your time.

On 2023/4/27 00:49, Khalid Aziz wrote:
> [full quote of the cover letter snipped]

2023-07-31 13:05:24

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
> Hi Matthew
>
> May I ask you another question about mshare under this RFC? I remember you
> said you will redesign the mshare to per-vma not per-mapping (apologize if
> remember wrongly) in last time MM alignment session. And I also refer to you
> to re-code this part in our internal version (based on this RFC). It seems
> that per VMA will can simplify the structure of pgtable sharing, even
> doesn't care the different permission of file mapping. these are advantages
> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
> switch per-mapping to per-vma.
>
> And I can't imagine other considerations of upstream. Can you share the
> reason why redesigning in a per-vma way, due to integation with hugetlbfs
> pgtable sharing or anonymous page sharing?

It was David who wants to make page table sharing be per-VMA. I think
he is advocating for the wrong approach. In any case, I don't have time
to work on mshare and Khalid is on leave until September, so I don't
think anybody is actively working on mshare.

> Thanks for your time.
>
> On 2023/4/27 00:49, Khalid Aziz wrote:
> > [full quote of the cover letter snipped]

2023-07-31 13:26:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

On 31.07.23 14:25, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>> Hi Matthew
>>
>> May I ask you another question about mshare under this RFC? I remember you
>> said you will redesign the mshare to per-vma not per-mapping (apologize if
>> remember wrongly) in last time MM alignment session. And I also refer to you
>> to re-code this part in our internal version (based on this RFC). It seems
>> that per VMA will can simplify the structure of pgtable sharing, even
>> doesn't care the different permission of file mapping. these are advantages
>> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
>> switch per-mapping to per-vma.
>>
>> And I can't imagine other considerations of upstream. Can you share the
>> reason why redesigning in a per-vma way, due to integation with hugetlbfs
>> pgtable sharing or anonymous page sharing?
>
> It was David who wants to make page table sharing be per-VMA. I think
> he is advocating for the wrong approach. In any case, I don't have time
> to work on mshare and Khalid is on leave until September, so I don't
> think anybody is actively working on mshare.

Not that I have any time to look into this either, but my comment
essentially was that we should try decoupling page table sharing (reduced
memory consumption, shorter rmap walks) from the mprotect(PROT_READ) use
case.

For page table sharing I was wondering whether there could be ways to
just have that done semi-automatically. Similar to how it's done for
hugetlb. There are some clear limitations: mappings < PMD_SIZE won't be
able to benefit.

It's still unclear whether that is a real limitation. Some use cases
were raised (put all user space library mappings into a shared area),
but I realized that these conflict with MAP_PRIVATE requirements of such
areas. Maybe I'm wrong and this is easily resolved.

At least it's not the primary use case that was raised. For the primary
use cases (VMs, databases) that map huge areas, it might not be a
limitation.


Regarding mprotect(PROT_READ), my point was that mprotect() is most
probably the wrong tool to use (especially, due to signal handling).
Instead, I was suggesting having a way to essentially protect pages in a
shmem file -- and get notified whenever something wants to write to such a page
either via the page tables or via write() and friends. We do have the
write-notify infrastructure for filesystems in place that we might
extend/reuse. That mechanism could benefit from shared page tables by
having to do less rmap walks.

Again, I don't have time to look into that (just like everybody else as
it appears) and might miss something important. Just sharing my thoughts
that I raised in the call.

--
Cheers,

David / dhildenb


2023-07-31 16:44:18

by Rongwei Wang

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)


On 2023/7/31 20:50, David Hildenbrand wrote:
> On 31.07.23 14:25, Matthew Wilcox wrote:
>> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>>> Hi Matthew
>>>
>>> May I ask you another question about mshare under this RFC? I
>>> remember you
>>> said you will redesign the mshare to per-vma not per-mapping
>>> (apologize if
>>> remember wrongly) in last time MM alignment session. And I also
>>> refer to you
>>> to re-code this part in our internal version (based on this RFC). It
>>> seems
>>> that per VMA will can simplify the structure of pgtable sharing, even
>>> doesn't care the different permission of file mapping. these are
>>> advantages
>>> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
>>> switch per-mapping to per-vma.
>>>
>>> And I can't imagine other considerations of upstream. Can you share the
>>> reason why redesigning in a per-vma way, due to integation with
>>> hugetlbfs
>>> pgtable sharing or anonymous page sharing?
>>
>> It was David who wants to make page table sharing be per-VMA.  I think
>> he is advocating for the wrong approach.  In any case, I don't have time
>> to work on mshare and Khalid is on leave until September, so I don't
>> think anybody is actively working on mshare.
>
> Not that I also don't have any time to look into this, but my comment
> essentially was that we should try decoupling page table sharing
> (reduce memory consumption, shorter rmap walk) from the
> mprotect(PROT_READ) use case.

Hi David, Matthew

Thanks for your reply.

Uh, sorry, I can't see the relation between decoupling page table
sharing and the per-VMA design. And I think mprotect(PROT_READ) has to
modify all the shared page tables of the related tasks. It seems that I
am missing something about per-VMA in your words.

BTW, I can imagine a corner case that shows a (possible) defect of
per-mapping. Say we create a range of page table sharing via
memfd_create(), and a child also owns this range of page table sharing.
This child process cannot create page table sharing based on the same fd
after it munmap()s the range (same mapping but a different VMA area).
Per-VMA would be the better choice here, since it could continue to
create page table sharing based on the original fd, because a new mm
struct is created that way. I guess that is a type of the decoupling you
mentioned? A rough sketch of the sequence follows below.

It's just a corner case. I am not sure how important it is.
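
Roughly, the sequence I have in mind is (MAP_SHARED_PT and the fixed
address follow this RFC's interface; the flag value is a placeholder and
error handling is omitted):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MAP_SHARED_PT
    #define MAP_SHARED_PT  0x400000        /* placeholder for the RFC's mmap flag */
    #endif

    #define PMD_SIZE       (2UL << 20)

    int main(void)
    {
            void *addr = (void *)0x600000000000UL;
            size_t len = 4 * PMD_SIZE;
            int fd = memfd_create("shared", 0);

            ftruncate(fd, len);
            /* parent creates the page-table-shared range */
            mmap(addr, len, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);

            if (fork() == 0) {
                    /* child inherits the shared range, then drops it */
                    munmap(addr, len);
                    /*
                     * The concern: with a per-mapping design this second mmap of
                     * the same fd cannot rejoin the shared page tables, while a
                     * per-VMA design could still set the sharing up again.
                     */
                    mmap(addr, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
            }
            return 0;
    }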

>
> For page table sharing I was wondering whether there could be ways to
> just have that done semi-automatically. Similar to how it's done for
> hugetlb. There are some clear limitations: mappings < PMD_SIZE won't
> be able to benefit.
>
> It's still unclear whether that is a real limitation. Some use cases
> were raised (put all user space library mappings into a shared area),
> but I realized that these conflict with MAP_PRIVATE requirements of
> such areas. Maybe I'm wrong and this is easily resolved.
>
> At least it's not the primary use case that was raised. For the
> primary use cases (VMs, databases) that map huge areas, it might not
> be a limitation.
>
>
> Regarding mprotect(PROT_READ), my point was that mprotect() is most
> probably the wrong tool to use (especially, due to signal handling).
> Instead, I was suggesting having a way to essentially protect pages in
> a shmem file -- and get notified whenever wants to write to such a
> page either via the page tables or via write() and friends. We do have
> the write-notify infrastructure for filesystems in place that we might
> extend/reuse.
I am not very familiar with filesystems. The write-notify idea sounds
good. Maybe I need some time to digest this.
> That mechanism could benefit from shared page tables by having to do
> less rmap walks.
>
> Again, I don't have time to look into that (just like everybody else
> as it appears) and might miss something important. Just sharing my
> thoughts that I raised in the call.

Your words are very helpful to me. I will try to design our internal
version of this feature the right way.

Thanks again.


2023-07-31 17:07:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

On 31.07.23 18:19, Rongwei Wang wrote:
>
> On 2023/7/31 20:50, David Hildenbrand wrote:
>> On 31.07.23 14:25, Matthew Wilcox wrote:
>>> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>>>> Hi Matthew
>>>>
>>>> May I ask you another question about mshare under this RFC? I
>>>> remember you
>>>> said you will redesign the mshare to per-vma not per-mapping
>>>> (apologize if
>>>> remember wrongly) in last time MM alignment session. And I also
>>>> refer to you
>>>> to re-code this part in our internal version (based on this RFC). It
>>>> seems
>>>> that per VMA will can simplify the structure of pgtable sharing, even
>>>> doesn't care the different permission of file mapping. these are
>>>> advantages
>>>> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
>>>> switch per-mapping to per-vma.
>>>>
>>>> And I can't imagine other considerations of upstream. Can you share the
>>>> reason why redesigning in a per-vma way, due to integation with
>>>> hugetlbfs
>>>> pgtable sharing or anonymous page sharing?
>>>
>>> It was David who wants to make page table sharing be per-VMA.  I think
>>> he is advocating for the wrong approach.  In any case, I don't have time
>>> to work on mshare and Khalid is on leave until September, so I don't
>>> think anybody is actively working on mshare.
>>
>> Not that I also don't have any time to look into this, but my comment
>> essentially was that we should try decoupling page table sharing
>> (reduce memory consumption, shorter rmap walk) from the
>> mprotect(PROT_READ) use case.
>
> Hi David, Matthew
>
> Thanks for your reply.
>
> Uh, sorry, I can't imagine the relative between decouping page table
> sharing with per-VMA design. And I think mprotect(PROT_READ) has to
> modify all sharing page tables of related tasks. It seems that I miss
> something about per-VMA from your words.

Assume we do the page table sharing at mmap time, if the flags are
right. Let's focus on the most common case:

mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)

And doing the same in each and every process.


The original design of doing an mprotect(PROT_READ) in each and
every process is just absolutely inefficient as a way to protect a memfd page.

For that case, my thought was that you actually want to write-protect
the pages on the memfd level.

So instead of doing mprotect(PROT_READ) in 999 processes, or doing
mprotect(PROT_READ) on mshare(), you have a memfd feature to protect pages
from any write access -- not using virtual addresses but using an offset
in the memfd.


Assume such a (badly imagined) memfd_protect(PROT_READ) would make sure
that:
(1) Any page table mappings of the page are write-protected and
(2) Any write access using the page table mappings triggers write-notify and
(3) Any other access -- e.g., write() -- similarly informs memfd.

Without page table sharing, (1) would have to walk all mappings via the
rmap. With page table sharing, it would only have to walk one page table.

But the features would be two separate things.

What memfd would do with that write notification (inject a signal,
something like uffd) would be a different story.
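
As a strawman only -- memfd_protect() below is a made-up name, nothing
like it exists -- the difference from the mprotect() approach could be
sketched like this:

    #include <errno.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    /*
     * Hypothetical interface: write-protect a byte range of a memfd by file
     * offset, so one call covers every mapping of the file instead of an
     * mprotect(PROT_READ) per process.  Stubbed out here since it does not
     * exist; it only illustrates the shape of the idea.
     */
    static int memfd_protect(int memfd, off_t offset, size_t len, int prot)
    {
            (void)memfd; (void)offset; (void)len; (void)prot;
            errno = ENOSYS;
            return -1;
    }

    int protect_whole_file(int memfd, size_t size)
    {
            /*
             * One call against the file; writes through any page table
             * mapping, or via write(), would then hit the write-notify
             * path described above.
             */
            return memfd_protect(memfd, 0, size, PROT_READ);
    }

Whether this would be a syscall, an fcntl(), or something else entirely is
left open here.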


Again, just an idea and maybe complete garbage.

--
Cheers,

David / dhildenb