Due to the rebase onto the latest rc6, the major pte copy patch changed quite a
bit, so a detailed changelog would probably not be very useful any more. However,
all the comments discussed in the previous thread should have been addressed.
Please shout if I missed anything important.
This series is majorly inspired by the previous discussion on the list [1],
starting from the report from Jason on the rdma test failure. Linus proposed
the solution, which seems to be a very nice approach to avoid breaking
userspace apps that didn't use MADV_DONTFORK properly before. More information
can be found in that thread too.
I tested it myself with fork() after vfio-pinning a bunch of device pages, and
I verified that the new pte copy logic works as expected, at least in the most
general path. However, I didn't test the THP case yet because, afaict, vfio
does not support THP-backed DMA pages. Luckily, the pmd/pud THP patch is much
more straightforward than the pte one, so hopefully it can be verified directly
by code review plus some heavier-weight rdma tests.
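For illustration only, here is a minimal (hypothetical) sketch of the userspace
pattern this series tries to keep working; the device-registration step that
ends up calling pin_user_pages() in the kernel is driver specific and left as a
placeholder:

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *dma_buf_then_fork(size_t len)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* register_buf_with_device(buf, len);  <- driver pins the pages */

        /*
         * Previously the app had to add:
         *      madvise(buf, len, MADV_DONTFORK);
         * so that fork() would not wrprotect (and later COW) the pinned
         * pages away from the device.  With this series the parent keeps
         * the original (pinned) pages even without the madvise().
         */
        if (fork() == 0)
                _exit(0);       /* child does its own thing */

        /* parent keeps doing DMA to/from buf */
        return buf;
}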
Patch 1: Introduce mm.has_pinned
Patch 2: Preparation patch
Patch 3: Early cow solution for pte copy for pinned pages
Patch 4: Same as above, but for thp (pmd/pud).
The hugetlbfs fix is still missing, but as planned that's not urgent, so we can
work on it on top. Comments are greatly welcomed.
[1] https://lore.kernel.org/lkml/[email protected]/
Thanks.
Peter Xu (4):
mm: Introduce mm_struct.has_pinned
mm/fork: Pass new vma pointer into copy_page_range()
mm: Do early cow for pinned pages during fork() for ptes
mm/thp: Split huge pmds/puds if they're pinned when fork()
include/linux/mm.h | 2 +-
include/linux/mm_types.h | 10 +++
kernel/fork.c | 3 +-
mm/gup.c | 6 ++
mm/huge_memory.c | 28 ++++++
mm/memory.c | 186 ++++++++++++++++++++++++++++++++++-----
6 files changed, 212 insertions(+), 23 deletions(-)
--
2.26.2
(Commit message mostly collected from Jason Gunthorpe)
Reduce the chance of false positives from page_maybe_dma_pinned() by keeping
track of whether the mm_struct has ever been used with pin_user_pages(). This
allows cases that might drive up the page ref_count to avoid any penalty from
handling dma_pinned pages.
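For context, page_maybe_dma_pinned() is only a heuristic based on the refcount
bias that pin_user_pages() applies; roughly, and simplified from
include/linux/mm.h around this kernel version (details may differ):

/* Not part of this patch: a simplified sketch of the existing check. */
static inline bool page_maybe_dma_pinned(struct page *page)
{
        if (hpage_pincount_available(page))
                return compound_pincount(page) > 0;

        /*
         * Each pin adds GUP_PIN_COUNTING_BIAS (1024) to the refcount, so
         * any page at or above the bias "may be" pinned -- which is also
         * why heavily referenced pages can false-positive.
         */
        return page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;
}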
Future work is planned to provide a more sophisticated solution, likely
turning this into a real counter. For now, make it an atomic_t but use it as a
boolean for simplicity.
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm_types.h | 10 ++++++++++
kernel/fork.c | 1 +
mm/gup.c | 6 ++++++
3 files changed, 17 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..ed028af3cb19 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -436,6 +436,16 @@ struct mm_struct {
*/
atomic_t mm_count;
+ /**
+ * @has_pinned: Whether this mm has pinned any pages. This can
+ * be either replaced in the future by @pinned_vm when it
+ * becomes stable, or grow into a counter on its own. We're
+ * aggressive on this bit now - even if the pinned pages were
+ * unpinned later on, we'll still keep this bit set for the
+ * lifecycle of this mm just for simplicity.
+ */
+ atomic_t has_pinned;
+
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 49677d668de4..e65d8192d080 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1011,6 +1011,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
mm->locked_vm = 0;
+ atomic_set(&mm->has_pinned, 0);
atomic64_set(&mm->pinned_vm, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..238667445337 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1255,6 +1255,9 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
BUG_ON(*locked != 1);
}
+ if (flags & FOLL_PIN)
+ atomic_set(&current->mm->has_pinned, 1);
+
/*
* FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
* is to set FOLL_GET if the caller wants pages[] filled in (but has
@@ -2660,6 +2663,9 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
FOLL_FAST_ONLY)))
return -EINVAL;
+ if (gup_flags & FOLL_PIN)
+ atomic_set(&current->mm->has_pinned, 1);
+
if (!(gup_flags & FOLL_FAST_ONLY))
might_lock_read(&current->mm->mmap_lock);
--
2.26.2
This allows copy_pte_range() to do early cow if the pages were pinned on the
source mm. Currently we don't have an accurate way to know whether a page is
pinned or not; the only thing we have is page_maybe_dma_pinned(). However,
that's good enough for now, especially together with the newly added
mm->has_pinned flag, which makes sure we won't affect processes that never
pinned any pages.
It would be easier if we could do the GFP_KERNEL allocation within
copy_one_pte(). Unluckily, we can't, because we hold the page table locks of
both the parent and the child processes there. So the page allocation needs to
be done outside copy_one_pte().
There is some trickery in copy_present_pte(), mainly the wrprotect trick to
block concurrent fast-gup. The comments in the function should explain it
better in place.
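For reference, the other half of that serialization lives in the fast-gup path,
which takes the pin first and only then re-checks the pte; roughly, simplified
from gup_pte_range() in mm/gup.c (not part of this patch):

        head = try_grab_compound_head(page, 1, flags);
        if (!head)
                goto pte_unmap;

        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                /*
                 * The pte changed under us (e.g. it was wrprotected by a
                 * concurrent fork()): drop the pin and bail out to the
                 * slow gup path.
                 */
                put_compound_head(head, 1, flags);
                goto pte_unmap;
        }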
Oleg Nesterov reported a (probably harmless) bug during review: we didn't
reset entry.val properly in copy_pte_range(), so there was a chance of calling
add_swap_count_continuation() multiple times on the same swp entry. That should
be harmless, though, since even if it happens, the same function
(add_swap_count_continuation()) will return directly after noticing that
there's already enough space for the swp counter. So instead of a standalone
stable patch, it is fixed up in this patch directly.
Reference discussions:
https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 156 insertions(+), 16 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 4c56d7b92b0e..92ad08616e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,15 +773,109 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return 0;
}
-static inline void
+/*
+ * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
+ * is required to copy this pte.
+ */
+static inline int
copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
- unsigned long addr, int *rss)
+ struct vm_area_struct *new,
+ unsigned long addr, int *rss, struct page **prealloc)
{
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ page = vm_normal_page(vma, addr, pte);
+ if (page) {
+ if (is_cow_mapping(vm_flags)) {
+ bool is_write = pte_write(pte);
+
+ /*
+ * The trick starts.
+ *
+ * What we want to do is to check whether this page may
+ * have been pinned by the parent process. If so,
+ * instead of wrprotect the pte on both sides, we copy
+ * the page immediately so that we'll always guarantee
+ * the pinned page won't be randomly replaced in the
+ * future.
+ *
+ * To achieve this, we do the following:
+ *
+ * 1. Write-protect the pte if it's writable. This is
+ * to protect concurrent write fast-gup with
+ * FOLL_PIN, so that we'll fail the fast-gup with
+ * the write bit removed.
+ *
+ * 2. Check page_maybe_dma_pinned() to see whether this
+ * page may have been pinned.
+ *
+ * The order of these steps is important to serialize
+ * against the fast-gup code (gup_pte_range()) on the
+ * pte check and try_grab_compound_head(), so that
+ * we'll make sure either we'll capture that fast-gup
+ * so we'll copy the pinned page here, or we'll fail
+ * that fast-gup.
+ */
+ if (is_write) {
+ ptep_set_wrprotect(src_mm, addr, src_pte);
+ /*
+ * This is not needed for serializing fast-gup,
+ * however always make it consistent with
+ * src_pte, since we'll need it when current
+ * page is not pinned.
+ */
+ pte = pte_wrprotect(pte);
+ }
+
+ if (atomic_read(&src_mm->has_pinned) &&
+ page_maybe_dma_pinned(page)) {
+ struct page *new_page = *prealloc;
+
+ /*
+ * This is possibly pinned page, need to copy.
+ * Safe to release the write bit if necessary.
+ */
+ if (is_write)
+ set_pte_at(src_mm, addr, src_pte,
+ pte_mkwrite(pte));
+
+ /* If we don't have a pre-allocated page, ask */
+ if (!new_page)
+ return -EAGAIN;
+
+ /*
+ * We have a prealloc page, all good! Take it
+ * over and copy the page & arm it.
+ */
+ *prealloc = NULL;
+ copy_user_highpage(new_page, page, addr, vma);
+ __SetPageUptodate(new_page);
+ pte = mk_pte(new_page, new->vm_page_prot);
+ pte = pte_sw_mkyoung(pte);
+ pte = maybe_mkwrite(pte_mkdirty(pte), new);
+ page_add_new_anon_rmap(new_page, new, addr, false);
+ rss[mm_counter(new_page)]++;
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
+ }
+
+ /*
+ * Logically we should recover the wrprotect() for
+ * fast-gup, however when we reach here it also means we
+ * actually need to wrprotect() it again for cow.
+ * Simply keep everything. Note that there's another
+ * chunk of cow logic below, but we still need
+ * that for the !page case.
+ */
+ }
+ get_page(page);
+ page_dup_rmap(page, false);
+ rss[mm_counter(page)]++;
+ }
+
/*
* If it's a COW mapping, write protect it both
* in the parent and the child
@@ -807,14 +901,27 @@ copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (!(vm_flags & VM_UFFD_WP))
pte = pte_clear_uffd_wp(pte);
- page = vm_normal_page(vma, addr, pte);
- if (page) {
- get_page(page);
- page_dup_rmap(page, false);
- rss[mm_counter(page)]++;
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
+}
+
+static inline struct page *
+page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct page *new_page;
+
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+ if (!new_page)
+ return NULL;
+
+ if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) {
+ put_page(new_page);
+ return NULL;
}
+ cgroup_throttle_swaprate(new_page, GFP_KERNEL);
- set_pte_at(dst_mm, addr, dst_pte, pte);
+ return new_page;
}
static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -825,16 +932,20 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
- int progress = 0;
+ int progress, ret = 0;
int rss[NR_MM_COUNTERS];
swp_entry_t entry = (swp_entry_t){0};
+ struct page *prealloc = NULL;
again:
+ progress = 0;
init_rss_vec(rss);
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
- if (!dst_pte)
- return -ENOMEM;
+ if (!dst_pte) {
+ ret = -ENOMEM;
+ goto out;
+ }
src_pte = pte_offset_map(src_pmd, addr);
src_ptl = pte_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -866,8 +977,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
progress += 8;
continue;
}
- copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
- vma, addr, rss);
+ /* copy_present_pte() will clear `*prealloc' if consumed */
+ ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
+ vma, new, addr, rss, &prealloc);
+ /*
+ * If we need a pre-allocated page for this pte, drop the
+ * locks, allocate, and try again.
+ */
+ if (unlikely(ret == -EAGAIN))
+ break;
+ if (unlikely(prealloc)) {
+ /*
+ * pre-alloc page cannot be reused by next time so as
+ * to strictly follow mempolicy (e.g., alloc_page_vma()
+ * will allocate page according to address). This
+ * could only happen if one pinned pte changed.
+ */
+ put_page(prealloc);
+ prealloc = NULL;
+ }
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
@@ -879,13 +1007,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
cond_resched();
if (entry.val) {
- if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+ if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ entry.val = 0;
+ } else if (ret) {
+ WARN_ON_ONCE(ret != -EAGAIN);
+ prealloc = page_copy_prealloc(src_mm, vma, addr);
+ if (!prealloc)
return -ENOMEM;
- progress = 0;
+ /* We've captured and resolved the error. Reset, try again. */
+ ret = 0;
}
if (addr != end)
goto again;
- return 0;
+out:
+ if (unlikely(prealloc))
+ put_page(prealloc);
+ return ret;
}
static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
--
2.26.2
Pinned pages shouldn't be write-protected when fork() happens, because
follow-up copy-on-write on these pages could cause the pinned pages to be
replaced by random newly allocated pages.
For huge PMDs, we split the huge pmd if pinning is detected, so that future
handling will be done at the PTE level (with our latest changes, each of the
small pages will be copied). We achieve this by letting copy_huge_pmd() return
-EAGAIN for pinned pages, so that we'll fall through in copy_pmd_range() and
finally land in the next copy_pte_range() call, as shown in the sketch below.
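For reference, the existing logic in copy_pmd_range() that we rely on looks
roughly like this (simplified from mm/memory.c, not changed by this patch):

        if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
                        || pmd_devmap(*src_pmd)) {
                int err;

                VM_BUG_ON_VMA(next - addr != HPAGE_PMD_SIZE, vma);
                err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
                                    addr, vma);
                if (err == -ENOMEM)
                        return -ENOMEM;
                if (!err)
                        continue;
                /* fall through: the huge pmd was split, copy pte by pte */
        }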
Huge PUDs are even more special: so far we don't support anonymous pages there
at all. But they can be handled the same way as huge PMDs, even though
splitting a huge PUD means erasing the PUD entries. That guarantees that the
follow-up fault-ins will remap the same pages in either the parent or the
child later on.
This might not be the most efficient way, but it should be easy and clean
enough. It should be fine, since we're tackling a very rare case, just to make
sure userspace that pinned some THPs will still work even without
MADV_DONTFORK after fork().
Signed-off-by: Peter Xu <[email protected]>
---
mm/huge_memory.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faadc449cca5..da397779a6d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1074,6 +1074,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+
+ /*
+ * If this page is a potentially pinned page, split and retry the fault
+ * with smaller page size. Normally this should not happen because the
+ * userspace should use MADV_DONTFORK upon pinned regions. This is a
+ * best effort that the pinned pages won't be replaced by another
+ * random page during the coming copy-on-write.
+ */
+ if (unlikely(is_cow_mapping(vma->vm_flags) &&
+ atomic_read(&src_mm->has_pinned) &&
+ page_maybe_dma_pinned(src_page))) {
+ pte_free(dst_mm, pgtable);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ __split_huge_pmd(vma, src_pmd, addr, false, NULL);
+ return -EAGAIN;
+ }
+
get_page(src_page);
page_dup_rmap(src_page, true);
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1177,6 +1195,16 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
/* No huge zero pud yet */
}
+ /* Please refer to comments in copy_huge_pmd() */
+ if (unlikely(is_cow_mapping(vma->vm_flags) &&
+ atomic_read(&src_mm->has_pinned) &&
+ page_maybe_dma_pinned(pud_page(pud)))) {
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ __split_huge_pud(vma, src_pud, addr);
+ return -EAGAIN;
+ }
+
pudp_set_wrprotect(src_mm, addr, src_pud);
pud = pud_mkold(pud_wrprotect(pud));
set_pud_at(dst_mm, addr, dst_pud, pud);
--
2.26.2
On Fri, Sep 25, 2020 at 06:25:59PM -0400, Peter Xu wrote:
> -static inline void
> +/*
> + * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> + * is required to copy this pte.
> + */
> +static inline int
> copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
> - unsigned long addr, int *rss)
> + struct vm_area_struct *new,
> + unsigned long addr, int *rss, struct page **prealloc)
> {
> unsigned long vm_flags = vma->vm_flags;
> pte_t pte = *src_pte;
> struct page *page;
>
> + page = vm_normal_page(vma, addr, pte);
> + if (page) {
> + if (is_cow_mapping(vm_flags)) {
> + bool is_write = pte_write(pte);
Very minor, but I liked the readability of putting this chunk in a
function 'copy_normal_page' with the src/dst naming
> +
> + /*
> + * We have a prealloc page, all good! Take it
> + * over and copy the page & arm it.
> + */
> + *prealloc = NULL;
> + copy_user_highpage(new_page, page, addr, vma);
> + __SetPageUptodate(new_page);
> + pte = mk_pte(new_page, new->vm_page_prot);
> + pte = pte_sw_mkyoung(pte);
Linus's version doesn't do pte_sw_mkyoung(), but looks OK to have it
> + pte = maybe_mkwrite(pte_mkdirty(pte), new);
maybe_mkwrite() was not in Linus's version, but is in
wp_page_copy(). It seemed like mk_pte() should set the proper write
bit already from the vm_page_prot? Perhaps this is harmless but
redundant?
> + page_add_new_anon_rmap(new_page, new, addr, false);
> + rss[mm_counter(new_page)]++;
> + set_pte_at(dst_mm, addr, dst_pte, pte);
Linus's patch had a lru_cache_add_inactive_or_unevictable() here, like
wp_page_copy()
Didn't think of anything profound to say, looks good, thanks!
I'll forward this for testing as well; there are some holidays next
week, so I may have been optimistic to think we'd have results by Monday.
Jason
On Sat, Sep 26, 2020 at 4:23 PM Jason Gunthorpe <[email protected]> wrote:
>
> Linus's version doesn't do pte_sw_mkyoung(), but looks OK to have it
I don't think it matters. But I don't think it should make it young,
since there's no access, but it's not like it's a big deal.
> > + pte = maybe_mkwrite(pte_mkdirty(pte), new);
>
> maybe_mkwrite() was not in Linus's version, but is in wp_page_copy().
Actually, it is in my version too, just in a different form.
I did it using
if (vma->vm_flags & VM_WRITE)
*src_pte = pte_mkwrite(*src_pte);
instead, ie avoiding the write to src_pte if it wasn't a writable vma
(and I had checked that it was dirty and not done this at all if not,
so no need for the mkdirty).
> It seemed like mk_pte() should set the proper write
> bit already from the vm_page_prot?
No, vm_page_prot won't have the writable bit for a COW mapping.
The write bit gets set when the write happens (which may be on the
first access, of course), by the code that makes sure it's a private
copy.
> Perhaps this is harmless but redundant?
No, the pte_mkwrite() is required in some form, whether it's that
"maybe_mkwrite()" that looks at the vm flags, or in that explicit
form.
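(For reference, maybe_mkwrite() is essentially that same VM_WRITE check
wrapped in a helper; roughly, from include/linux/mm.h:)

static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
        if (likely(vma->vm_flags & VM_WRITE))
                pte = pte_mkwrite(pte);
        return pte;
}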
> > + page_add_new_anon_rmap(new_page, new, addr, false);
> > + rss[mm_counter(new_page)]++;
> > + set_pte_at(dst_mm, addr, dst_pte, pte);
>
> Linus's patch had a lru_cache_add_inactive_or_unevictable() here, like
> wp_page_copy()
Yeah, I do think that is needed so that we have the new page on the
LRU and it gets properly evicted under memory pressure.
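(For reference, the pairing as wp_page_copy() does it, roughly, which is what
is being suggested for the fork path as well:)

        page_add_new_anon_rmap(new_page, vma, vmf->address, false);
        lru_cache_add_inactive_or_unevictable(new_page, vma);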
Linus
On Fri, Sep 25, 2020 at 3:26 PM Peter Xu <[email protected]> wrote:
>
> This series is majorly inspired by the previous discussion on the list [1],
> starting from the report from Jason on the rdma test failure.
Ok, this is now in my git tree with the changes I outlined in the other email.
> I tested it myself with fork() after vfio pinning a bunch of device pages,
.. but _my_ only testing was to just add a nasty hack that said that
all pages are pinned, and made fork() much slower, but hey, it at
least tests the preallocation paths etc. And I'm not seeing any
obvious failures due to taking that slow-path that is supposed to be a
special case.
Let's hope this closes the rdma issues.
Linus
On Sun, Sep 27, 2020 at 12:35:59PM -0700, Linus Torvalds wrote:
> On Fri, Sep 25, 2020 at 3:26 PM Peter Xu <[email protected]> wrote:
> >
> > This series is majorly inspired by the previous discussion on the list [1],
> > starting from the report from Jason on the rdma test failure.
>
> Ok, this is now in my git tree with the changes I outlined in the other email.
>
> > I tested it myself with fork() after vfio pinning a bunch of device pages,
>
> .. but _my_ only testing was to just add a nasty hack that said that
> all pages are pinned, and made fork() much slower, but hey, it at
> least tests the preallocation paths etc. And I'm not seeing any
> obvious failures due to taking that slow-path that is supposed to be a
> special case.
>
> Let's hope this closes the rdma issues.
Hi Linus,
We tested your tree up to commit "fb0155a09b02 Merge tag 'nfs-for-5.9-3' of
git://git.linux-nfs.org/projects/trondmy/linux-nfs" and our RDMA tests passed.
Thanks
>
> Linus