2020-09-21 21:21:55

by Peter Xu

Subject: [PATCH 0/5] mm: Break COW for pinned pages during fork()

Finally I'm starting to post formal patches, because the series is growing.
Also, since we've discussed quite a few issues already, I feel it's clearer
now what we need to do, and how.

This series is largely inspired by the previous discussion on the list [1],
starting from Jason's report of the rdma test failure. Linus proposed the
solution, which seems to be a very nice approach to avoid breaking userspace
apps that didn't use MADV_DONTFORK properly before. More information can be
found in that thread.

I believe the initial plan was to consider merging something like this for
rc7/rc8. However now I'm not sure, because the code change in
copy_pte_range() is probably bigger than expected, so it may carry some risk.
I'll leave this question to the reviewers...

I tested it myself with fork() after vfio pinned a bunch of device pages, and
verified that the new copy-pte logic worked as expected, at least on the most
general path. However I didn't test the THP case yet, because afaict vfio
does not support THP-backed DMA pages. Luckily, the pmd/pud THP patch is much
more straightforward than the pte one, so hopefully it can be verified
directly by code review plus some heavier-weight rdma tests.

Patch 1: Introduce mm.has_pinned (as a single patch, as suggested by Jason)
Patches 2-3: Slight rework of the copy_page_range() path as preparation
Patch 4: Early cow solution for pte copies of pinned pages
Patch 5: Same as above, but for thp (pmd/pud)

The hugetlbfs fix is still missing, but as planned that's not urgent, so we
can work on it later. Comments greatly welcomed.

Thanks.



Peter Xu (5):
mm: Introduce mm_struct.has_pinned
mm/fork: Pass new vma pointer into copy_page_range()
mm: Rework return value for copy_one_pte()
mm: Do early cow for pinned pages during fork() for ptes
mm/thp: Split huge pmds/puds if they're pinned when fork()

include/linux/mm.h | 2 +-
include/linux/mm_types.h | 10 ++
kernel/fork.c | 3 +-
mm/gup.c | 6 ++
mm/huge_memory.c | 26 +++++
mm/memory.c | 226 +++++++++++++++++++++++++++++++++++----
6 files changed, 248 insertions(+), 25 deletions(-)

--
2.26.2





2020-09-21 21:22:02

by Peter Xu

Subject: [PATCH 3/5] mm: Rework return value for copy_one_pte()

There's one special path for copy_one_pte() with swap entries, in which
add_swap_count_continuation(GFP_ATOMIC) might fail. In that case we'll return
the swp_entry_t so that the caller will release the locks and redo the same
thing with GFP_KERNEL.

It's confusing that copy_one_pte() must return a swp_entry_t (even when all
the ptes are non-swap entries). More importantly, we face another requirement
to extend this "we need to do something else, but without the locks held" case.

Rework the return value into something easier to understand, as defined in enum
copy_mm_ret. We'll pass the swp_entry_t back using the newly introduced union
copy_mm_data parameter.

Another trivial change is to move the reset of the "progress" counter into the
retry path, so that we'll reset it for other reasons too.

This should prepare us for adding new return codes very soon.

Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 42 +++++++++++++++++++++++++++++-------------
1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7525147908c4..1530bb1070f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -689,16 +689,24 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
#endif

+#define COPY_MM_DONE 0
+#define COPY_MM_SWAP_CONT 1
+
+struct copy_mm_data {
+ /* COPY_MM_SWAP_CONT */
+ swp_entry_t entry;
+};
+
/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
* covered by this vma.
*/

-static inline unsigned long
+static inline int
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
- unsigned long addr, int *rss)
+ unsigned long addr, int *rss, struct copy_mm_data *data)
{
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
@@ -709,8 +717,10 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pte_to_swp_entry(pte);

if (likely(!non_swap_entry(entry))) {
- if (swap_duplicate(entry) < 0)
- return entry.val;
+ if (swap_duplicate(entry) < 0) {
+ data->entry = entry;
+ return COPY_MM_SWAP_CONT;
+ }

/* make sure dst_mm is on swapoff's mmlist. */
if (unlikely(list_empty(&dst_mm->mmlist))) {
@@ -809,7 +819,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,

out_set_pte:
set_pte_at(dst_mm, addr, dst_pte, pte);
- return 0;
+ return COPY_MM_DONE;
}

static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -820,9 +830,9 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
- int progress = 0;
+ int progress, copy_ret = COPY_MM_DONE;
int rss[NR_MM_COUNTERS];
- swp_entry_t entry = (swp_entry_t){0};
+ struct copy_mm_data data;

again:
init_rss_vec(rss);
@@ -837,6 +847,7 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
orig_dst_pte = dst_pte;
arch_enter_lazy_mmu_mode();

+ progress = 0;
do {
/*
* We are holding two locks at this point - either of them
@@ -852,9 +863,9 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
progress++;
continue;
}
- entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
- vma, addr, rss);
- if (entry.val)
+ copy_ret = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+ vma, addr, rss, &data);
+ if (copy_ret != COPY_MM_DONE)
break;
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
@@ -866,13 +877,18 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();

- if (entry.val) {
- if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+ switch (copy_ret) {
+ case COPY_MM_SWAP_CONT:
+ if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
return -ENOMEM;
- progress = 0;
+ break;
+ default:
+ break;
}
+
if (addr != end)
goto again;
+
return 0;
}

--
2.26.2

2020-09-21 21:23:56

by Peter Xu

Subject: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

This patch is greatly inspired by the discussions on the list from Linus, Jason
Gunthorpe and others [1].

It allows copy_pte_range() to do early cow if the pages were pinned on the
source mm. Currently we don't have an accurate way to know whether a page is
pinned or not. The only thing we have is page_maybe_dma_pinned(). However
that's good enough for now. Especially, with the newly added mm->has_pinned
flag to make sure we won't affect processes that never pinned any pages.

It would be easier if we could do the GFP_KERNEL allocation within
copy_one_pte(). Unfortunately we can't, because we hold the page table locks
for both the parent and the child processes. So the page copy needs to be
done outside copy_one_pte().

The new COPY_MM_BREAK_COW return code is introduced for this: copy_one_pte()
returns it when it finds a pte that may need an early cow break.

page_duplicate() is used to handle the page copy in copy_pte_range(). Of
course we need to do that after releasing the locks.

The slightly tricky part is that page_duplicate() will fill in copy_mm_data
with the newly copied page, and we'll then need to re-install the pte with
the page table locks held again. That's done in pte_install_copied_page().

The whole procedure looks quite similar to wp_page_copy(), but it's simpler
because we know the page is special (pinned) and we know we don't need any
TLB flushes, because no one is referencing the new mm yet.

We still have to be very careful to keep track of the two pages (the old
source page and the newly allocated page) across all the lock taking and
releasing, and make sure neither of them gets lost.

[1] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 167 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1530bb1070f4..8f3521be80ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -691,12 +691,72 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

#define COPY_MM_DONE 0
#define COPY_MM_SWAP_CONT 1
+#define COPY_MM_BREAK_COW 2

struct copy_mm_data {
/* COPY_MM_SWAP_CONT */
swp_entry_t entry;
+ /* COPY_MM_BREAK_COW */
+ struct {
+ struct page *cow_old_page; /* Released by page_duplicate() */
+ struct page *cow_new_page; /* Released by page_release_cow() */
+ pte_t cow_oldpte;
+ };
};

+static inline void page_release_cow(struct copy_mm_data *data)
+{
+ /* The old page should only be released in page_duplicate() */
+ WARN_ON_ONCE(data->cow_old_page);
+
+ if (data->cow_new_page) {
+ put_page(data->cow_new_page);
+ data->cow_new_page = NULL;
+ }
+}
+
+/*
+ * Duplicate the page for this PTE. Returns zero if page copied (so we need to
+ * retry on the same PTE again to arm the copied page very soon), or negative
+ * if error happened. In all cases, the old page will be properly released.
+ */
+static int page_duplicate(struct mm_struct *src_mm, struct vm_area_struct *vma,
+ unsigned long address, struct copy_mm_data *data)
+{
+ struct page *new_page = NULL;
+ int ret;
+
+ /* This should have been set in copy_one_pte() when we reach here */
+ WARN_ON_ONCE(!data->cow_old_page);
+
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!new_page) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ copy_user_highpage(new_page, data->cow_old_page, address, vma);
+ ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);
+ if (ret) {
+ put_page(new_page);
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ cgroup_throttle_swaprate(new_page, GFP_KERNEL);
+ __SetPageUptodate(new_page);
+
+ /* So far so good; arm the new page for the next attempt */
+ data->cow_new_page = new_page;
+
+out:
+ /* Always release the old page */
+ put_page(data->cow_old_page);
+ data->cow_old_page = NULL;
+
+ return ret;
+}
+
/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
@@ -711,6 +771,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ bool wp;

/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -789,10 +850,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* If it's a COW mapping, write protect it both
* in the parent and the child
*/
- if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
- pte = pte_wrprotect(pte);
- }
+ wp = is_cow_mapping(vm_flags) && pte_write(pte);

/*
* If it's a shared mapping, mark it clean in
@@ -813,15 +871,80 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
+
+ /*
+ * If the page is pinned in source mm, do early cow right now
+ * so that the pinned page won't be replaced by another random
+ * page without being noticed after the fork().
+ *
+ * Note: there can be some very rare cases that we'll do
+ * unnecessary cow here, because page_maybe_dma_pinned() is
+ * sometimes bogus, and the has_pinned flag is currently aggressive
+ * too. However this should be good enough for us for now as
+ * long as we covered all the pinned pages. We can make this
+ * better in the future by providing an accurate accounting for
+ * pinned pages.
+ *
+ * Because we'll need to release the locks before doing cow,
+ * pass this work to upper layer.
+ */
+ if (READ_ONCE(src_mm->has_pinned) && wp &&
+ page_maybe_dma_pinned(page)) {
+ /* We've got the page already; we're safe */
+ data->cow_old_page = page;
+ data->cow_oldpte = *src_pte;
+ return COPY_MM_BREAK_COW;
+ }
+
page_dup_rmap(page, false);
rss[mm_counter(page)]++;
}

+ if (wp) {
+ ptep_set_wrprotect(src_mm, addr, src_pte);
+ pte = pte_wrprotect(pte);
+ }
+
out_set_pte:
set_pte_at(dst_mm, addr, dst_pte, pte);
return COPY_MM_DONE;
}

+/*
+ * Install the pte with the copied page stored in `data'. Returns true when
+ * installation completes, or false when src pte has changed.
+ */
+static int pte_install_copied_page(struct mm_struct *dst_mm,
+ struct vm_area_struct *new,
+ pte_t *src_pte, pte_t *dst_pte,
+ unsigned long addr, int *rss,
+ struct copy_mm_data *data)
+{
+ struct page *new_page = data->cow_new_page;
+ pte_t entry;
+
+ if (!pte_same(*src_pte, data->cow_oldpte)) {
+ /* PTE has changed under us. Release the page and retry */
+ page_release_cow(data);
+ return false;
+ }
+
+ entry = mk_pte(new_page, new->vm_page_prot);
+ entry = pte_sw_mkyoung(entry);
+ entry = maybe_mkwrite(pte_mkdirty(entry), new);
+ page_add_new_anon_rmap(new_page, new, addr, false);
+ set_pte_at(dst_mm, addr, dst_pte, entry);
+ rss[mm_counter(new_page)]++;
+
+ /*
+ * Manually clear the new page pointer since we've moved ownership to
+ * the newly armed PTE.
+ */
+ data->cow_new_page = NULL;
+
+ return true;
+}
+
static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
struct vm_area_struct *new,
@@ -830,16 +953,23 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
- int progress, copy_ret = COPY_MM_DONE;
+ int progress, ret, copy_ret = COPY_MM_DONE;
int rss[NR_MM_COUNTERS];
struct copy_mm_data data;

again:
+ /* We don't reset this for COPY_MM_BREAK_COW */
+ memset(&data, 0, sizeof(data));
+
+again_break_cow:
init_rss_vec(rss);

dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
- if (!dst_pte)
+ if (!dst_pte) {
+ /* Guarantee that the new page is released if there is */
+ page_release_cow(&data);
return -ENOMEM;
+ }
src_pte = pte_offset_map(src_pmd, addr);
src_ptl = pte_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
break;
}
+
+ if (unlikely(data.cow_new_page)) {
+ /*
+ * If cow_new_page set, we must be at the 2nd round of
+ * a previous COPY_MM_BREAK_COW. Try to arm the new
+ * page now. Note that in all cases page_break_cow()
+ * will properly release the objects in copy_mm_data.
+ */
+ WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
+ if (pte_install_copied_page(dst_mm, new, src_pte,
+ dst_pte, addr, rss,
+ &data)) {
+ /* We installed the pte successfully; move on */
+ progress++;
+ continue;
+ }
+ /* PTE changed. Retry this pte (falls through) */
+ }
+
if (pte_none(*src_pte)) {
progress++;
continue;
@@ -882,8 +1031,19 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
return -ENOMEM;
break;
- default:
+ case COPY_MM_BREAK_COW:
+ /* Do accounting onto parent mm directly */
+ ret = page_duplicate(src_mm, vma, addr, &data);
+ if (ret)
+ return ret;
+ goto again_break_cow;
+ case COPY_MM_DONE:
+ /* This means we're all good. */
break;
+ default:
+ /* This should mean copy_ret < 0. Time to fail this fork().. */
+ WARN_ON_ONCE(copy_ret >= 0);
+ return copy_ret;
}

if (addr != end)
--
2.26.2

2020-09-21 21:24:24

by Peter Xu

Subject: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

Pinned pages shouldn't be write-protected when fork() happens, because
follow-up copy-on-write on these pages could cause the pinned pages to be
replaced by random newly allocated pages.

For huge PMDs, we split the huge pmd if pinning is detected, so that future
handling will be done at the PTE level (with our latest changes, each of the
small pages will be copied). We can achieve this by letting copy_huge_pmd()
return -EAGAIN for pinned pages, so that we'll fall through in
copy_pmd_range() and finally land in the next copy_pte_range() call.

Huge PUDs are even more special - so far they do not support anonymous pages.
But they can actually be handled the same way as huge PMDs, even if splitting
a huge PUD means erasing the PUD entry. It'll guarantee that the follow-up
fault-ins will remap the same pages in either parent/child later.

This might not be the most efficient way, but it should be easy and clean
enough. It should be fine, since we're tackling a very rare case just to make
sure userspace that pinned some thps will still work even without
MADV_DONTFORK and after they fork()ed.
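
For reference, the fallthrough in copy_pmd_range() that the -EAGAIN return
relies on looks roughly like below (simplified from the existing code; not
part of this patch):

if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) ||
    pmd_devmap(*src_pmd)) {
        int err;

        VM_BUG_ON_VMA(next - addr != HPAGE_PMD_SIZE, vma);
        err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr, vma);
        if (err == -ENOMEM)
                return -ENOMEM;
        if (!err)
                continue;
        /* fall through: -EAGAIN means copy this range with ptes instead */
}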

Signed-off-by: Peter Xu <[email protected]>
---
mm/huge_memory.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ff29cc3d55c..c40aac0ad87e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+
+ /*
+ * If this page is a potentially pinned page, split and retry the fault
+ * with smaller page size. Normally this should not happen because the
+ * userspace should use MADV_DONTFORK upon pinned regions. This is a
+ * best effort that the pinned pages won't be replaced by another
+ * random page during the coming copy-on-write.
+ */
+ if (unlikely(READ_ONCE(src_mm->has_pinned) &&
+ page_maybe_dma_pinned(src_page))) {
+ pte_free(dst_mm, pgtable);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ __split_huge_pmd(vma, src_pmd, addr, false, NULL);
+ return -EAGAIN;
+ }
+
get_page(src_page);
page_dup_rmap(src_page, true);
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1177,6 +1194,15 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
/* No huge zero pud yet */
}

+ /* Please refer to comments in copy_huge_pmd() */
+ if (unlikely(READ_ONCE(src_mm->has_pinned) &&
+ page_maybe_dma_pinned(pud_page(pud)))) {
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ __split_huge_pud(vma, src_pud, addr);
+ return -EAGAIN;
+ }
+
pudp_set_wrprotect(src_mm, addr, src_pud);
pud = pud_mkold(pud_wrprotect(pud));
set_pud_at(dst_mm, addr, dst_pud, pud);
--
2.26.2

2020-09-21 21:57:36

by Jann Horn

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <[email protected]> wrote:
> This patch is greatly inspired by the discussions on the list from Linus, Jason
> Gunthorpe and others [1].
>
> It allows copy_pte_range() to do early cow if the pages were pinned on the
> source mm. Currently we don't have an accurate way to know whether a page is
> pinned or not. The only thing we have is page_maybe_dma_pinned(). However
> that's good enough for now. Especially, with the newly added mm->has_pinned
> flag to make sure we won't affect processes that never pinned any pages.

To clarify: This patch only handles pin_user_pages() callers and
doesn't try to address other GUP users, right? E.g. if task A uses
process_vm_write() on task B while task B is going through fork(),
that can still race in such a way that the written data only shows up
in the child and not in B, right?

I dislike the whole pin_user_pages() concept because (as far as I
understand) it fundamentally tries to fix a problem in the subset of
cases that are more likely to occur in practice (long-term pins
overlapping with things like writeback), and ignores the rarer cases
("short-term" GUP).

2020-09-21 23:49:16

by John Hubbard

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On 9/21/20 2:55 PM, Jann Horn wrote:
> On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <[email protected]> wrote:
...
> I dislike the whole pin_user_pages() concept because (as far as I
> understand) it fundamentally tries to fix a problem in the subset of
> cases that are more likely to occur in practice (long-term pins
> overlapping with things like writeback), and ignores the rarer cases
> ("short-term" GUP).
>

Well, no, that's not really fair. pin_user_pages() provides a key
prerequisite to fixing *all* of the bugs in that area, not just a
subset. The 5 cases in Documentation/core-api/pin_user_pages.rst cover
this pretty well. Or if they don't, let me know and I'll have another
pass at it.

The case for a "pin count" that is (logically) separate from a
page->_refcount is real, and it fixes real problems. An elevated
refcount can be caused by a lot of things, but it can normally be waited
for and/or retried. The FOLL_PIN pages cannot.

Of course, a valid remaining criticism of the situation is, "why not
just *always* mark any of these pages as "dma-pinned"? In other words,
why even have a separate gup/pup API? And in fact, perhaps eventually
we'll just get rid of the get_user_pages*() side of it. But the pin
count will need to remain, in order to discern between DMA pins and
temporary refcount boosts.
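
(For context, the heuristic in question is roughly the following, heavily
simplified; the real page_maybe_dma_pinned() also special-cases compound
pages that carry a dedicated pincount field.)

#define GUP_PIN_COUNTING_BIAS  (1U << 10)

/*
 * Simplified sketch: FOLL_PIN biases page->_refcount by
 * GUP_PIN_COUNTING_BIAS, so a refcount at or above the bias is treated as
 * "maybe DMA-pinned", while ordinary short-lived refcount boosts stay far
 * below it.
 */
static inline bool page_maybe_dma_pinned_sketch(struct page *page)
{
        return page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;
}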

thanks,
--
John Hubbard
NVIDIA

2020-09-21 23:50:30

by Peter Xu

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

Hi, Jann,

On Mon, Sep 21, 2020 at 11:55:06PM +0200, Jann Horn wrote:
> On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <[email protected]> wrote:
> > This patch is greatly inspired by the discussions on the list from Linus, Jason
> > Gunthorpe and others [1].
> >
> > It allows copy_pte_range() to do early cow if the pages were pinned on the
> > source mm. Currently we don't have an accurate way to know whether a page is
> > pinned or not. The only thing we have is page_maybe_dma_pinned(). However
> > that's good enough for now. Especially, with the newly added mm->has_pinned
> > flag to make sure we won't affect processes that never pinned any pages.
>
> To clarify: This patch only handles pin_user_pages() callers and
> doesn't try to address other GUP users, right? E.g. if task A uses
> process_vm_write() on task B while task B is going through fork(),
> that can still race in such a way that the written data only shows up
> in the child and not in B, right?

I saw that process_vm_write() is using pin_user_pages_remote(), so I think
with this patch applied the data will only be written to B, not to the child.
That's because when B fork()s with these temporarily pinned pages, it will
copy the pages rather than write-protect them. IIUC the child could still see
partial data, but in the end (after the pages are unpinned) B should always
have the complete data set.

>
> I dislike the whole pin_user_pages() concept because (as far as I
> understand) it fundamentally tries to fix a problem in the subset of
> cases that are more likely to occur in practice (long-term pins
> overlapping with things like writeback), and ignores the rarer cases
> ("short-term" GUP).

John/Jason or others may be better on commenting on this one. From my own
understanding, I thought it was the right thing to do so that we'll always
guarantee process B gets the whole data. From that pov this patch should make
sense even for short term gups. But maybe I've missed something.

--
Peter Xu

2020-09-21 23:50:31

by Jann Horn

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Tue, Sep 22, 2020 at 12:18 AM John Hubbard <[email protected]> wrote:
> On 9/21/20 2:55 PM, Jann Horn wrote:
> > On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <[email protected]> wrote:
> ...
> > I dislike the whole pin_user_pages() concept because (as far as I
> > understand) it fundamentally tries to fix a problem in the subset of
> > cases that are more likely to occur in practice (long-term pins
> > overlapping with things like writeback), and ignores the rarer cases
> > ("short-term" GUP).
> >
>
> Well, no, that's not really fair. pin_user_pages() provides a key
> prerequisite to fixing *all* of the bugs in that area, not just a
> subset. The 5 cases in Documentation/core-api/pin_user_pages.rst cover
> this pretty well. Or if they don't, let me know and I'll have another
> pass at it.
>
> The case for a "pin count" that is (logically) separate from a
> page->_refcount is real, and it fixes real problems. An elevated
> refcount can be caused by a lot of things, but it can normally be waited
> for and/or retried. The FOLL_PIN pages cannot.
>
> Of course, a valid remaining criticism of the situation is, "why not
> just *always* mark any of these pages as "dma-pinned"? In other words,
> why even have a separate gup/pup API? And in fact, perhaps eventually
> we'll just get rid of the get_user_pages*() side of it. But the pin
> count will need to remain, in order to discern between DMA pins and
> temporary refcount boosts.

Ah... the documentation you linked implies that FOLL_WRITE should more
or less imply FOLL_PIN? I didn't realize that.

Whoops, and actually, process_vm_writev() does use FOLL_PIN
already, and I just grepped the code the wrong way.

Thanks for the enlightenment; I take back everything I said.

2020-09-22 01:16:35

by John Hubbard

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On 9/21/20 3:27 PM, Jann Horn wrote:
> On Tue, Sep 22, 2020 at 12:18 AM John Hubbard <[email protected]> wrote:
>> On 9/21/20 2:55 PM, Jann Horn wrote:
>>> On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <[email protected]> wrote:
>> ...
> Ah... the documentation you linked implies that FOLL_WRITE should more
> or less imply FOLL_PIN? I didn't realize that.
>

hmmm, that does seem like a pretty close approximation. It's certainly
true that if we were only doing reads, and also never marking pages
dirty, that the file system writeback code would be OK.

For completeness we should add: even just reading a page is still a
problem, if one also marks the page as dirty (which is inconsistent and
wrong, but still). That's because the file system code can then break,
during writeback in particular.


thanks,
--
John Hubbard
NVIDIA

2020-09-22 08:31:56

by John Hubbard

Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On 9/21/20 2:20 PM, Peter Xu wrote:
...
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7ff29cc3d55c..c40aac0ad87e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> src_page = pmd_page(pmd);
> VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> +
> + /*
> + * If this page is a potentially pinned page, split and retry the fault
> + * with smaller page size. Normally this should not happen because the
> + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> + * best effort that the pinned pages won't be replaced by another
> + * random page during the coming copy-on-write.
> + */
> + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> + page_maybe_dma_pinned(src_page))) {

This condition would make a good static inline function. It's used in 3 places,
and the condition is quite special and worth documenting, and having a separate
function helps with that, because the function name adds to the story. I'd suggest
approximately:

page_likely_dma_pinned()

for the name.
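
A rough sketch of what that helper might look like, just to illustrate the
idea (untested):

static inline bool page_likely_dma_pinned(struct mm_struct *mm,
                                          struct page *page)
{
        /* Hint only: false positives just mean an unnecessary early copy */
        return READ_ONCE(mm->has_pinned) && page_maybe_dma_pinned(page);
}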

> + pte_free(dst_mm, pgtable);
> + spin_unlock(src_ptl);
> + spin_unlock(dst_ptl);
> + __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> + return -EAGAIN;
> + }


Why wait until we are so deep into this routine to detect this and unwind?
It seems like you could do a check near the beginning of this routine and
handle it there, with less unwinding. In fact, after taking only the src_ptl,
the check could be made, right?


thanks,
--
John Hubbard
NVIDIA

2020-09-22 08:32:12

by John Hubbard

Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On 9/21/20 2:17 PM, Peter Xu wrote:
> There's one special path for copy_one_pte() with swap entries, in which
> add_swap_count_continuation(GFP_ATOMIC) might fail. In that case we'll return

I might be looking at the wrong place, but the existing code seems to call
add_swap_count_continuation(GFP_KERNEL), not with GFP_ATOMIC?

> the swp_entry_t so that the caller will release the locks and redo the same
> thing with GFP_KERNEL.
>
> It's confusing when copy_one_pte() must return a swp_entry_t (even if all the
> ptes are non-swap entries). More importantly, we face other requirement to
> extend this "we need to do something else, but without the locks held" case.
>
> Rework the return value into something easier to understand, as defined in enum
> copy_mm_ret. We'll pass the swp_entry_t back using the newly introduced union

I like the documentation here, but it doesn't match what you did in the patch.
Actually, the documentation had the right idea (enum, rather than #define, for
COPY_MM_* items). Below...

> copy_mm_data parameter.
>
> Another trivial change is to move the reset of the "progress" counter into the
> retry path, so that we'll reset it for other reasons too.
>
> This should prepare us with adding new return codes, very soon.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/memory.c | 42 +++++++++++++++++++++++++++++-------------
> 1 file changed, 29 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7525147908c4..1530bb1070f4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -689,16 +689,24 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> }
> #endif
>
> +#define COPY_MM_DONE 0
> +#define COPY_MM_SWAP_CONT 1

Those should be enums, so as to get a little type safety and other goodness from
using non-macro items.
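
Something like this, reusing the names from the patch (sketch):

enum copy_mm_ret {
        COPY_MM_DONE = 0,
        COPY_MM_SWAP_CONT,
};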

...
> @@ -866,13 +877,18 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte_unmap_unlock(orig_dst_pte, dst_ptl);
> cond_resched();
>
> - if (entry.val) {
> - if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
> + switch (copy_ret) {
> + case COPY_MM_SWAP_CONT:
> + if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> return -ENOMEM;
> - progress = 0;

Yes. Definitely a little cleaner to reset this above, instead of here.

> + break;
> + default:
> + break;

I assume this no-op noise is to placate the compiler and/or static checkers. :)

I'm unable to find any actual problems with the diffs, aside from the nit about
using an enum.

thanks,
--
John Hubbard
NVIDIA

2020-09-22 10:10:31

by Oleg Nesterov

Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On 09/21, Peter Xu wrote:
>
> @@ -866,13 +877,18 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte_unmap_unlock(orig_dst_pte, dst_ptl);
> cond_resched();
>
> - if (entry.val) {
> - if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
> + switch (copy_ret) {
> + case COPY_MM_SWAP_CONT:
> + if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> return -ENOMEM;
> - progress = 0;
> + break;

Note that you didn't clear copy_ret, it is still COPY_MM_SWAP_CONT,

> + default:
> + break;
> }
> +
> if (addr != end)
> goto again;

After that the main loop can stop again because of need_resched(), and
in this case add_swap_count_continuation(data.entry) will be called again?

Oleg.

2020-09-22 10:19:54

by Oleg Nesterov

Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On 09/22, Oleg Nesterov wrote:
>
> On 09/21, Peter Xu wrote:
> >
> > @@ -866,13 +877,18 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > pte_unmap_unlock(orig_dst_pte, dst_ptl);
> > cond_resched();
> >
> > - if (entry.val) {
> > - if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
> > + switch (copy_ret) {
> > + case COPY_MM_SWAP_CONT:
> > + if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> > return -ENOMEM;
> > - progress = 0;
> > + break;
>
> Note that you didn't clear copy_ret, it is still COPY_MM_SWAP_CONT,
>
> > + default:
> > + break;
> > }
> > +
> > if (addr != end)
> > goto again;
>
> After that the main loop can stop again because of need_resched(), and
> in this case add_swap_count_continuation(data.entry) will be called again?

No, this is not possible, copy_one_pte() should be called at least once,
progress = 0 before restart. Sorry for noise.

Oleg.

2020-09-22 10:35:14

by Jan Kara

Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Mon 21-09-20 23:41:16, John Hubbard wrote:
> On 9/21/20 2:20 PM, Peter Xu wrote:
> ...
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7ff29cc3d55c..c40aac0ad87e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > src_page = pmd_page(pmd);
> > VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > +
> > + /*
> > + * If this page is a potentially pinned page, split and retry the fault
> > + * with smaller page size. Normally this should not happen because the
> > + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> > + * best effort that the pinned pages won't be replaced by another
> > + * random page during the coming copy-on-write.
> > + */
> > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > + page_maybe_dma_pinned(src_page))) {
>
> This condition would make a good static inline function. It's used in 3
> places, and the condition is quite special and worth documenting, and
> having a separate function helps with that, because the function name
> adds to the story. I'd suggest approximately:
>
> page_likely_dma_pinned()
>
> for the name.

Well, but we should also capture that this really only works for anonymous
pages. For file pages mm->has_pinned does not work, because the page may
still be pinned by a completely unrelated process, as Jann already properly
pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
Possibly also assert PageAnon(page) in it if we want to be paranoid...
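
I.e. roughly (sketch only, building on the helper name suggested above):

static inline bool anon_page_likely_pinned(struct mm_struct *mm,
                                           struct page *page)
{
        /* mm->has_pinned is only a meaningful hint for anonymous pages */
        VM_WARN_ON_ONCE(!PageAnon(page));
        return READ_ONCE(mm->has_pinned) && page_maybe_dma_pinned(page);
}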

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-22 11:53:08

by Oleg Nesterov

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On 09/21, Peter Xu wrote:
>
> @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
> +
> + if (unlikely(data.cow_new_page)) {
> + /*
> + * If cow_new_page set, we must be at the 2nd round of
> + * a previous COPY_MM_BREAK_COW. Try to arm the new
> + * page now. Note that in all cases page_break_cow()
> + * will properly release the objects in copy_mm_data.
> + */
> + WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> + if (pte_install_copied_page(dst_mm, new, src_pte,
> + dst_pte, addr, rss,
> + &data)) {
> + /* We installed the pte successfully; move on */
> + progress++;
> + continue;

I'm afraid I misread this patch too ;)

But it seems to me in this case the main loop can really "leak"
COPY_MM_BREAK_COW. Suppose the next 31 pte's are pte_none() and
need_resched() is true.

No?

Oleg.

2020-09-22 12:07:04

by Jason Gunthorpe

Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> Pinned pages shouldn't be write-protected when fork() happens, because follow
> up copy-on-write on these pages could cause the pinned pages to be replaced by
> random newly allocated pages.
>
> For huge PMDs, we split the huge pmd if pinning is detected. So that future
> handling will be done by the PTE level (with our latest changes, each of the
> small pages will be copied). We can achieve this by let copy_huge_pmd() return
> -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> finally land the next copy_pte_range() call.
>
> Huge PUDs will be even more special - so far it does not support anonymous
> pages. But it can actually be done the same as the huge PMDs even if the split
> huge PUDs means to erase the PUD entries. It'll guarantee the follow up fault
> ins will remap the same pages in either parent/child later.
>
> This might not be the most efficient way, but it should be easy and clean
> enough. It should be fine, since we're tackling with a very rare case just to
> make sure userspaces that pinned some thps will still work even without
> MADV_DONTFORK and after they fork()ed.
>
> Signed-off-by: Peter Xu <[email protected]>
> mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7ff29cc3d55c..c40aac0ad87e 100644
> +++ b/mm/huge_memory.c
> @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> src_page = pmd_page(pmd);
> VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> +
> + /*
> + * If this page is a potentially pinned page, split and retry the fault
> + * with smaller page size. Normally this should not happen because the
> + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> + * best effort that the pinned pages won't be replaced by another
> + * random page during the coming copy-on-write.
> + */
> + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> + page_maybe_dma_pinned(src_page))) {
> + pte_free(dst_mm, pgtable);
> + spin_unlock(src_ptl);
> + spin_unlock(dst_ptl);
> + __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> + return -EAGAIN;
> + }

Not sure why, but the PMD stuff here is not calling is_cow_mapping()
before doing the write protect. Seems like it might be an existing
bug?

In any event, the has_pinned logic shouldn't be used without also
checking is_cow_mapping(), so it should be added to that test. Same
remarks for PUD
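
I.e. the PMD-side check could become something like this (untested sketch of
the suggested change):

if (unlikely(is_cow_mapping(vma->vm_flags) &&
             READ_ONCE(src_mm->has_pinned) &&
             page_maybe_dma_pinned(src_page))) {
        pte_free(dst_mm, pgtable);
        spin_unlock(src_ptl);
        spin_unlock(dst_ptl);
        __split_huge_pmd(vma, src_pmd, addr, false, NULL);
        return -EAGAIN;
}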

Jason

2020-09-22 12:42:17

by Oleg Nesterov

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On 09/22, Oleg Nesterov wrote:
>
> On 09/21, Peter Xu wrote:
> >
> > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > break;
> > }
> > +
> > + if (unlikely(data.cow_new_page)) {
> > + /*
> > + * If cow_new_page set, we must be at the 2nd round of
> > + * a previous COPY_MM_BREAK_COW. Try to arm the new
> > + * page now. Note that in all cases page_break_cow()
> > + * will properly release the objects in copy_mm_data.
> > + */
> > + WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > + if (pte_install_copied_page(dst_mm, new, src_pte,
> > + dst_pte, addr, rss,
> > + &data)) {
> > + /* We installed the pte successfully; move on */
> > + progress++;
> > + continue;
>
> I'm afraid I misread this patch too ;)
>
> But it seems to me in this case the main loop can really "leak"
> COPY_MM_BREAK_COW. Suppose the the next 31 pte's are pte_none() and
> need_resched() is true.
>
> No?

If yes, perhaps we can simplify the copy_ret/cow_new_page logic and make
it more explicit?

Something like below, on top of this patch...

Oleg.


--- x/mm/memory.c
+++ x/mm/memory.c
@@ -704,17 +704,6 @@
};
};

-static inline void page_release_cow(struct copy_mm_data *data)
-{
- /* The old page should only be released in page_duplicate() */
- WARN_ON_ONCE(data->cow_old_page);
-
- if (data->cow_new_page) {
- put_page(data->cow_new_page);
- data->cow_new_page = NULL;
- }
-}
-
/*
* Duplicate the page for this PTE. Returns zero if page copied (so we need to
* retry on the same PTE again to arm the copied page very soon), or negative
@@ -925,7 +914,7 @@

if (!pte_same(*src_pte, data->cow_oldpte)) {
/* PTE has changed under us. Release the page and retry */
- page_release_cow(data);
+ put_page(data->cow_new_page);
return false;
}

@@ -936,12 +925,6 @@
set_pte_at(dst_mm, addr, dst_pte, entry);
rss[mm_counter(new_page)]++;

- /*
- * Manually clear the new page pointer since we've moved ownership to
- * the newly armed PTE.
- */
- data->cow_new_page = NULL;
-
return true;
}

@@ -958,16 +941,12 @@
struct copy_mm_data data;

again:
- /* We don't reset this for COPY_MM_BREAK_COW */
- memset(&data, 0, sizeof(data));
-
-again_break_cow:
init_rss_vec(rss);

dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte) {
- /* Guarantee that the new page is released if there is */
- page_release_cow(&data);
+ if (unlikely(copy_ret == COPY_MM_BREAK_COW))
+ put_page(data.cow_new_page);
return -ENOMEM;
}
src_pte = pte_offset_map(src_pmd, addr);
@@ -978,6 +957,22 @@
arch_enter_lazy_mmu_mode();

progress = 0;
+ if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
+ /*
+ * Note that in all cases pte_install_copied_page()
+ * will properly release the objects in copy_mm_data.
+ */
+ copy_ret = COPY_MM_DONE;
+ if (pte_install_copied_page(dst_mm, new, src_pte,
+ dst_pte, addr, rss,
+ &data)) {
+ /* We installed the pte successfully; move on */
+ progress++;
+ goto next;
+ }
+ /* PTE changed. Retry this pte (falls through) */
+ }
+
do {
/*
* We are holding two locks at this point - either of them
@@ -990,24 +985,6 @@
break;
}

- if (unlikely(data.cow_new_page)) {
- /*
- * If cow_new_page set, we must be at the 2nd round of
- * a previous COPY_MM_BREAK_COW. Try to arm the new
- * page now. Note that in all cases page_break_cow()
- * will properly release the objects in copy_mm_data.
- */
- WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
- if (pte_install_copied_page(dst_mm, new, src_pte,
- dst_pte, addr, rss,
- &data)) {
- /* We installed the pte successfully; move on */
- progress++;
- continue;
- }
- /* PTE changed. Retry this pte (falls through) */
- }
-
if (pte_none(*src_pte)) {
progress++;
continue;
@@ -1017,6 +994,7 @@
if (copy_ret != COPY_MM_DONE)
break;
progress += 8;
+next:
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

arch_leave_lazy_mmu_mode();
@@ -1030,13 +1008,14 @@
case COPY_MM_SWAP_CONT:
if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
return -ENOMEM;
- break;
+ copy_ret = COPY_MM_DONE;
+ goto again;
case COPY_MM_BREAK_COW:
/* Do accounting onto parent mm directly */
ret = page_duplicate(src_mm, vma, addr, &data);
if (ret)
return ret;
- goto again_break_cow;
+ goto again;
case COPY_MM_DONE:
/* This means we're all good. */
break;

2020-09-22 15:32:10

by Peter Xu

Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On Tue, Sep 22, 2020 at 12:11:29AM -0700, John Hubbard wrote:
> On 9/21/20 2:17 PM, Peter Xu wrote:
> > There's one special path for copy_one_pte() with swap entries, in which
> > add_swap_count_continuation(GFP_ATOMIC) might fail. In that case we'll return
>
> I might be looking at the wrong place, but the existing code seems to call
> add_swap_count_continuation(GFP_KERNEL), not with GFP_ATOMIC?

Ah, I wanted to reference the one in swap_duplicate().

>
> > the swp_entry_t so that the caller will release the locks and redo the same
> > thing with GFP_KERNEL.
> >
> > It's confusing when copy_one_pte() must return a swp_entry_t (even if all the
> > ptes are non-swap entries). More importantly, we face other requirement to
> > extend this "we need to do something else, but without the locks held" case.
> >
> > Rework the return value into something easier to understand, as defined in enum
> > copy_mm_ret. We'll pass the swp_entry_t back using the newly introduced union
>
> I like the documentation here, but it doesn't match what you did in the patch.
> Actually, the documentation had the right idea (enum, rather than #define, for
> COPY_MM_* items). Below...

Yeah, actually my very initial version had it as an enum, then I changed it
to macros because I started to want it to return negatives as errors.
Funnily enough, in the current version copy_one_pte() doesn't return an error
anymore... So yes, it would probably be a good idea to bring the enum back.

Also, we should be able to drop the negative handling with copy_ret too,
though that should be in the next patch.

>
> > copy_mm_data parameter.
> >
> > Another trivial change is to move the reset of the "progress" counter into the
> > retry path, so that we'll reset it for other reasons too.
> >
> > This should prepare us with adding new return codes, very soon.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/memory.c | 42 +++++++++++++++++++++++++++++-------------
> > 1 file changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 7525147908c4..1530bb1070f4 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -689,16 +689,24 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > }
> > #endif
> > +#define COPY_MM_DONE 0
> > +#define COPY_MM_SWAP_CONT 1
>
> Those should be enums, so as to get a little type safety and other goodness from
> using non-macro items.
>
> ...
> > @@ -866,13 +877,18 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > pte_unmap_unlock(orig_dst_pte, dst_ptl);
> > cond_resched();
> > - if (entry.val) {
> > - if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
> > + switch (copy_ret) {
> > + case COPY_MM_SWAP_CONT:
> > + if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> > return -ENOMEM;
> > - progress = 0;
>
> Yes. Definitely a little cleaner to reset this above, instead of here.
>
> > + break;
> > + default:
> > + break;
>
> I assume this no-op noise is to placate the compiler and/or static checkers. :)

This is (so far) for COPY_MM_DONE. I normally cover all cases in a
"switch()", and here "default" handles it. Even when I've covered all the
possibilities, I still tend to keep one "default" with a WARN_ON_ONCE(1) to
make sure nothing has been missed. Not sure whether that's the ideal way,
though.

Thanks,

--
Peter Xu

2020-09-22 15:39:38

by Peter Xu

Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On Tue, Sep 22, 2020 at 12:18:16PM +0200, Oleg Nesterov wrote:
> On 09/22, Oleg Nesterov wrote:
> >
> > On 09/21, Peter Xu wrote:
> > >
> > > @@ -866,13 +877,18 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > pte_unmap_unlock(orig_dst_pte, dst_ptl);
> > > cond_resched();
> > >
> > > - if (entry.val) {
> > > - if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
> > > + switch (copy_ret) {
> > > + case COPY_MM_SWAP_CONT:
> > > + if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> > > return -ENOMEM;
> > > - progress = 0;
> > > + break;
> >
> > Note that you didn't clear copy_ret, it is still COPY_MM_SWAP_CONT,
> >
> > > + default:
> > > + break;
> > > }
> > > +
> > > if (addr != end)
> > > goto again;
> >
> > After that the main loop can stop again because of need_resched(), and
> > in this case add_swap_count_continuation(data.entry) will be called again?
>
> No, this is not possible, copy_one_pte() should be called at least once,
> progress = 0 before restart. Sorry for noise.

Oh wait, I think you're right... when we get a COPY_MM_SWAP_CONT and goto
"again", then if there are 32 pte_none() ptes _plus_ a need_resched(), we
might reach the same add_swap_count_continuation() again with the same swp
entry.

However, since I didn't change this logic in this patch, it probably means
this bug is also in the original code before this series... I'm thinking
maybe I should prepare a standalone patch to clear the swp_entry_t and cc
stable.
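
Something like this on top of the pre-series code, I suppose (sketch only):

if (entry.val) {
        if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
                return -ENOMEM;
        /* Don't redo the continuation if the loop restarts later */
        entry.val = 0;
        progress = 0;
}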

Thanks,

--
Peter Xu

2020-09-22 15:53:08

by Oleg Nesterov

Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On 09/22, Peter Xu wrote:
>
> On Tue, Sep 22, 2020 at 12:18:16PM +0200, Oleg Nesterov wrote:
> > On 09/22, Oleg Nesterov wrote:
> > >
> > > On 09/21, Peter Xu wrote:
> > > >
> > > > @@ -866,13 +877,18 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > pte_unmap_unlock(orig_dst_pte, dst_ptl);
> > > > cond_resched();
> > > >
> > > > - if (entry.val) {
> > > > - if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
> > > > + switch (copy_ret) {
> > > > + case COPY_MM_SWAP_CONT:
> > > > + if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> > > > return -ENOMEM;
> > > > - progress = 0;
> > > > + break;
> > >
> > > Note that you didn't clear copy_ret, it is still COPY_MM_SWAP_CONT,
> > >
> > > > + default:
> > > > + break;
> > > > }
> > > > +
> > > > if (addr != end)
> > > > goto again;
> > >
> > > After that the main loop can stop again because of need_resched(), and
> > > in this case add_swap_count_continuation(data.entry) will be called again?
> >
> > No, this is not possible, copy_one_pte() should be called at least once,
> > progress = 0 before restart. Sorry for noise.
>
> Oh wait, I think you're right... when we get a COPY_MM_SWAP_CONT, goto "again",
> then if there're 32 pte_none() ptes _plus_ an need_resched(), then we might
> reach again at the same add_swap_count_continuation() with the same swp entry.

Yes, please see my reply to 4/5 ;)

> However since I didn't change this logic in this patch, it probably means this
> bug is also in the original code before this series... I'm thinking maybe I
> should prepare a standalone patch to clear the swp_entry_t and cc stable.

Well, if copy_one_pte(src_pte) hits a swap entry and returns entry.val != 0,
then pte_none(*src_pte) is not possible after the restart? This means that
copy_one_pte() will be called at least once.

So I _think_ that the current code is fine, but I can easily be wrong and I
agree this doesn't look clean.

Oleg.

2020-09-22 16:03:06

by Peter Xu

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Tue, Sep 22, 2020 at 02:40:14PM +0200, Oleg Nesterov wrote:
> On 09/22, Oleg Nesterov wrote:
> >
> > On 09/21, Peter Xu wrote:
> > >
> > > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > > break;
> > > }
> > > +
> > > + if (unlikely(data.cow_new_page)) {
> > > + /*
> > > + * If cow_new_page set, we must be at the 2nd round of
> > > + * a previous COPY_MM_BREAK_COW. Try to arm the new
> > > + * page now. Note that in all cases page_break_cow()
> > > + * will properly release the objects in copy_mm_data.
> > > + */
> > > + WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > > + if (pte_install_copied_page(dst_mm, new, src_pte,
> > > + dst_pte, addr, rss,
> > > + &data)) {
> > > + /* We installed the pte successfully; move on */
> > > + progress++;
> > > + continue;
> >
> > I'm afraid I misread this patch too ;)
> >
> > But it seems to me in this case the main loop can really "leak"
> > COPY_MM_BREAK_COW. Suppose the the next 31 pte's are pte_none() and
> > need_resched() is true.
> >
> > No?

I still think it's a no...

Note that now we'll reset "progress" every time before the do loop, so we'll
never reach need_resched() (since progress<32) before pte_install_copied_page()
when needed.

I explicitly put the pte_install_copied_page() into the loop just...

>
> If yes, perhaps we can simplify the copy_ret/cow_new_page logic and make
> it more explicit?
>
> Something like below, on top of this patch...
>
> Oleg.
>
>
> --- x/mm/memory.c
> +++ x/mm/memory.c
> @@ -704,17 +704,6 @@
> };
> };
>
> -static inline void page_release_cow(struct copy_mm_data *data)
> -{
> - /* The old page should only be released in page_duplicate() */
> - WARN_ON_ONCE(data->cow_old_page);
> -
> - if (data->cow_new_page) {
> - put_page(data->cow_new_page);
> - data->cow_new_page = NULL;
> - }
> -}

(I'm not very sure on whether I should drop this helper. I wanted to have more
spots for checking everything is right and raise if something got wrong, and I
also wanted to have the cow_new_page to never contain invalid page pointer too
since after the put_page() it's invalid (otherwise we'll need to set NULL when
we do put_page every time explicitly). I'll still tend to keep this if no
strong opinion.. or I can also drop it if there's another vote.)

> -
> /*
> * Duplicate the page for this PTE. Returns zero if page copied (so we need to
> * retry on the same PTE again to arm the copied page very soon), or negative
> @@ -925,7 +914,7 @@
>
> if (!pte_same(*src_pte, data->cow_oldpte)) {
> /* PTE has changed under us. Release the page and retry */
> - page_release_cow(data);
> + put_page(data->cow_new_page);
> return false;
> }
>
> @@ -936,12 +925,6 @@
> set_pte_at(dst_mm, addr, dst_pte, entry);
> rss[mm_counter(new_page)]++;
>
> - /*
> - * Manually clear the new page pointer since we've moved ownership to
> - * the newly armed PTE.
> - */
> - data->cow_new_page = NULL;
> -
> return true;
> }
>
> @@ -958,16 +941,12 @@
> struct copy_mm_data data;
>
> again:
> - /* We don't reset this for COPY_MM_BREAK_COW */
> - memset(&data, 0, sizeof(data));
> -
> -again_break_cow:
> init_rss_vec(rss);
>
> dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> if (!dst_pte) {
> - /* Guarantee that the new page is released if there is */
> - page_release_cow(&data);
> + if (unlikely(copy_ret == COPY_MM_BREAK_COW))
> + put_page(data.cow_new_page);
> return -ENOMEM;
> }
> src_pte = pte_offset_map(src_pmd, addr);
> @@ -978,6 +957,22 @@
> arch_enter_lazy_mmu_mode();
>
> progress = 0;
> + if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> + /*
> + * Note that in all cases pte_install_copied_page()
> + * will properly release the objects in copy_mm_data.
> + */
> + copy_ret = COPY_MM_DONE;
> + if (pte_install_copied_page(dst_mm, new, src_pte,
> + dst_pte, addr, rss,
> + &data)) {
> + /* We installed the pte successfully; move on */
> + progress++;
> + goto next;

... to avoid jumps like this because I think it's really tricky. :)

> + }
> + /* PTE changed. Retry this pte (falls through) */
> + }
> +
> do {
> /*
> * We are holding two locks at this point - either of them
> @@ -990,24 +985,6 @@
> break;
> }
>
> - if (unlikely(data.cow_new_page)) {
> - /*
> - * If cow_new_page set, we must be at the 2nd round of
> - * a previous COPY_MM_BREAK_COW. Try to arm the new
> - * page now. Note that in all cases page_break_cow()
> - * will properly release the objects in copy_mm_data.
> - */
> - WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> - if (pte_install_copied_page(dst_mm, new, src_pte,
> - dst_pte, addr, rss,
> - &data)) {
> - /* We installed the pte successfully; move on */
> - progress++;
> - continue;
> - }
> - /* PTE changed. Retry this pte (falls through) */
> - }
> -
> if (pte_none(*src_pte)) {
> progress++;
> continue;
> @@ -1017,6 +994,7 @@
> if (copy_ret != COPY_MM_DONE)
> break;
> progress += 8;
> +next:
> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>
> arch_leave_lazy_mmu_mode();
> @@ -1030,13 +1008,14 @@
> case COPY_MM_SWAP_CONT:
> if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> return -ENOMEM;
> - break;
> + copy_ret = COPY_MM_DONE;

Kind of a continuation of the discussion from previous patch - I think we'd
better reset copy_ret not only for this case, but move it after the switch
(just in case there'll be new ones). The new BREAK_COW uses goto so it's quite
special.

> + goto again;

I feel like this could go wrong without the "addr != end" check later, when
this is the last pte to check.

Thanks,

> case COPY_MM_BREAK_COW:
> /* Do accounting onto parent mm directly */
> ret = page_duplicate(src_mm, vma, addr, &data);
> if (ret)
> return ret;
> - goto again_break_cow;
> + goto again;
> case COPY_MM_DONE:
> /* This means we're all good. */
> break;
>

--
Peter Xu

2020-09-22 16:05:23

by Peter Xu

Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On Tue, Sep 22, 2020 at 05:48:46PM +0200, Oleg Nesterov wrote:
> > However since I didn't change this logic in this patch, it probably means this
> > bug is also in the original code before this series... I'm thinking maybe I
> > should prepare a standalone patch to clear the swp_entry_t and cc stable.
>
> Well, if copy_one_pte(src_pte) hits a swap entry and returns entry.val != 0, then
> pte_none(*src_pte) is not possible after restart? This means that copy_one_pte()
> will be called at least once.

Note that we've released the page table locks, so afaict the old swp entry
can be gone under us when we go back to the "do" loop... :) It's an extreme
corner case, but maybe still good to fix, with extra clarity as a (good) side
effect.

--
Peter Xu

2020-09-22 16:54:59

by Oleg Nesterov

Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On 09/22, Peter Xu wrote:
>
> On Tue, Sep 22, 2020 at 02:40:14PM +0200, Oleg Nesterov wrote:
> > On 09/22, Oleg Nesterov wrote:
> > >
> > > On 09/21, Peter Xu wrote:
> > > >
> > > > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > > > break;
> > > > }
> > > > +
> > > > + if (unlikely(data.cow_new_page)) {
> > > > + /*
> > > > + * If cow_new_page set, we must be at the 2nd round of
> > > > + * a previous COPY_MM_BREAK_COW. Try to arm the new
> > > > + * page now. Note that in all cases page_break_cow()
> > > > + * will properly release the objects in copy_mm_data.
> > > > + */
> > > > + WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > > > + if (pte_install_copied_page(dst_mm, new, src_pte,
> > > > + dst_pte, addr, rss,
> > > > + &data)) {
> > > > + /* We installed the pte successfully; move on */
> > > > + progress++;
> > > > + continue;
> > >
> > > I'm afraid I misread this patch too ;)
> > >
> > > But it seems to me in this case the main loop can really "leak"
> > > COPY_MM_BREAK_COW. Suppose the the next 31 pte's are pte_none() and
> > > need_resched() is true.
> > >
> > > No?
>
> I still think it's a no...
>
> Note that now we'll reset "progress" every time before the do loop, so we'll
> never reach need_resched() (since progress<32) before pte_install_copied_page()
> when needed.

Yes. But copy_ret is still COPY_MM_BREAK_COW after pte_install_copied_page().
Now suppose that the next 31 pte's are pte_none(), progress will be incremented
every time.

> I explicitly put the pte_install_copied_page() into the loop just...
...
> > progress = 0;
> > + if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> > + /*
> > + * Note that in all cases pte_install_copied_page()
> > + * will properly release the objects in copy_mm_data.
> > + */
> > + copy_ret = COPY_MM_DONE;
> > + if (pte_install_copied_page(dst_mm, new, src_pte,
> > + dst_pte, addr, rss,
> > + &data)) {
> > + /* We installed the pte successfully; move on */
> > + progress++;
> > + goto next;
>
> ... to avoid jumps like this because I think it's really tricky. :)

To me it looks better before the main loop because we know that
data.cow_new_page != NULL is only possible at the 1st iteration after
restart ;)

But I agree, this is subjective, please ignore. However, I still think
it is better to rely on the copy_ret == COPY_MM_BREAK_COW check rather
than data.cow_new_page != NULL.

> > case COPY_MM_SWAP_CONT:
> > if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> > return -ENOMEM;
> > - break;
> > + copy_ret = COPY_MM_DONE;
>
> Kind of a continuation of the discussion from previous patch - I think we'd
> better reset copy_ret not only for this case, but move it after the switch
> (just in case there'll be new ones). The new BREAK_COW uses goto so it's quite
> special.
>
> > + goto again;
>
> I feel like this could go wrong without the "addr != end" check later, when
> this is the last pte to check.

How? We know that copy_one_pte() failed and returned COPY_MM_SWAP_CONT
before addr = end.

And this matters "case COPY_MM_BREAK_COW" below which does "goto again"
without the "addr != end" check.

Oleg.

2020-09-22 16:55:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On 09/22, Peter Xu wrote:
>
> On Tue, Sep 22, 2020 at 05:48:46PM +0200, Oleg Nesterov wrote:
> > > However since I didn't change this logic in this patch, it probably means this
> > > bug is also in the original code before this series... I'm thinking maybe I
> > > should prepare a standalone patch to clear the swp_entry_t and cc stable.
> >
> > Well, if copy_one_pte(src_pte) hits a swap entry and returns entry.val != 0, then
> > pte_none(*src_pte) is not possible after restart? This means that copy_one_pte()
> > will be called at least once.
>
> Note that we've released the page table locks, so afaict the old swp entry can
> be gone under us when we go back to the "do" loop... :)

But how?

I am just curious, I don't understand this code enough.

Oleg.

2020-09-22 18:17:09

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On Tue, Sep 22, 2020 at 06:53:55PM +0200, Oleg Nesterov wrote:
> On 09/22, Peter Xu wrote:
> >
> > On Tue, Sep 22, 2020 at 05:48:46PM +0200, Oleg Nesterov wrote:
> > > > However since I didn't change this logic in this patch, it probably means this
> > > > bug is also in the original code before this series... I'm thinking maybe I
> > > > should prepare a standalone patch to clear the swp_entry_t and cc stable.
> > >
> > > Well, if copy_one_pte(src_pte) hits a swap entry and returns entry.val != 0, then
> > > pte_none(*src_pte) is not possible after restart? This means that copy_one_pte()
> > > will be called at least once.
> >
> > Note that we've released the page table locks, so afaict the old swp entry can
> > be gone under us when we go back to the "do" loop... :)
>
> But how?
>
> I am just curious, I don't understand this code enough.

Me neither.

The point is I think we can't assume *src_pte will read the same if we have
released the src_ptl in copy_pte_range(), because imho the src_ptl is the only
thing protecting it. Or to be more explicit, we need pte_alloc_map_lock() to
read a stable pmd/pte, or to hold it before updating (since the src_ptl itself
could change too).

Thanks,

--
Peter Xu

2020-09-22 18:27:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On 09/22, Peter Xu wrote:
>
> On Tue, Sep 22, 2020 at 06:53:55PM +0200, Oleg Nesterov wrote:
> > On 09/22, Peter Xu wrote:
> > >
> > > On Tue, Sep 22, 2020 at 05:48:46PM +0200, Oleg Nesterov wrote:
> > > > > However since I didn't change this logic in this patch, it probably means this
> > > > > bug is also in the original code before this series... I'm thinking maybe I
> > > > > should prepare a standalone patch to clear the swp_entry_t and cc stable.
> > > >
> > > > Well, if copy_one_pte(src_pte) hits a swap entry and returns entry.val != 0, then
> > > > pte_none(*src_pte) is not possible after restart? This means that copy_one_pte()
> > > > will be called at least once.
> > >
> > > Note that we've released the page table locks, so afaict the old swp entry can
> > > be gone under us when we go back to the "do" loop... :)
> >
> > But how?
> >
> > I am just curious, I don't understand this code enough.
>
> Me neither.
>
> The point is I think we can't assume *src_pte will read the same if we have
> released the src_ptl in copy_pte_range(),

This is clear.

But I still think that !pte_none() -> pte_none() transition is not possible
under mmap_write_lock()...

OK, let me repeat that I don't understand these code paths enough; let me reword:
I don't see how this transition is possible.

Oleg.

2020-09-22 18:37:13

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Tue, Sep 22, 2020 at 06:52:17PM +0200, Oleg Nesterov wrote:
> On 09/22, Peter Xu wrote:
> >
> > On Tue, Sep 22, 2020 at 02:40:14PM +0200, Oleg Nesterov wrote:
> > > On 09/22, Oleg Nesterov wrote:
> > > >
> > > > On 09/21, Peter Xu wrote:
> > > > >
> > > > > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > > > > break;
> > > > > }
> > > > > +
> > > > > + if (unlikely(data.cow_new_page)) {
> > > > > + /*
> > > > > + * If cow_new_page set, we must be at the 2nd round of
> > > > > + * a previous COPY_MM_BREAK_COW. Try to arm the new
> > > > > + * page now. Note that in all cases page_break_cow()
> > > > > + * will properly release the objects in copy_mm_data.
> > > > > + */
> > > > > + WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > > > > + if (pte_install_copied_page(dst_mm, new, src_pte,
> > > > > + dst_pte, addr, rss,
> > > > > + &data)) {
> > > > > + /* We installed the pte successfully; move on */
> > > > > + progress++;
> > > > > + continue;
> > > >
> > > > I'm afraid I misread this patch too ;)
> > > >
> > > > But it seems to me in this case the main loop can really "leak"
> > > > COPY_MM_BREAK_COW. Suppose the the next 31 pte's are pte_none() and
> > > > need_resched() is true.
> > > >
> > > > No?
> >
> > I still think it's a no...
> >
> > Note that now we'll reset "progress" every time before the do loop, so we'll
> > never reach need_resched() (since progress<32) before pte_install_copied_page()
> > when needed.
>
> Yes. But copy_ret is still COPY_MM_BREAK_COW after pte_install_copied_page().
> Now suppose that the next 31 pte's are pte_none(), progress will be incremented
> every time.

Yes, I think you're right - I'll need to reset that.

>
> > I explicitly put the pte_install_copied_page() into the loop just...
> ...
> > > progress = 0;
> > > + if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> > > + /*
> > > + * Note that in all cases pte_install_copied_page()
> > > + * will properly release the objects in copy_mm_data.
> > > + */
> > > + copy_ret = COPY_MM_DONE;
> > > + if (pte_install_copied_page(dst_mm, new, src_pte,
> > > + dst_pte, addr, rss,
> > > + &data)) {
> > > + /* We installed the pte successfully; move on */
> > > + progress++;
> > > + goto next;
> >
> > ... to avoid jumps like this because I think it's really tricky. :)
>
> To me it looks better before the main loop because we know that
> data.cow_new_page != NULL is only possible at the 1st iterattion after
> restart ;)
>
> But I agree, this is subjective, please ignore.

Thanks. For simplicity, I'll keep the code mostly as is. But I'm still open
to change if e.g. someone else also prefers the other way.

> However, I still think
> it is better to rely on the copy_ret == COPY_MM_BREAK_COW check rather
> than data.cow_new_page != NULL.

Yes. Logically we should check both, but now I've written it as:

if (unlikely(data.cow_new_page)) {
WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
...
}

I think it's even safer because it's actually checking both, and it also warns
if only cow_new_page is set, which should never happen anyway.

Or I can also do it in inverted order if you think better:

if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
WARN_ON_ONCE(!data.cow_new_page);
...
}

>
> > > case COPY_MM_SWAP_CONT:
> > > if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> > > return -ENOMEM;
> > > - break;
> > > + copy_ret = COPY_MM_DONE;
> >
> > Kind of a continuation of the discussion from previous patch - I think we'd
> > better reset copy_ret not only for this case, but move it after the switch
> > (just in case there'll be new ones). The new BREAK_COW uses goto so it's quite
> > special.
> >
> > > + goto again;
> >
> > I feel like this could go wrong without the "addr != end" check later, when
> > this is the last pte to check.
>
> How? We know that copy_one_pte() failed and returned COPY_MM_SWAP_CONT
> before addr = end.

I think you're right, again. :)

Thanks,

--
Peter Xu

2020-09-22 18:47:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On 09/22, Peter Xu wrote:
>
> Or I can also do it in inverted order if you think better:
>
> if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> WARN_ON_ONCE(!data.cow_new_page);
> ...
> }

Peter, let me say this again. I don't understand this code enough, you
can safely ignore me ;)

However. Personally I strongly prefer the above. Personally I really
dislike this part of 4/5:

again:
+ /* We don't reset this for COPY_MM_BREAK_COW */
+ memset(&data, 0, sizeof(data));
+
+again_break_cow:

If we rely on "copy_ret == COPY_MM_BREAK_COW" we can unify "again" and
"again_break_cow", we don't need to clear ->cow_new_page, this makes the
logic more understandable. To me at least ;)
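
Roughly what I have in mind (just a sketch against the labels/fields of 4/5,
not a tested diff):

again:
	/* keep data.cow_new_page across a BREAK_COW restart */
	if (copy_ret != COPY_MM_BREAK_COW)
		memset(&data, 0, sizeof(data));
	...
	/* the loop head then checks copy_ret, not data.cow_new_page */

IOW a single restart label, with the state living in copy_ret only.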

Oleg.

2020-09-22 18:50:49

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On Tue, Sep 22, 2020 at 08:23:18PM +0200, Oleg Nesterov wrote:
> On 09/22, Peter Xu wrote:
> >
> > On Tue, Sep 22, 2020 at 06:53:55PM +0200, Oleg Nesterov wrote:
> > > On 09/22, Peter Xu wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 05:48:46PM +0200, Oleg Nesterov wrote:
> > > > > > However since I didn't change this logic in this patch, it probably means this
> > > > > > bug is also in the original code before this series... I'm thinking maybe I
> > > > > > should prepare a standalone patch to clear the swp_entry_t and cc stable.
> > > > >
> > > > > Well, if copy_one_pte(src_pte) hits a swap entry and returns entry.val != 0, then
> > > > > pte_none(*src_pte) is not possible after restart? This means that copy_one_pte()
> > > > > will be called at least once.
> > > >
> > > > Note that we've released the page table locks, so afaict the old swp entry can
> > > > be gone under us when we go back to the "do" loop... :)
> > >
> > > But how?
> > >
> > > I am just curious, I don't understand this code enough.
> >
> > Me neither.
> >
> > The point is I think we can't assume *src_pte will read the same if we have
> > released the src_ptl in copy_pte_range(),
>
> This is clear.
>
> But I still think that !pte_none() -> pte_none() transition is not possible
> under mmap_write_lock()...
>
> OK, let me repeat I don't understans these code paths enough, let me reword:
> I don't see how this transition is possible.

Though I guess I'll keep my wording, because I still think it's accurate to
me. :)

Can we e.g. punch a page hole without changing vmas?

--
Peter Xu

2020-09-22 20:06:16

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On 9/22/20 3:33 AM, Jan Kara wrote:
> On Mon 21-09-20 23:41:16, John Hubbard wrote:
>> On 9/21/20 2:20 PM, Peter Xu wrote:
>> ...
>>> + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
>>> + page_maybe_dma_pinned(src_page))) {
>>
>> This condition would make a good static inline function. It's used in 3
>> places, and the condition is quite special and worth documenting, and
>> having a separate function helps with that, because the function name
>> adds to the story. I'd suggest approximately:
>>
>> page_likely_dma_pinned()
>>
>> for the name.
>
> Well, but we should also capture that this really only works for anonymous
> pages. For file pages mm->has_pinned does not work because the page may be
> still pinned by completely unrelated process as Jann already properly
> pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> Possibly also assert PageAnon(page) in it if we want to be paranoid...
>
> Honza

The file-backed case doesn't really change anything, though:
page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
can get a false positive. Just like here, with an mm->has_pinned that
could be a false positive for a process.

And for that reason, I'm also not sure an "assert PageAnon(page)" is
desirable. That assertion would prevent file-backed callers from being
able to call a function that provides a fuzzy answer, but I don't see
why you'd want or need to do that. The goal here is to make the fuzzy
answer a little bit more definite, but it's not "broken" just because
the result is still fuzzy, right?

Apologies if I'm missing a huge point here... :)


thanks,
--
John Hubbard
NVIDIA

2020-09-23 01:30:20

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Tue, Sep 22, 2020 at 08:44:00PM +0200, Oleg Nesterov wrote:
> On 09/22, Peter Xu wrote:
> >
> > Or I can also do it in inverted order if you think better:
> >
> > if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> > WARN_ON_ONCE(!data.cow_new_page);
> > ...
> > }
>
> Peter, let me say this again. I don't understand this code enough, you
> can safely ignore me ;)

Why? I appreciate every single comment from you! :)

>
> However. Personally I strongly prefer the above. Personally I really
> dislike this part of 4/5:
>
> again:
> + /* We don't reset this for COPY_MM_BREAK_COW */
> + memset(&data, 0, sizeof(data));
> +
> +again_break_cow:
>
> If we rely on "copy_ret == COPY_MM_BREAK_COW" we can unify "again" and
> "again_break_cow", we don't need to clear ->cow_new_page, this makes the
> logic more understandable. To me at least ;)

I see your point. I'll definitely try it out. I think I'll at least use what
you preferred above since it's actually the same as before, logically. Then
I'll consider dropping the again_break_cow, as long as I'm still as confident
after I make the change that nothing is leaked :).

Thanks,

--
Peter Xu

2020-09-23 06:56:24

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On 09/22, Peter Xu wrote:
>
> On Tue, Sep 22, 2020 at 08:23:18PM +0200, Oleg Nesterov wrote:
> >
> > But I still think that !pte_none() -> pte_none() transition is not possible
> > under mmap_write_lock()...
> >
> > OK, let me repeat I don't understans these code paths enough, let me reword:
> > I don't see how this transition is possible.
>
> Though I guess I'll keep my wording, because I still think it's accurate to
> me. :)
>
> Can we e.g. punch a page hole without changing vmas?

punch a hole? I don't know what that means...

However, I think you are right anyway. I forgot that (at least) truncate can
clear this pte without mmap_sem after pte_unmap_unlock().

So I think you are right, the current code is wrong too.

Thanks!

Oleg.

2020-09-23 09:23:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Tue 22-09-20 13:01:13, John Hubbard wrote:
> On 9/22/20 3:33 AM, Jan Kara wrote:
> > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > ...
> > > > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > + page_maybe_dma_pinned(src_page))) {
> > >
> > > This condition would make a good static inline function. It's used in 3
> > > places, and the condition is quite special and worth documenting, and
> > > having a separate function helps with that, because the function name
> > > adds to the story. I'd suggest approximately:
> > >
> > > page_likely_dma_pinned()
> > >
> > > for the name.
> >
> > Well, but we should also capture that this really only works for anonymous
> > pages. For file pages mm->has_pinned does not work because the page may be
> > still pinned by completely unrelated process as Jann already properly
> > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> >
> > Honza
>
> The file-backed case doesn't really change anything, though:
> page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> can get a false positive. Just like here, with an mm->has_pinned that
> could be a false positive for a process.
>
> And for that reason, I'm also not sure an "assert PageAnon(page)" is
> desirable. That assertion would prevent file-backed callers from being
> able to call a function that provides a fuzzy answer, but I don't see
> why you'd want or need to do that. The goal here is to make the fuzzy
> answer a little bit more definite, but it's not "broken" just because
> the result is still fuzzy, right?
>
> Apologies if I'm missing a huge point here... :)

But the problem is that if you apply mm->has_pinned check on file pages,
you can get false negatives now. And that's not acceptable...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-23 10:23:25

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: Break COW for pinned pages during fork()

On Mon, Sep 21, 2020 at 05:17:39PM -0400, Peter Xu wrote:
> Finally I start to post formal patches because it's growing. And also since
> we've discussed quite some issues already, so I feel like it's clearer on what
> we need to do, and how.
>
> This series is majorly inspired by the previous discussion on the list [1],
> starting from the report from Jason on the rdma test failure. Linus proposed
> the solution, which seems to be a very nice approach to avoid the breakage of
> userspace apps that didn't use MADV_DONTFORK properly before. More information
> can be found in that thread too.
>
> I believe the initial plan was to consider merging something like this for
> rc7/rc8. However now I'm not sure due to the fact that the code change in
> copy_pte_range() is probably more than expected, so it can be with some risk.
> I'll leave this question to the reviewers...
>
> I tested it myself with fork() after vfio pinning a bunch of device pages, and
> I verified that the new copy pte logic worked as expected at least in the most
> general path. However I didn't test thp case yet because afaict vfio does not
> support thp backed dma pages. Luckily, the pmd/pud thp patch is much more
> straightforward than the pte one, so hopefully it can be directly verified by
> some code review plus some more heavy-weight rdma tests.
>
> Patch 1: Introduce mm.has_pinned (as single patch as suggested by Jason)
> Patch 2-3: Some slight rework on copy_page_range() path as preparation
> Patch 4: Early cow solution for pte copy for pinned pages
> Patch 5: Same as above, but for thp (pmd/pud).
>
> Hugetlbfs fix is still missing, but as planned, that's not urgent so we can
> work upon. Comments greatly welcomed.

Hi Peter,

I'm aware that this series is under ongoing review and probably not
final, but we tested it anyway and it solves our RDMA failures.

Thanks

>
> Thanks.
>
> Peter Xu (5):
> mm: Introduce mm_struct.has_pinned
> mm/fork: Pass new vma pointer into copy_page_range()
> mm: Rework return value for copy_one_pte()
> mm: Do early cow for pinned pages during fork() for ptes
> mm/thp: Split huge pmds/puds if they're pinned when fork()
>
> include/linux/mm.h | 2 +-
> include/linux/mm_types.h | 10 ++
> kernel/fork.c | 3 +-
> mm/gup.c | 6 ++
> mm/huge_memory.c | 26 +++++
> mm/memory.c | 226 +++++++++++++++++++++++++++++++++++----
> 6 files changed, 248 insertions(+), 25 deletions(-)
>
> --
> 2.26.2
>
>

2020-09-23 13:51:51

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Wed, Sep 23, 2020 at 11:22:05AM +0200, Jan Kara wrote:
> On Tue 22-09-20 13:01:13, John Hubbard wrote:
> > On 9/22/20 3:33 AM, Jan Kara wrote:
> > > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > > ...
> > > > > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > > + page_maybe_dma_pinned(src_page))) {
> > > >
> > > > This condition would make a good static inline function. It's used in 3
> > > > places, and the condition is quite special and worth documenting, and
> > > > having a separate function helps with that, because the function name
> > > > adds to the story. I'd suggest approximately:
> > > >
> > > > page_likely_dma_pinned()
> > > >
> > > > for the name.
> > >
> > > Well, but we should also capture that this really only works for anonymous
> > > pages. For file pages mm->has_pinned does not work because the page may be
> > > still pinned by completely unrelated process as Jann already properly
> > > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> > >
> > > Honza
> >
> > The file-backed case doesn't really change anything, though:
> > page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> > can get a false positive. Just like here, with an mm->has_pinned that
> > could be a false positive for a process.
> >
> > And for that reason, I'm also not sure an "assert PageAnon(page)" is
> > desirable. That assertion would prevent file-backed callers from being
> > able to call a function that provides a fuzzy answer, but I don't see
> > why you'd want or need to do that. The goal here is to make the fuzzy
> > answer a little bit more definite, but it's not "broken" just because
> > the result is still fuzzy, right?
> >
> > Apologies if I'm missing a huge point here... :)
>
> But the problem is that if you apply mm->has_pinned check on file pages,
> you can get false negatives now. And that's not acceptable...

Do you mean the case where proc A pinned page P from a file, then proc B mapped
the same page P on the file, then fork() on proc B?

If proc B didn't explicitly pin page P in B's address space too, shouldn't
we return "false" for page_likely_dma_pinned(P)? Because if proc B didn't pin
the page in its own address space, I'd think it's ok to get the page replaced
at any time as long as the content stays the same. Or couldn't we?

Thanks,

--
Peter Xu

2020-09-23 14:03:26

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Wed 23-09-20 09:50:04, Peter Xu wrote:
> On Wed, Sep 23, 2020 at 11:22:05AM +0200, Jan Kara wrote:
> > On Tue 22-09-20 13:01:13, John Hubbard wrote:
> > > On 9/22/20 3:33 AM, Jan Kara wrote:
> > > > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > > > ...
> > > > > > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > > > + page_maybe_dma_pinned(src_page))) {
> > > > >
> > > > > This condition would make a good static inline function. It's used in 3
> > > > > places, and the condition is quite special and worth documenting, and
> > > > > having a separate function helps with that, because the function name
> > > > > adds to the story. I'd suggest approximately:
> > > > >
> > > > > page_likely_dma_pinned()
> > > > >
> > > > > for the name.
> > > >
> > > > Well, but we should also capture that this really only works for anonymous
> > > > pages. For file pages mm->has_pinned does not work because the page may be
> > > > still pinned by completely unrelated process as Jann already properly
> > > > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > > > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> > > >
> > > > Honza
> > >
> > > The file-backed case doesn't really change anything, though:
> > > page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> > > can get a false positive. Just like here, with an mm->has_pinned that
> > > could be a false positive for a process.
> > >
> > > And for that reason, I'm also not sure an "assert PageAnon(page)" is
> > > desirable. That assertion would prevent file-backed callers from being
> > > able to call a function that provides a fuzzy answer, but I don't see
> > > why you'd want or need to do that. The goal here is to make the fuzzy
> > > answer a little bit more definite, but it's not "broken" just because
> > > the result is still fuzzy, right?
> > >
> > > Apologies if I'm missing a huge point here... :)
> >
> > But the problem is that if you apply mm->has_pinned check on file pages,
> > you can get false negatives now. And that's not acceptable...
>
> Do you mean the case where proc A pinned page P from a file, then proc B
> mapped the same page P on the file, then fork() on proc B?

Yes.

> If proc B didn't explicitly pinned page P in B's address space too,
> shouldn't we return "false" for page_likely_dma_pinned(P)? Because if
> proc B didn't pin the page in its own address space, I'd think it's ok to
> get the page replaced at any time as long as the content keeps the same.
> Or couldn't we?

So it depends on the reason why you call page_likely_dma_pinned(). For your
COW purposes the check is correct but e.g. for "can filesystem safely
writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
not objecting to the mechanism as such. I'm mainly objecting to the generic
function name, which suggests something other than what it really checks and
thus could be used in the wrong places in the future... That's why I'd
prefer to restrict the function to PageAnon pages, where there's no risk of
confusion about what the check actually does.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-23 15:27:48

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > random newly allocated pages.
> >
> > For huge PMDs, we split the huge pmd if pinning is detected. So that future
> > handling will be done by the PTE level (with our latest changes, each of the
> > small pages will be copied). We can achieve this by let copy_huge_pmd() return
> > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > finally land the next copy_pte_range() call.
> >
> > Huge PUDs will be even more special - so far it does not support anonymous
> > pages. But it can actually be done the same as the huge PMDs even if the split
> > huge PUDs means to erase the PUD entries. It'll guarantee the follow up fault
> > ins will remap the same pages in either parent/child later.
> >
> > This might not be the most efficient way, but it should be easy and clean
> > enough. It should be fine, since we're tackling with a very rare case just to
> > make sure userspaces that pinned some thps will still work even without
> > MADV_DONTFORK and after they fork()ed.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > 1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7ff29cc3d55c..c40aac0ad87e 100644
> > +++ b/mm/huge_memory.c
> > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >
> > src_page = pmd_page(pmd);
> > VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > +
> > + /*
> > + * If this page is a potentially pinned page, split and retry the fault
> > + * with smaller page size. Normally this should not happen because the
> > + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> > + * best effort that the pinned pages won't be replaced by another
> > + * random page during the coming copy-on-write.
> > + */
> > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > + page_maybe_dma_pinned(src_page))) {
> > + pte_free(dst_mm, pgtable);
> > + spin_unlock(src_ptl);
> > + spin_unlock(dst_ptl);
> > + __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > + return -EAGAIN;
> > + }
>
> Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> before doing the write protect. Seems like it might be an existing
> bug?

IMHO it's not a bug, because splitting a huge pmd should always be safe.

One thing I can think of that might be special here is when the pmd is
anonymously mapped but also shared (shared, tmpfs thp, I think?); then here
we'll also mark it as wrprotected even if we don't need to (or maybe we need it
for some reason..). But again I think it's safe anyway - when a page fault
happens, wp_huge_pmd() should split it into smaller pages unconditionally. I
just don't know whether that's the ideal way for the shared case. Andrea should
definitely know it better (because it has been there since the first day of thp).

>
> In any event, the has_pinned logic shouldn't be used without also
> checking is_cow_mapping(), so it should be added to that test. Same
> remarks for PUD

I think the case mentioned above is also the special case here when we didn't
check is_cow_mapping(). The major difference is whether we'll split the page
right now, or postpone it until the next write to each mm. But yes, I think
maybe I'd better still keep the is_cow_mapping() check to be explicit.

Thanks,

--
Peter Xu

2020-09-23 15:39:19

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: Break COW for pinned pages during fork()

On Wed, Sep 23, 2020 at 01:21:19PM +0300, Leon Romanovsky wrote:
> Hi Peter,
>
> I'm ware that this series is under ongoing review and probably not
> final, but we tested anyway and it solves our RDMA failures.

Hi, Leon,

Yes, I think there'll definitely be more version(s), but thank you for the
quick follow-up! This is very valuable information; it's good to know that
we at least have a good base point.

Thanks,

--
Peter Xu

2020-09-23 15:45:58

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Wed, Sep 23, 2020 at 04:01:14PM +0200, Jan Kara wrote:
> On Wed 23-09-20 09:50:04, Peter Xu wrote:
> > On Wed, Sep 23, 2020 at 11:22:05AM +0200, Jan Kara wrote:
> > > On Tue 22-09-20 13:01:13, John Hubbard wrote:
> > > > On 9/22/20 3:33 AM, Jan Kara wrote:
> > > > > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > > > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > > > > ...
> > > > > > > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > > > > + page_maybe_dma_pinned(src_page))) {
> > > > > >
> > > > > > This condition would make a good static inline function. It's used in 3
> > > > > > places, and the condition is quite special and worth documenting, and
> > > > > > having a separate function helps with that, because the function name
> > > > > > adds to the story. I'd suggest approximately:
> > > > > >
> > > > > > page_likely_dma_pinned()
> > > > > >
> > > > > > for the name.
> > > > >
> > > > > Well, but we should also capture that this really only works for anonymous
> > > > > pages. For file pages mm->has_pinned does not work because the page may be
> > > > > still pinned by completely unrelated process as Jann already properly
> > > > > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > > > > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> > > > >
> > > > > Honza
> > > >
> > > > The file-backed case doesn't really change anything, though:
> > > > page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> > > > can get a false positive. Just like here, with an mm->has_pinned that
> > > > could be a false positive for a process.
> > > >
> > > > And for that reason, I'm also not sure an "assert PageAnon(page)" is
> > > > desirable. That assertion would prevent file-backed callers from being
> > > > able to call a function that provides a fuzzy answer, but I don't see
> > > > why you'd want or need to do that. The goal here is to make the fuzzy
> > > > answer a little bit more definite, but it's not "broken" just because
> > > > the result is still fuzzy, right?
> > > >
> > > > Apologies if I'm missing a huge point here... :)
> > >
> > > But the problem is that if you apply mm->has_pinned check on file pages,
> > > you can get false negatives now. And that's not acceptable...
> >
> > Do you mean the case where proc A pinned page P from a file, then proc B
> > mapped the same page P on the file, then fork() on proc B?
>
> Yes.
>
> > If proc B didn't explicitly pinned page P in B's address space too,
> > shouldn't we return "false" for page_likely_dma_pinned(P)? Because if
> > proc B didn't pin the page in its own address space, I'd think it's ok to
> > get the page replaced at any time as long as the content keeps the same.
> > Or couldn't we?
>
> So it depends on the reason why you call page_likely_dma_pinned(). For your
> COW purposes the check is correct but e.g. for "can filesystem safely
> writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
> not objecting to the mechanism as such. I'm mainly objecting to the generic
> function name which suggests something else than what it really checks and
> thus it could be used in wrong places in the future... That's why I'd
> prefer to restrict the function to PageAnon pages where there's no risk of
> confusion what the check actually does.

How about I introduce the helper as John suggested, but rename it to

page_maybe_dma_pinned_by_mm()

?

Then we also don't need to judge which is more likely to happen (between
"maybe" and "likely", since that would confuse me if I only read those words..).

I didn't use any extra suffix like "cow" because I think it might be useful for
things besides cow. Fundamentally the new helper will be mm-based, so "by_mm"
seems to suit me better.

Does that sound ok?
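
Roughly it would be a thin wrapper like this (just a sketch of the shape;
whether to also assert PageAnon() as you suggested is still open):

static inline bool page_maybe_dma_pinned_by_mm(struct mm_struct *mm,
					       struct page *page)
{
	return READ_ONCE(mm->has_pinned) && page_maybe_dma_pinned(page);
}

Then the three callers would just pass in src_mm and the page.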

--
Peter Xu

2020-09-23 16:08:07

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Mon, Sep 21, 2020 at 11:41:16PM -0700, John Hubbard wrote:
> On 9/21/20 2:20 PM, Peter Xu wrote:
> ...
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7ff29cc3d55c..c40aac0ad87e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > src_page = pmd_page(pmd);
> > VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > +
> > + /*
> > + * If this page is a potentially pinned page, split and retry the fault
> > + * with smaller page size. Normally this should not happen because the
> > + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> > + * best effort that the pinned pages won't be replaced by another
> > + * random page during the coming copy-on-write.
> > + */
> > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > + page_maybe_dma_pinned(src_page))) {

[...]

> > + pte_free(dst_mm, pgtable);
> > + spin_unlock(src_ptl);
> > + spin_unlock(dst_ptl);
> > + __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > + return -EAGAIN;
> > + }
>
>
> Why wait until we are so deep into this routine to detect this and unwind?
> It seems like if you could do a check near the beginning of this routine, and
> handle it there, with less unwinding? In fact, after taking only the src_ptl,
> the check could be made, right?

Because that's where we've fetched the page from the pmd, so I can directly
reference src_page. Also I think I at least need to check against swp entries
first? So it still seems easier to keep it here, considering it's an unlikely path.

Thanks,

--
Peter Xu

2020-09-23 16:09:40

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Wed, Sep 23, 2020 at 8:26 AM Peter Xu <[email protected]> wrote:
>
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > > random newly allocated pages.
> > >
> > > For huge PMDs, we split the huge pmd if pinning is detected. So that future
> > > handling will be done by the PTE level (with our latest changes, each of the
> > > small pages will be copied). We can achieve this by let copy_huge_pmd() return
> > > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > > finally land the next copy_pte_range() call.
> > >
> > > Huge PUDs will be even more special - so far it does not support anonymous
> > > pages. But it can actually be done the same as the huge PMDs even if the split
> > > huge PUDs means to erase the PUD entries. It'll guarantee the follow up fault
> > > ins will remap the same pages in either parent/child later.
> > >
> > > This might not be the most efficient way, but it should be easy and clean
> > > enough. It should be fine, since we're tackling with a very rare case just to
> > > make sure userspaces that pinned some thps will still work even without
> > > MADV_DONTFORK and after they fork()ed.
> > >
> > > Signed-off-by: Peter Xu <[email protected]>
> > > mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > > 1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > > src_page = pmd_page(pmd);
> > > VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > + /*
> > > + * If this page is a potentially pinned page, split and retry the fault
> > > + * with smaller page size. Normally this should not happen because the
> > > + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> > > + * best effort that the pinned pages won't be replaced by another
> > > + * random page during the coming copy-on-write.
> > > + */
> > > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > + page_maybe_dma_pinned(src_page))) {
> > > + pte_free(dst_mm, pgtable);
> > > + spin_unlock(src_ptl);
> > > + spin_unlock(dst_ptl);
> > > + __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > + return -EAGAIN;
> > > + }
> >
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write protect. Seems like it might be an existing
> > bug?
>
> IMHO it's not a bug, because splitting a huge pmd should always be safe.
>
> One thing I can think of that might be special here is when the pmd is
> anonymously mapped but also shared (shared, tmpfs thp, I think?), then here
> we'll also mark it as wrprotected even if we don't need to (or maybe we need it
> for some reason..). But again I think it's safe anyways - when page fault

For a tmpfs mapping, the pmd split just clears the pmd entry without
reinstalling ptes (whereas an anonymous mapping would have the ptes
reinstalled). It looks like this patch intends to copy at the pte level by
splitting the pmd, but I'm afraid this may not work for tmpfs mappings.

> happens, wp_huge_pmd() should split it into smaller pages unconditionally. I
> just don't know whether it's the ideal way for the shared case. Andrea should
> definitely know it better (because it is there since the 1st day of thp).
>
> >
> > In any event, the has_pinned logic shouldn't be used without also
> > checking is_cow_mapping(), so it should be added to that test. Same
> > remarks for PUD
>
> I think the case mentioned above is also the special case here when we didn't
> check is_cow_mapping(). The major difference is whether we'll split the page
> right now, or postpone it until the next write to each mm. But I think, yes,
> maybe I should better still keep the is_cow_mapping() to be explicit.
>
> Thanks,
>
> --
> Peter Xu
>

2020-09-23 17:21:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Wed, Sep 23, 2020 at 11:24:09AM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > > random newly allocated pages.
> > >
> > > For huge PMDs, we split the huge pmd if pinning is detected. So that future
> > > handling will be done by the PTE level (with our latest changes, each of the
> > > small pages will be copied). We can achieve this by let copy_huge_pmd() return
> > > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > > finally land the next copy_pte_range() call.
> > >
> > > Huge PUDs will be even more special - so far it does not support anonymous
> > > pages. But it can actually be done the same as the huge PMDs even if the split
> > > huge PUDs means to erase the PUD entries. It'll guarantee the follow up fault
> > > ins will remap the same pages in either parent/child later.
> > >
> > > This might not be the most efficient way, but it should be easy and clean
> > > enough. It should be fine, since we're tackling with a very rare case just to
> > > make sure userspaces that pinned some thps will still work even without
> > > MADV_DONTFORK and after they fork()ed.
> > >
> > > Signed-off-by: Peter Xu <[email protected]>
> > > mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > > 1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > > src_page = pmd_page(pmd);
> > > VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > + /*
> > > + * If this page is a potentially pinned page, split and retry the fault
> > > + * with smaller page size. Normally this should not happen because the
> > > + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> > > + * best effort that the pinned pages won't be replaced by another
> > > + * random page during the coming copy-on-write.
> > > + */
> > > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > + page_maybe_dma_pinned(src_page))) {
> > > + pte_free(dst_mm, pgtable);
> > > + spin_unlock(src_ptl);
> > > + spin_unlock(dst_ptl);
> > > + __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > + return -EAGAIN;
> > > + }
> >
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write protect. Seems like it might be an existing
> > bug?
>
> IMHO it's not a bug, because splitting a huge pmd should always be safe.

Sure, splitting is safe, but testing has_pinned without checking COW is
not, for the reasons Jann explained.

The 'maybe' in page_maybe_dma_pinned() means it can return true when
the correct answer is false. It can never return false when the
correct answer is true.

It is the same when has_pinned is involved, the combined expression
must never return false when true is correct. Which means it can only
be applied for COW cases.
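
Concretely, for the PMD path that would mean something like (sketch only, not
a diff against the series):

	if (unlikely(is_cow_mapping(vma->vm_flags) &&
		     READ_ONCE(src_mm->has_pinned) &&
		     page_maybe_dma_pinned(src_page))) {
		/* split the huge pmd and retry the copy at PTE level */
	}

ie the has_pinned/maybe_pinned test is only ever reached for COW mappings.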

Jason

2020-09-23 17:27:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On Mon, Sep 21, 2020 at 2:18 PM Peter Xu <[email protected]> wrote:
>
> There's one special path for copy_one_pte() with swap entries, in which
> add_swap_count_continuation(GFP_ATOMIC) might fail. In that case we'll return
> the swp_entry_t so that the caller will release the locks and redo the same
> thing with GFP_KERNEL.
>
> It's confusing when copy_one_pte() must return a swp_entry_t (even if all the
> ptes are non-swap entries). More importantly, we face other requirement to
> extend this "we need to do something else, but without the locks held" case.
>
> Rework the return value into something easier to understand, as defined in enum
> copy_mm_ret. We'll pass the swp_entry_t back using the newly introduced union
> copy_mm_data parameter.

Ok, I'm reading this series, and I do hate this.

And I think it's unnecessary.

There's a very simple way to avoid this all: split out the
"!pte_present(pte)" case from the function entirely.

That actually makes the code much more legible: that non-present case
is very different, and it's also unlikely() and causes deeper
indentation etc.

Because it's unlikely, it probably also shouldn't be inline.

That unlikely case is also why we then have that special
"out_set_pte" label, which should just go away and be copied into the
(now uninlined) function.

Once that re-organization has been done, the second step is to then
just move the "pte_present()" check into the caller, and suddenly all
the ugly return value games will go entirely away.
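
In shorthand the caller side then ends up looking something like this (the
helper names here are illustrative; see the attached patches for the real
thing):

	if (unlikely(!pte_present(*src_pte))) {
		entry.val = copy_nonpresent_pte(dst_mm, src_mm, dst_pte,
						src_pte, vma, addr, rss);
		if (entry.val)
			break;
		progress += 8;
		continue;
	}
	/* the present case never needs to return a swp_entry_t at all */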

I'm attaching the two patches that do this here, but I do want to note
how that first patch is much more legible with "--ignore-all-space",
and then you really see that the diff is a _pure_ code movement thing.
Otherwise it looks like it's doing a big change.

Comments?

NOTE! The intent here is that now we can easily add new argument (a
pre-allocated page or NULL) and a return value to
"copy_present_page()": it can return "I needed a temporary page but
you hadn't allocated one yet" or "I used up the temporary page you
gave me" or "all good, keep the temporary page around for the future".

But these two patches are very intentionally meant to be just "this
clearly changes NO semantics at all".

Linus


Attachments:
0001-mm-split-out-the-non-present-case-from-copy_one_pte.patch (6.56 kB)
0002-mm-move-the-copy_one_pte-pte_present-check-into-the-.patch (2.77 kB)

2020-09-23 20:20:50

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On 9/23/20 8:44 AM, Peter Xu wrote:
> On Wed, Sep 23, 2020 at 04:01:14PM +0200, Jan Kara wrote:
>> On Wed 23-09-20 09:50:04, Peter Xu wrote:
...
>>>> But the problem is that if you apply mm->has_pinned check on file pages,
>>>> you can get false negatives now. And that's not acceptable...
>>>
>>> Do you mean the case where proc A pinned page P from a file, then proc B
>>> mapped the same page P on the file, then fork() on proc B?
>>
>> Yes.

aha, thanks for spelling out the false negative problem.

>>
>>> If proc B didn't explicitly pinned page P in B's address space too,
>>> shouldn't we return "false" for page_likely_dma_pinned(P)? Because if
>>> proc B didn't pin the page in its own address space, I'd think it's ok to
>>> get the page replaced at any time as long as the content keeps the same.
>>> Or couldn't we?
>>
>> So it depends on the reason why you call page_likely_dma_pinned(). For your
>> COW purposes the check is correct but e.g. for "can filesystem safely
>> writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
>> not objecting to the mechanism as such. I'm mainly objecting to the generic
>> function name which suggests something else than what it really checks and
>> thus it could be used in wrong places in the future... That's why I'd
>> prefer to restrict the function to PageAnon pages where there's no risk of
>> confusion what the check actually does.
>
> How about I introduce the helper as John suggested, but rename it to
>
> page_maybe_dma_pinned_by_mm()
>
> ?
>
> Then we also don't need to judge on which is more likely to happen (between
> "maybe" and "likely", since that will confuse me if I only read these words..).
>

You're right, it is too subtle of a distinction after all. I agree that sticking
with "_maybe_" avoids that confusion.


> I didn't use any extra suffix like "cow" because I think it might be useful for
> things besides cow. Fundamentally the new helper will be mm-based, so "by_mm"
> seems to suite better to me.
>
> Does that sound ok?
>

Actually, Jan nailed it. I just wasn't understanding his scenario, but now that
I do, and considering your other point about wording, I think we end up with:

anon_page_maybe_pinned()

as a pretty good name for a helper function. (We don't want "_mm" because that
refers more to the mechanism used internally, rather than the behavior of the
function. "anon_" adds more meaning.)

...now I better go and try to grok what Jason is recommending for the new
meaning of FOLL_PIN, in another tributary of this thread. I don't *think* it affects
this naming point, though. :)

thanks,
--
John Hubbard
NVIDIA

2020-09-23 20:30:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Tue, Sep 22, 2020 at 6:03 PM Peter Xu <[email protected]> wrote:
>
> > If we rely on "copy_ret == COPY_MM_BREAK_COW" we can unify "again" and
> > "again_break_cow", we don't need to clear ->cow_new_page, this makes the
> > logic more understandable. To me at least ;)
>
> I see your point. I'll definitely try it out. I think I'll at least use what
> you preferred above since it's actually the same as before, logically. Then
> I'll consider drop the again_break_cow, as long as I'm still as confident after
> I do the change on not leaking anything :).

So the two patches I sent out to re-organize copy_one_pte() were
literally meant to make all this mess go away.

IOW, the third patch would be something (COMPLETELY UNTESTED) like the attached.

I think the logic for the preallocation is fairly obvious, but it
might be better to allocate a batch of pages for all I know. That
said, I can't really make myself care about the performance of a
fork() after you've pinned pages in it, so..

Linus


Attachments:
patch (2.87 kB)

2020-09-23 21:33:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

On Wed, Sep 23, 2020 at 10:16 AM Linus Torvalds
<[email protected]> wrote:
>
> But these two patches are very intentionally meant to be just "this
> clearly changes NO semantics at all".

The more I look at these, the more I go "this is a cleanup
regardless", so I'll just keep thes in my tree as-is.

Linus

2020-09-24 12:51:13

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On 22.09.2020 00:20, Peter Xu wrote:
> This patch is greatly inspired by the discussions on the list from Linus, Jason
> Gunthorpe and others [1].
>
> It allows copy_pte_range() to do early cow if the pages were pinned on the
> source mm. Currently we don't have an accurate way to know whether a page is
> pinned or not. The only thing we have is page_maybe_dma_pinned(). However
> that's good enough for now. Especially, with the newly added mm->has_pinned
> flag to make sure we won't affect processes that never pinned any pages.
>
> It would be easier if we can do GFP_KERNEL allocation within copy_one_pte().
> Unluckily, we can't because we're with the page table locks held for both the
> parent and child processes. So the page copy process needs to be done outside
> copy_one_pte().
>
> The new COPY_MM_BREAK_COW is introduced for this - copy_one_pte() would return
> this when it finds any pte that may need an early breaking of cow.
>
> page_duplicate() is used to handle the page copy process in copy_pte_range().
> Of course we need to do that after releasing of the locks.
>
> The slightly tricky part is page_duplicate() will fill in the copy_mm_data with
> the new page copied and we'll need to re-install the pte again with page table
> locks held again. That's done in pte_install_copied_page().
>
> The whole procedure looks quite similar to wp_page_copy() however it's simpler
> because we know the page is special (pinned) and we know we don't need tlb
> flushings because no one is referencing the new mm yet.
>
> Though we still have to be very careful on maintaining the two pages (one old
> source page, one new allocated page) across all these lock taking/releasing
> process and make sure neither of them will get lost.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/memory.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 167 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 1530bb1070f4..8f3521be80ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -691,12 +691,72 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>
> #define COPY_MM_DONE 0
> #define COPY_MM_SWAP_CONT 1
> +#define COPY_MM_BREAK_COW 2
>
> struct copy_mm_data {
> /* COPY_MM_SWAP_CONT */
> swp_entry_t entry;
> + /* COPY_MM_BREAK_COW */
> + struct {
> + struct page *cow_old_page; /* Released by page_duplicate() */
> + struct page *cow_new_page; /* Released by page_release_cow() */
> + pte_t cow_oldpte;
> + };
> };
>
> +static inline void page_release_cow(struct copy_mm_data *data)
> +{
> + /* The old page should only be released in page_duplicate() */
> + WARN_ON_ONCE(data->cow_old_page);
> +
> + if (data->cow_new_page) {
> + put_page(data->cow_new_page);
> + data->cow_new_page = NULL;
> + }
> +}
> +
> +/*
> + * Duplicate the page for this PTE. Returns zero if page copied (so we need to
> + * retry on the same PTE again to arm the copied page very soon), or negative
> + * if error happened. In all cases, the old page will be properly released.
> + */
> +static int page_duplicate(struct mm_struct *src_mm, struct vm_area_struct *vma,
> + unsigned long address, struct copy_mm_data *data)
> +{
> + struct page *new_page = NULL;
> + int ret;
> +
> + /* This should have been set in change_one_pte() when reach here */
> + WARN_ON_ONCE(!data->cow_old_page);

Although WARN() is preferred over BUG() in the kernel, it looks a little strange
that we catch a WARN once here, but later avoid a panic in put_page().

> + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> + if (!new_page) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + copy_user_highpage(new_page, data->cow_old_page, address, vma);
> + ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);

All failing operations should go first, while copy_user_highpage() should go last.
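
I.e. roughly (sketch, using the same names as the quoted hunk):

	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
	if (!new_page) {
		ret = -ENOMEM;
		goto out;
	}
	if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) {
		put_page(new_page);
		ret = -ENOMEM;
		goto out;
	}
	/* only copy once nothing can fail any more */
	copy_user_highpage(new_page, data->cow_old_page, address, vma);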

> + if (ret) {
> + put_page(new_page);
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + cgroup_throttle_swaprate(new_page, GFP_KERNEL);
> + __SetPageUptodate(new_page);
> +
> + /* So far so good; arm the new page for the next attempt */
> + data->cow_new_page = new_page;
> +
> +out:
> + /* Always release the old page */
> + put_page(data->cow_old_page);
> + data->cow_old_page = NULL;
> +
> + return ret;
> +}
> +
> /*
> * copy one vm_area from one task to the other. Assumes the page tables
> * already present in the new task to be cleared in the whole range
> @@ -711,6 +771,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> unsigned long vm_flags = vma->vm_flags;
> pte_t pte = *src_pte;
> struct page *page;
> + bool wp;
>
> /* pte contains position in swap or file, so copy. */
> if (unlikely(!pte_present(pte))) {
> @@ -789,10 +850,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> * If it's a COW mapping, write protect it both
> * in the parent and the child
> */
> - if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> - pte = pte_wrprotect(pte);
> - }
> + wp = is_cow_mapping(vm_flags) && pte_write(pte);
>
> /*
> * If it's a shared mapping, mark it clean in
> @@ -813,15 +871,80 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> page = vm_normal_page(vma, addr, pte);
> if (page) {
> get_page(page);
> +
> + /*
> + * If the page is pinned in source mm, do early cow right now
> + * so that the pinned page won't be replaced by another random
> + * page without being noticed after the fork().
> + *
> + * Note: there can be some very rare cases that we'll do
> + * unnecessary cow here, because page_maybe_dma_pinned() is
> + * sometimes bogus, and the has_pinned flag is currently aggressive
> + * too. However this should be good enough for us for now as
> + * long as we covered all the pinned pages. We can make this
> + * better in the future by providing an accurate accounting for
> + * pinned pages.
> + *
> + * Because we'll need to release the locks before doing cow,
> + * pass this work to upper layer.
> + */
> + if (READ_ONCE(src_mm->has_pinned) && wp &&
> + page_maybe_dma_pinned(page)) {
> + /* We've got the page already; we're safe */
> + data->cow_old_page = page;
> + data->cow_oldpte = *src_pte;
> + return COPY_MM_BREAK_COW;
> + }
> +
> page_dup_rmap(page, false);
> rss[mm_counter(page)]++;
> }
>
> + if (wp) {
> + ptep_set_wrprotect(src_mm, addr, src_pte);
> + pte = pte_wrprotect(pte);
> + }
> +
> out_set_pte:
> set_pte_at(dst_mm, addr, dst_pte, pte);
> return COPY_MM_DONE;
> }
>
> +/*
> + * Install the pte with the copied page stored in `data'. Returns true when
> + * installation completes, or false when src pte has changed.
> + */
> +static int pte_install_copied_page(struct mm_struct *dst_mm,
> + struct vm_area_struct *new,
> + pte_t *src_pte, pte_t *dst_pte,
> + unsigned long addr, int *rss,
> + struct copy_mm_data *data)
> +{
> + struct page *new_page = data->cow_new_page;
> + pte_t entry;
> +
> + if (!pte_same(*src_pte, data->cow_oldpte)) {
> + /* PTE has changed under us. Release the page and retry */
> + page_release_cow(data);
> + return false;
> + }
> +
> + entry = mk_pte(new_page, new->vm_page_prot);
> + entry = pte_sw_mkyoung(entry);
> + entry = maybe_mkwrite(pte_mkdirty(entry), new);
> + page_add_new_anon_rmap(new_page, new, addr, false);
> + set_pte_at(dst_mm, addr, dst_pte, entry);
> + rss[mm_counter(new_page)]++;
> +
> + /*
> + * Manually clear the new page pointer since we've moved ownership to
> + * the newly armed PTE.
> + */
> + data->cow_new_page = NULL;
> +
> + return true;
> +}
> +
> static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> struct vm_area_struct *new,
> @@ -830,16 +953,23 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte_t *orig_src_pte, *orig_dst_pte;
> pte_t *src_pte, *dst_pte;
> spinlock_t *src_ptl, *dst_ptl;
> - int progress, copy_ret = COPY_MM_DONE;
> + int progress, ret, copy_ret = COPY_MM_DONE;
> int rss[NR_MM_COUNTERS];
> struct copy_mm_data data;
>
> again:
> + /* We don't reset this for COPY_MM_BREAK_COW */
> + memset(&data, 0, sizeof(data));
> +
> +again_break_cow:
> init_rss_vec(rss);
>
> dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> - if (!dst_pte)
> + if (!dst_pte) {
> + /* Guarantee that the new page is released if there is one */
> + page_release_cow(&data);
> return -ENOMEM;
> + }
> src_pte = pte_offset_map(src_pmd, addr);
> src_ptl = pte_lockptr(src_mm, src_pmd);
> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> break;
> }
> +
> + if (unlikely(data.cow_new_page)) {
> + /*
> + * If cow_new_page set, we must be at the 2nd round of
> + * a previous COPY_MM_BREAK_COW. Try to arm the new
> + * page now. Note that in all cases page_release_cow()
> + * will properly release the objects in copy_mm_data.
> + */
> + WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> + if (pte_install_copied_page(dst_mm, new, src_pte,
> + dst_pte, addr, rss,
> + &data)) {

It looks a little confusing that all the helpers in this function return 0 in case of success,
while pte_install_copied_page() returns true. Wouldn't it be better to return 0 and -EAGAIN
from it instead?

> + /* We installed the pte successfully; move on */
> + progress++;
> + continue;
> + }
> + /* PTE changed. Retry this pte (falls through) */
> + }
> +
> if (pte_none(*src_pte)) {
> progress++;
> continue;
> @@ -882,8 +1031,19 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> return -ENOMEM;
> break;
> - default:
> + case COPY_MM_BREAK_COW:
> + /* Do accounting onto parent mm directly */
> + ret = page_duplicate(src_mm, vma, addr, &data);
> + if (ret)
> + return ret;
> + goto again_break_cow;
> + case COPY_MM_DONE:
> + /* This means we're all good. */
> break;
> + default:
> + /* This should mean copy_ret < 0. Time to fail this fork().. */
> + WARN_ON_ONCE(copy_ret >= 0);
> + return copy_ret;
> }
>
> if (addr != end)
>

2020-09-24 15:12:53

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Wed, Sep 23, 2020 at 01:25:52PM -0700, Linus Torvalds wrote:
> IOW, the third patch would be something (COMPLETELY UNTESTED) like the attached.

Thanks. I'll rework on top.

--
Peter Xu

2020-09-24 15:18:32

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes

On Thu, Sep 24, 2020 at 02:48:00PM +0300, Kirill Tkhai wrote:
> > +/*
> > + * Duplicate the page for this PTE. Returns zero if page copied (so we need to
> > + * retry on the same PTE again to arm the copied page very soon), or negative
> > + * if error happened. In all cases, the old page will be properly released.
> > + */
> > +static int page_duplicate(struct mm_struct *src_mm, struct vm_area_struct *vma,
> > + unsigned long address, struct copy_mm_data *data)
> > +{
> > + struct page *new_page = NULL;
> > + int ret;
> > +
> > + /* This should have been set in copy_one_pte() when we reach here */
> > + WARN_ON_ONCE(!data->cow_old_page);
>
> Although WARN() is preferred over BUG() in the kernel, it looks a little strange that
> we WARN once here, but then don't avoid the panic in put_page() below.

Do you mean "it'll panic in put_page()"? I agree if so; it seems this
WARN_ON_ONCE() won't help much.

>
> > + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> > + if (!new_page) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + copy_user_highpage(new_page, data->cow_old_page, address, vma);
> > + ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);
>
> All failing operations should go first, while copy_user_highpage() should go last.

Since I'll rebase onto Linus's patch, I'll move this into the critical section,
because the preallocated page can be used by any pte after that. The spin
locks will need to be held longer for that, but I'm assuming that's not a problem
for an unlikely path.

> > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > break;
> > }
> > +
> > + if (unlikely(data.cow_new_page)) {
> > + /*
> > + * If cow_new_page set, we must be at the 2nd round of
> > + * a previous COPY_MM_BREAK_COW. Try to arm the new
> > + * page now. Note that in all cases page_release_cow()
> > + * will properly release the objects in copy_mm_data.
> > + */
> > + WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > + if (pte_install_copied_page(dst_mm, new, src_pte,
> > + dst_pte, addr, rss,
> > + &data)) {
>
> It looks a little confusing that all the helpers in this function return 0 in case of success,
> while pte_install_copied_page() returns true. Wouldn't it be better to return 0 and -EAGAIN
> from it instead?

IMHO it's fine as long as no real errno will be returned from the new helper.
But I have no strong opinion either; I'll see what I can do after the rebase.

Thanks for reviewing the patch even if it's going away.

--
Peter Xu

2020-09-24 15:49:29

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Wed, Sep 23, 2020 at 09:07:49AM -0700, Yang Shi wrote:
> For tmpfs map, the pmd split just clears the pmd entry without
> reinstalling ptes (oppositely anonymous map would reinstall ptes). It
> looks this patch intends to copy at pte level by splitting pmd. But
> I'm afraid this may not work for tmpfs mappings.

IIUC that's exactly what we want.

We only want to make sure the pinned tmpfs shared pages will be kept there in
the parent. It's not a must to copy the pages to the child, as long as they
can be faulted in later correctly.
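Just to recap, the copy_huge_pmd() hunk of patch 5 boils down to something like
the below (a rough, untested paraphrase rather than the exact code):

	/*
	 * If the huge page may be pinned in the source mm, don't try to
	 * write-protect and share it; drop the locks, split the pmd and let
	 * copy_pmd_range() retry at pte level.  For file-backed (e.g. tmpfs)
	 * mappings the split simply clears the pmd, and the child will
	 * refault the pages later.
	 */
	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
		     page_maybe_dma_pinned(pmd_page(pmd)))) {
		pte_free(dst_mm, pgtable);
		spin_unlock(src_ptl);
		spin_unlock(dst_ptl);
		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
		return -EAGAIN;
	}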

--
Peter Xu

2020-09-24 17:32:34

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Thu, Sep 24, 2020 at 8:47 AM Peter Xu <[email protected]> wrote:
>
> On Wed, Sep 23, 2020 at 09:07:49AM -0700, Yang Shi wrote:
> > For tmpfs map, the pmd split just clears the pmd entry without
> > reinstalling ptes (oppositely anonymous map would reinstall ptes). It
> > looks this patch intends to copy at pte level by splitting pmd. But
> > I'm afraid this may not work for tmpfs mappings.
>
> IIUC that's exactly what we want.
>
> We only want to make sure the pinned tmpfs shared pages will be kept there in
> the parent. It's not a must to copy the pages to the child, as long as they
> can be faulted in later correctly.

Aha, got your point. Yes, they can be refaulted in later. This is how
the file THP pmd split was designed.

>
> --
> Peter Xu
>

2020-09-24 18:53:30

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

On Wed, Sep 23, 2020 at 01:19:08PM -0700, John Hubbard wrote:
> On 9/23/20 8:44 AM, Peter Xu wrote:
> > On Wed, Sep 23, 2020 at 04:01:14PM +0200, Jan Kara wrote:
> > > On Wed 23-09-20 09:50:04, Peter Xu wrote:
> ...
> > > > > But the problem is that if you apply mm->has_pinned check on file pages,
> > > > > you can get false negatives now. And that's not acceptable...
> > > >
> > > > Do you mean the case where proc A pinned page P from a file, then proc B
> > > > mapped the same page P on the file, then fork() on proc B?
> > >
> > > Yes.
>
> aha, thanks for spelling out the false negative problem.
>
> > >
> > > > If proc B didn't explicitly pinned page P in B's address space too,
> > > > shouldn't we return "false" for page_likely_dma_pinned(P)? Because if
> > > > proc B didn't pin the page in its own address space, I'd think it's ok to
> > > > get the page replaced at any time as long as the content keeps the same.
> > > > Or couldn't we?
> > >
> > > So it depends on the reason why you call page_likely_dma_pinned(). For your
> > > COW purposes the check is correct but e.g. for "can filesystem safely
> > > writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
> > > not objecting to the mechanism as such. I'm mainly objecting to the generic
> > > function name which suggests something else than what it really checks and
> > > thus it could be used in wrong places in the future... That's why I'd
> > > prefer to restrict the function to PageAnon pages where there's no risk of
> > > confusion what the check actually does.
> >
> > How about I introduce the helper as John suggested, but rename it to
> >
> > page_maybe_dma_pinned_by_mm()
> >
> > ?
> >
> > Then we also don't need to judge on which is more likely to happen (between
> > "maybe" and "likely", since that will confuse me if I only read these words..).
> >
>
> You're right, it is too subtle of a distinction after all. I agree that sticking
> with "_maybe_" avoids that confusion.
>
>
> > I didn't use any extra suffix like "cow" because I think it might be useful for
> > things besides cow. Fundamentally the new helper will be mm-based, so "by_mm"
> > seems to suite better to me.
> >
> > Does that sound ok?
> >
>
> Actually, Jan nailed it. I just wasn't understanding his scenario, but now that
> I do, and considering your other point about wording, I think we end up with:
>
> anon_page_maybe_pinned()
>
> as a pretty good name for a helper function. (We don't want "_mm" because that
> refers more to the mechanism used internally, rather than the behavior of the
> function. "anon_" adds more meaning.)

Actually that was really my intention when I suggested "_by_mm", because IMHO the
new helper actually means "whether this page may be pinned by _this mm_ (not
any other address space)". IOW, the case that Jan mentioned with the shared page
is covered here: although that page was pinned, it was not pinned "by this mm"
for e.g. proc B above.

Though I've no strong opinion either. I'll start with anon_page_maybe_pinned().
To me it's probably more important to prepare the next spin first and see
whether we'd still like it for this release.
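For clarity, the helper I have in mind is roughly the below (untested sketch;
the PageAnon() restriction follows Jan's suggestion):

static inline bool anon_page_maybe_pinned(struct mm_struct *mm,
					  struct page *page)
{
	/*
	 * has_pinned is never cleared once set, and page_maybe_dma_pinned()
	 * can report false positives for heavily mapped pages, so this can
	 * only be used where an occasional unnecessary COW is acceptable.
	 */
	return PageAnon(page) &&
	       READ_ONCE(mm->has_pinned) &&
	       page_maybe_dma_pinned(page);
}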

Thanks,

--
Peter Xu