2021-02-03 21:14:13

by Peter Xu

Subject: [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups

As reported by Gal [1], we still miss the code to handle early cow for the
hugetlb case, which is true. Again, it still feels odd to me to fork() after
using a few huge pages, especially if they're privately mapped.. However, I
do agree with Gal and Jason that we should still have it, since that will at
least complete the early cow on fork effort, and it will still fix issues
where buffers are not well under control and MADV_DONTFORK is not easy to
apply.
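For context, the MADV_DONTFORK workaround is applied from userspace roughly
as below (a minimal sketch only: an ordinary anonymous mapping stands in for
the real DMA buffer, and alloc_dontfork_buffer is a made-up helper name):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map an anonymous private buffer and exclude it from the child at
 * fork() time, so a later fork() cannot trigger COW on pinned pages. */
static void *alloc_dontfork_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```

The series targets exactly the cases where such buffers are not under the
application's control, so this call cannot be made.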



The first two patches (1-2) are some cleanups I noticed when reading into
the hugetlb reserve map code. I think they're good to have, but they're not
necessary for fixing the fork issue.

The last two patches (3-4) are the real fix.



I tested this with a fork() after some vfio-pci assignment, so I'm pretty
sure the page copy path can trigger (the page will be accounted right after
the fork()), but I didn't do a data check since the card I assigned is some
random nic. Gal, please feel free to try this if you have a better way to
verify the series.



https://github.com/xzpeter/linux/tree/fork-cow-pin-huge



Please review, thanks!



[1] https://lore.kernel.org/lkml/[email protected]/



Peter Xu (4):

hugetlb: Dedup the code to add a new file_region

hugetlb: Break earlier in add_reservation_in_range() when we can

mm: Introduce page_needs_cow_for_dma() for deciding whether cow

hugetlb: Do early cow when page pinned on src mm



 include/linux/mm.h |  21 ++++++++
 mm/huge_memory.c   |   8 +--
 mm/hugetlb.c       | 129 ++++++++++++++++++++++++++++++++++-----------
 mm/internal.h      |   5 --
 mm/memory.c        |   7 +--
 5 files changed, 123 insertions(+), 47 deletions(-)



--

2.26.2





2021-02-03 21:15:39

by Peter Xu

Subject: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm

This is the last missing piece of the COW-during-fork effort for when pinned
pages are found. See commit 70e806e4e645 ("mm: Do early cow for pinned pages
during fork() for ptes", 2020-09-27) for more information; we do a similar
thing here, but for hugetlb rather than pte.

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9e6ea96bf33b..931bf1a81c16 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3734,11 +3734,27 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
return false;
}

+static void
+hugetlb_copy_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
+ struct page *old_page, struct page *new_page)
+{
+ struct hstate *h = hstate_vma(vma);
+ unsigned int psize = pages_per_huge_page(h);
+
+ copy_user_huge_page(new_page, old_page, addr, vma, psize);
+ __SetPageUptodate(new_page);
+ ClearPagePrivate(new_page);
+ set_page_huge_active(new_page);
+ set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
+ hugepage_add_new_anon_rmap(new_page, vma, addr);
+ hugetlb_count_add(psize, vma->vm_mm);
+}
+
int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma)
{
pte_t *src_pte, *dst_pte, entry, dst_entry;
- struct page *ptepage;
+ struct page *ptepage, *prealloc = NULL;
unsigned long addr;
int cow;
struct hstate *h = hstate_vma(vma);
@@ -3787,7 +3803,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
dst_entry = huge_ptep_get(dst_pte);
if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
continue;
-
+again:
dst_ptl = huge_pte_lock(h, dst, dst_pte);
src_ptl = huge_pte_lockptr(h, src, src_pte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -3816,6 +3832,54 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}
set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
} else {
+ entry = huge_ptep_get(src_pte);
+ ptepage = pte_page(entry);
+ get_page(ptepage);
+
+ if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+ /* This is very possibly a pinned huge page */
+ if (!prealloc) {
+ /*
+ * Preallocate the huge page without
+ * tons of locks since we could sleep.
+ * Note: we can't use any reservation
+ * because the page will be exclusively
+ * owned by the child later.
+ */
+ put_page(ptepage);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ prealloc = alloc_huge_page(vma, addr, 0);
+ if (!prealloc) {
+ /*
+ * hugetlb_cow() seems to be
+ * more careful here than us.
+ * However for fork() we could
+ * be strict not only because
+ * no one should be referencing
+ * the child mm yet, but also
+ * if resources are rare we'd
+ * better simply fail the
+ * fork() even earlier.
+ */
+ ret = -ENOMEM;
+ break;
+ }
+ goto again;
+ }
+ /*
+ * We have page preallocated so that we can do
+ * the copy right now.
+ */
+ hugetlb_copy_page(vma, dst_pte, addr, ptepage,
+ prealloc);
+ put_page(ptepage);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ prealloc = NULL;
+ continue;
+ }
+
if (cow) {
/*
* No need to notify as we are downgrading page
@@ -3826,9 +3890,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
*/
huge_ptep_set_wrprotect(src, addr, src_pte);
}
- entry = huge_ptep_get(src_pte);
- ptepage = pte_page(entry);
- get_page(ptepage);
+
page_dup_rmap(ptepage, true);
set_huge_pte_at(dst, addr, dst_pte, entry);
hugetlb_count_add(pages_per_huge_page(h), dst);
@@ -3842,6 +3904,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
else
i_mmap_unlock_read(mapping);

+ /* Free the preallocated page if not used at last */
+ if (prealloc)
+ put_page(prealloc);
+
return ret;
}

--
2.26.2

2021-02-03 21:20:56

by Linus Torvalds

Subject: Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm

On Wed, Feb 3, 2021 at 1:08 PM Peter Xu <[email protected]> wrote:
>
> This is the last missing piece of the COW-during-fork effort when there're
> pinned pages found. One can reference 70e806e4e645 ("mm: Do early cow for
> pinned pages during fork() for ptes", 2020-09-27) for more information, since
> we do similar things here rather than pte this time, but just for hugetlb.

No issues with the code itself, but..

Comments are good, but the comments inside this block of code actually
makes the code *much* harder to read, because now the actual logic is
much more spread out and you can't see what it does so well.

> + if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> + /* This is very possibly a pinned huge page */
> + if (!prealloc) {
> + /*
> + * Preallocate the huge page without
> + * tons of locks since we could sleep.
> + * Note: we can't use any reservation
> + * because the page will be exclusively
> + * owned by the child later.
> + */
> + put_page(ptepage);
> + spin_unlock(src_ptl);
> + spin_unlock(dst_ptl);
> + prealloc = alloc_huge_page(vma, addr, 0);
> + if (!prealloc) {
> + /*
> + * hugetlb_cow() seems to be
> + * more careful here than us.
> + * However for fork() we could
> + * be strict not only because
> + * no one should be referencing
> + * the child mm yet, but also
> + * if resources are rare we'd
> + * better simply fail the
> + * fork() even earlier.
> + */
> + ret = -ENOMEM;
> + break;
> + }
> + goto again;
> + }
> + /*
> + * We have page preallocated so that we can do
> + * the copy right now.
> + */
> + hugetlb_copy_page(vma, dst_pte, addr, ptepage,
> + prealloc);
> + put_page(ptepage);
> + spin_unlock(src_ptl);
> + spin_unlock(dst_ptl);
> + prealloc = NULL;
> + continue;
> + }

Can you move the comment above the code? And I _think_ the prealloc
conditional could be split up into a helper function (which would help
more), but maybe there are too many variables for that to be practical.

Linus

2021-02-03 22:34:06

by Peter Xu

Subject: Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm

On Wed, Feb 03, 2021 at 02:04:30PM -0800, Mike Kravetz wrote:
> > @@ -3816,6 +3832,54 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > }
> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > } else {
> > + entry = huge_ptep_get(src_pte);
> > + ptepage = pte_page(entry);
> > + get_page(ptepage);
> > +
> > + if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> > + /* This is very possibly a pinned huge page */
> > + if (!prealloc) {
> > + /*
> > + * Preallocate the huge page without
> > + * tons of locks since we could sleep.
> > + * Note: we can't use any reservation
> > + * because the page will be exclusively
> > + * owned by the child later.
> > + */
> > + put_page(ptepage);
> > + spin_unlock(src_ptl);
> > + spin_unlock(dst_ptl);
> > + prealloc = alloc_huge_page(vma, addr, 0);
>
> One quick question:
>
> The comment says we can't use any reservation, and I agree. However, the
> alloc_huge_page call has 0 as the avoid_reserve argument. Shouldn't that
> be !0 to avoid reserves?

Good point.. so I obviously wanted to skip the reservation check, but got
successfully cheated by the inverted name. :)

Though I did check the reservation path, and it seems not extremely
important - when we fork and copy the vma, we have already dropped the vma
resv map:

if (is_vm_hugetlb_page(tmp))
reset_vma_resv_huge_pages(tmp);

Then in alloc_huge_page() we check vma_resv_map() almost everywhere we'd
check avoid_reserve too (either in vma_needs_reservation, or when
calculating deferred_reserve). So avoid_reserve seems to be mostly useful
when vma_resv_map() exists.

But I completely agree I should pass in "1" here in v2.

Thanks,

--
Peter Xu

2021-02-04 02:04:12

by Mike Kravetz

Subject: Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm

On 2/3/21 1:08 PM, Peter Xu wrote:
> This is the last missing piece of the COW-during-fork effort when there're
> pinned pages found. One can reference 70e806e4e645 ("mm: Do early cow for
> pinned pages during fork() for ptes", 2020-09-27) for more information, since
> we do similar things here rather than pte this time, but just for hugetlb.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 71 insertions(+), 5 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9e6ea96bf33b..931bf1a81c16 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3734,11 +3734,27 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
> return false;
> }
>
> +static void
> +hugetlb_copy_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
> + struct page *old_page, struct page *new_page)
> +{
> + struct hstate *h = hstate_vma(vma);
> + unsigned int psize = pages_per_huge_page(h);
> +
> + copy_user_huge_page(new_page, old_page, addr, vma, psize);
> + __SetPageUptodate(new_page);
> + ClearPagePrivate(new_page);
> + set_page_huge_active(new_page);
> + set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
> + hugepage_add_new_anon_rmap(new_page, vma, addr);
> + hugetlb_count_add(psize, vma->vm_mm);
> +}
> +
> int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> struct vm_area_struct *vma)
> {
> pte_t *src_pte, *dst_pte, entry, dst_entry;
> - struct page *ptepage;
> + struct page *ptepage, *prealloc = NULL;
> unsigned long addr;
> int cow;
> struct hstate *h = hstate_vma(vma);
> @@ -3787,7 +3803,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> dst_entry = huge_ptep_get(dst_pte);
> if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
> continue;
> -
> +again:
> dst_ptl = huge_pte_lock(h, dst, dst_pte);
> src_ptl = huge_pte_lockptr(h, src, src_pte);
> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> @@ -3816,6 +3832,54 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> }
> set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> } else {
> + entry = huge_ptep_get(src_pte);
> + ptepage = pte_page(entry);
> + get_page(ptepage);
> +
> + if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> + /* This is very possibly a pinned huge page */
> + if (!prealloc) {
> + /*
> + * Preallocate the huge page without
> + * tons of locks since we could sleep.
> + * Note: we can't use any reservation
> + * because the page will be exclusively
> + * owned by the child later.
> + */
> + put_page(ptepage);
> + spin_unlock(src_ptl);
> + spin_unlock(dst_ptl);
> + prealloc = alloc_huge_page(vma, addr, 0);

One quick question:

The comment says we can't use any reservation, and I agree. However, the
alloc_huge_page call has 0 as the avoid_reserve argument. Shouldn't that
be !0 to avoid reserves?

--
Mike Kravetz

> + if (!prealloc) {
> + /*
> + * hugetlb_cow() seems to be
> + * more careful here than us.
> + * However for fork() we could
> + * be strict not only because
> + * no one should be referencing
> + * the child mm yet, but also
> + * if resources are rare we'd
> + * better simply fail the
> + * fork() even earlier.
> + */
> + ret = -ENOMEM;
> + break;
> + }
> + goto again;
> + }
> + /*
> + * We have page preallocated so that we can do
> + * the copy right now.
> + */
> + hugetlb_copy_page(vma, dst_pte, addr, ptepage,
> + prealloc);
> + put_page(ptepage);
> + spin_unlock(src_ptl);
> + spin_unlock(dst_ptl);
> + prealloc = NULL;
> + continue;
> + }
> +
> if (cow) {
> /*
> * No need to notify as we are downgrading page
> @@ -3826,9 +3890,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> */
> huge_ptep_set_wrprotect(src, addr, src_pte);
> }
> - entry = huge_ptep_get(src_pte);
> - ptepage = pte_page(entry);
> - get_page(ptepage);
> +
> page_dup_rmap(ptepage, true);
> set_huge_pte_at(dst, addr, dst_pte, entry);
> hugetlb_count_add(pages_per_huge_page(h), dst);
> @@ -3842,6 +3904,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> else
> i_mmap_unlock_read(mapping);
>
> + /* Free the preallocated page if not used at last */
> + if (prealloc)
> + put_page(prealloc);
> +
> return ret;
> }
>

2021-02-04 02:04:57

by Peter Xu

Subject: Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm

On Wed, Feb 03, 2021 at 01:15:03PM -0800, Linus Torvalds wrote:
> On Wed, Feb 3, 2021 at 1:08 PM Peter Xu <[email protected]> wrote:
> >
> > This is the last missing piece of the COW-during-fork effort when there're
> > pinned pages found. One can reference 70e806e4e645 ("mm: Do early cow for
> > pinned pages during fork() for ptes", 2020-09-27) for more information, since
> > we do similar things here rather than pte this time, but just for hugetlb.
>
> No issues with the code itself, but..
>
> Comments are good, but the comments inside this block of code actually
> makes the code *much* harder to read, because now the actual logic is
> much more spread out and you can't see what it does so well.
>
> > + if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> > + /* This is very possibly a pinned huge page */
> > + if (!prealloc) {
> > + /*
> > + * Preallocate the huge page without
> > + * tons of locks since we could sleep.
> > + * Note: we can't use any reservation
> > + * because the page will be exclusively
> > + * owned by the child later.
> > + */
> > + put_page(ptepage);
> > + spin_unlock(src_ptl);
> > + spin_unlock(dst_ptl);
> > + prealloc = alloc_huge_page(vma, addr, 0);
> > + if (!prealloc) {
> > + /*
> > + * hugetlb_cow() seems to be
> > + * more careful here than us.
> > + * However for fork() we could
> > + * be strict not only because
> > + * no one should be referencing
> > + * the child mm yet, but also
> > + * if resources are rare we'd
> > + * better simply fail the
> > + * fork() even earlier.
> > + */
> > + ret = -ENOMEM;
> > + break;
> > + }
> > + goto again;
> > + }
> > + /*
> > + * We have page preallocated so that we can do
> > + * the copy right now.
> > + */
> > + hugetlb_copy_page(vma, dst_pte, addr, ptepage,
> > + prealloc);
> > + put_page(ptepage);
> > + spin_unlock(src_ptl);
> > + spin_unlock(dst_ptl);
> > + prealloc = NULL;
> > + continue;
> > + }
>
> Can you move the comment above the code?

Sure.

> And I _think_ the prealloc conditional could be split up to a helper function
> (which would help more), but maybe there are too many variables for that to
> be practical.

It's just that, compared to the pte case where we introduced
page_copy_prealloc(), we already have a very nice helper in
alloc_huge_page() which takes care of e.g. cgroup charging and so on, so it
seems clean enough to just use it.

The only difference from the pte case is that I moved the reset of
"prealloc" out of the copy function, since we never fail after that point,
to avoid passing a struct page** double pointer.

Would below look better (only comment change)?

---------------8<------------------
@@ -3816,6 +3832,39 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}
set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
} else {
+ entry = huge_ptep_get(src_pte);
+ ptepage = pte_page(entry);
+ get_page(ptepage);
+
+ /*
+ * This is a rare case where we see pinned hugetlb
+ * pages while they're prone to COW. We need to do the
+ * COW earlier during fork.
+ *
+ * When pre-allocating the page we need to be without
+ * all the locks since we could sleep when allocate.
+ */
+ if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+ if (!prealloc) {
+ put_page(ptepage);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ prealloc = alloc_huge_page(vma, addr, 0);
+ if (!prealloc) {
+ ret = -ENOMEM;
+ break;
+ }
+ goto again;
+ }
+ hugetlb_copy_page(vma, dst_pte, addr, ptepage,
+ prealloc);
+ put_page(ptepage);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ prealloc = NULL;
+ continue;
+ }
+
---------------8<------------------
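Stripped of the hugetlb specifics, the preallocate-outside-the-locks then
retry shape of the loop above can be modeled in plain userspace C (purely
illustrative - struct entry and copy_range are invented names, and malloc()
stands in for alloc_huge_page()):

```c
#include <stdlib.h>
#include <string.h>

struct entry { int pinned; int value; };

/* Model of the copy loop: share unpinned entries; for a "pinned" entry
 * without a spare page, allocate one (sleepably, outside the locks,
 * which are not modeled here) and retry the same entry. */
static int copy_range(const struct entry *src, int *dst, size_t n)
{
	void *prealloc = NULL;

	for (size_t i = 0; i < n; i++) {
again:
		if (src[i].pinned) {
			if (!prealloc) {
				prealloc = malloc(sizeof(int));
				if (!prealloc)
					return -1; /* fail the "fork" early */
				goto again; /* retry this entry */
			}
			/* early copy into the fresh page instead of sharing */
			memcpy(prealloc, &src[i].value, sizeof(int));
			dst[i] = *(int *)prealloc;
			free(prealloc);
			prealloc = NULL;
		} else {
			dst[i] = src[i].value; /* shared, wrprotected path */
		}
	}
	free(prealloc); /* drop an unused spare, like put_page(prealloc) */
	return 0;
}
```

The `goto again` mirrors the patch: the entry is re-examined from scratch
after the allocation, since it may have changed while the locks were
dropped.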

Thanks,

--
Peter Xu

2021-02-05 00:29:36

by Gal Pressman

Subject: Re: [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups

On 03/02/2021 23:08, Peter Xu wrote:
> As reported by Gal [1], we still miss the code clip to handle early cow for
> hugetlb case, which is true. Again, it still feels odd to fork() after using a
> few huge pages, especially if they're privately mapped to me.. However I do
> agree with Gal and Jason in that we should still have that since that'll
> complete the early cow on fork effort at least, and it'll still fix issues
> where buffers are not well under control and not easy to apply MADV_DONTFORK.
>
> The first two patches (1-2) are some cleanups I noticed when reading into the
> hugetlb reserve map code. I think it's good to have but they're not necessary
> for fixing the fork issue.
>
> The last two patches (3-4) is the real fix.
>
> I tested this with a fork() after some vfio-pci assignment, so I'm pretty sure
> the page copy path could trigger well (page will be accounted right after the
> fork()), but I didn't do data check since the card I assigned is some random
> nic. Gal, please feel free to try this if you have better way to verify the
> series.

Thanks Peter, once v2 is submitted I'll pull the patches and we'll run the tests
that discovered the issue to verify it works.