2023-06-13 22:09:22

by Peter Xu

Subject: [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp

Hugetlb has a special path for slow gup in which follow_page_mask() is
skipped completely along with faultin_page(). It's not only confusing, but
it also duplicates a lot of logic that generic gup already has, making
hugetlb needlessly special.

This patchset tries to dedup that logic. It first touches up the slow gup
code so that it can handle hugetlb pages correctly with the current follow
page and faultin routines (we're mostly there already; ten years ago thp
was optimized here, but only half way, more below), and the last patch then
drops the special path, so hugetlb gup always goes through the generic
routine via faultin_page() too.

Note that hugetlb is still special for gup, mostly because of the pgtable
walking (hugetlb_walk()) that we rely on, which is currently per-arch. But
this is still one small step forward, and the diffstat may itself suggest
it is worthwhile.

Then for the "speed up thp" side: as a side effect, when I'm looking at the
chunk of code, I found that thp support is actually partially done. It
doesn't mean that thp won't work for gup, but as long as **pages pointer
passed over, the optimization will be skipped too. Patch 6 should address
that, so for thp we now get full speed gup.

For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" goes
from 13992.50us to 378.50us for me. gup_test is an extreme case, but it
shows how much this affects thp gup.
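
For reference, here is a minimal userspace sketch of the page_mask
arithmetic that the slow gup loop relies on (the PAGE_SHIFT / page_mask /
page_increm names mirror __get_user_pages(); the 2MB THP geometry and the
sample address are assumptions made only for the illustration, not part of
this series). With a non-NULL **pages the old code reset ctx.page_mask to
0, so page_increm was always 1; patch 6 keeps the mask so the whole step is
taken even while collecting subpages:

===8<===
#include <stdio.h>

#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define THP_NR_PAGES    512UL   /* e.g. a 2MB THP on x86-64 */

int main(void)
{
        /* Pretend gup stopped 3 subpages into a 2MB-aligned THP. */
        unsigned long start = 0x200000UL + 3 * PAGE_SIZE;
        unsigned long page_mask = THP_NR_PAGES - 1;  /* like ctx.page_mask */
        unsigned long nr_pages = 1024;

        /* Same formula as __get_user_pages(): subpages left in this THP. */
        unsigned long page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);

        if (page_increm > nr_pages)
                page_increm = nr_pages;

        /* Prints 509: one iteration covers the rest of the THP. */
        printf("advance by %lu pages in one iteration\n", page_increm);
        return 0;
}
===8<===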

James: I hope this won't affect your HGM series too much, although it may
conflict with yours. Logically it should even make your series smaller,
since after this one you can drop the patch that supports HGM in
follow_hugetlb_page(); but that only matters if this one is proven useful
and merged first.

Patches 1-6 prepare for the switch, while patch 7 does the switchover.

I hope I didn't miss anything; if I did, I'll be very happy to hear about
it. Please have a look, thanks.

Peter Xu (7):
mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks
mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
mm/gup: Cleanup next_page handling
mm/gup: Accelerate thp gup even for "pages != NULL"
mm/gup: Retire follow_hugetlb_page()

include/linux/hugetlb.h | 20 +---
mm/gup.c | 68 +++++------
mm/hugetlb.c | 259 ++++------------------------------------
3 files changed, 63 insertions(+), 284 deletions(-)

--
2.40.1



2023-06-13 22:09:26

by Peter Xu

Subject: [PATCH 5/7] mm/gup: Cleanup next_page handling

The only path that doesn't use the generic "**pages" handling is the gate
vma. Make it use the same path, and meanwhile move the next_page label up
to cover the "**pages" handling. This prepares for THP handling of
"**pages".

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8d59ae4554e7..a2d1b3c4b104 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
if (!vma && in_gate_area(mm, start)) {
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
- pages ? &pages[i] : NULL);
+ pages ? &page : NULL);
if (ret)
goto out;
ctx.page_mask = 0;
@@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
ret = PTR_ERR(page);
goto out;
}
-
- goto next_page;
} else if (IS_ERR(page)) {
ret = PTR_ERR(page);
goto out;
}
+next_page:
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
flush_dcache_page(page);
ctx.page_mask = 0;
}
-next_page:
+
page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
--
2.40.1


2023-06-13 22:09:43

by Peter Xu

Subject: [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()

Now __get_user_pages() should be well prepared to handle thp completely,
as well as hugetlb gup requests, even without hugetlb's special path.

Time to retire follow_hugetlb_page().

Tweak the comments in follow_page_mask() to reflect reality, by dropping
the "follow_page()" description.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 12 ---
mm/gup.c | 19 ----
mm/hugetlb.c | 223 ----------------------------------------
3 files changed, 254 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0d6f389d98de..44e5836eed15 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -133,9 +133,6 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
unsigned int *page_mask);
-long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
- struct page **, unsigned long *, unsigned long *,
- long, unsigned int, int *);
void unmap_hugepage_range(struct vm_area_struct *,
unsigned long, unsigned long, struct page *,
zap_flags_t);
@@ -305,15 +302,6 @@ static inline struct page *hugetlb_follow_page_mask(
BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
}

-static inline long follow_hugetlb_page(struct mm_struct *mm,
- struct vm_area_struct *vma, struct page **pages,
- unsigned long *position, unsigned long *nr_pages,
- long i, unsigned int flags, int *nonblocking)
-{
- BUG();
- return 0;
-}
-
static inline int copy_hugetlb_page_range(struct mm_struct *dst,
struct mm_struct *src,
struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index cdabc8ea783b..a65b80953b7a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -789,9 +789,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
* Call hugetlb_follow_page_mask for hugetlb vmas as it will use
* special hugetlb page table walking code. This eliminates the
* need to check for hugetlb entries in the general walking code.
- *
- * hugetlb_follow_page_mask is only for follow_page() handling here.
- * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
*/
if (is_vm_hugetlb_page(vma))
return hugetlb_follow_page_mask(vma, address, flags,
@@ -1149,22 +1146,6 @@ static long __get_user_pages(struct mm_struct *mm,
ret = check_vma_flags(vma, gup_flags);
if (ret)
goto out;
-
- if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages,
- &start, &nr_pages, i,
- gup_flags, locked);
- if (!*locked) {
- /*
- * We've got a VM_FAULT_RETRY
- * and we've lost mmap_lock.
- * We must stop here.
- */
- BUG_ON(gup_flags & FOLL_NOWAIT);
- goto out;
- }
- continue;
- }
}
retry:
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31d8f18bc2e4..b7ff413ff68b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6425,37 +6425,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
}
#endif /* CONFIG_USERFAULTFD */

-static void record_subpages(struct page *page, struct vm_area_struct *vma,
- int refs, struct page **pages)
-{
- int nr;
-
- for (nr = 0; nr < refs; nr++) {
- if (likely(pages))
- pages[nr] = nth_page(page, nr);
- }
-}
-
-static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
- unsigned int flags, pte_t *pte,
- bool *unshare)
-{
- pte_t pteval = huge_ptep_get(pte);
-
- *unshare = false;
- if (is_swap_pte(pteval))
- return true;
- if (huge_pte_write(pteval))
- return false;
- if (flags & FOLL_WRITE)
- return true;
- if (gup_must_unshare(vma, flags, pte_page(pteval))) {
- *unshare = true;
- return true;
- }
- return false;
-}
-
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
unsigned int *page_mask)
@@ -6518,198 +6487,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
return page;
}

-long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page **pages, unsigned long *position,
- unsigned long *nr_pages, long i, unsigned int flags,
- int *locked)
-{
- unsigned long pfn_offset;
- unsigned long vaddr = *position;
- unsigned long remainder = *nr_pages;
- struct hstate *h = hstate_vma(vma);
- int err = -EFAULT, refs;
-
- while (vaddr < vma->vm_end && remainder) {
- pte_t *pte;
- spinlock_t *ptl = NULL;
- bool unshare = false;
- int absent;
- struct page *page;
-
- /*
- * If we have a pending SIGKILL, don't keep faulting pages and
- * potentially allocating memory.
- */
- if (fatal_signal_pending(current)) {
- remainder = 0;
- break;
- }
-
- hugetlb_vma_lock_read(vma);
- /*
- * Some archs (sparc64, sh*) have multiple pte_ts to
- * each hugepage. We have to make sure we get the
- * first, for the page indexing below to work.
- *
- * Note that page table lock is not held when pte is null.
- */
- pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
- huge_page_size(h));
- if (pte)
- ptl = huge_pte_lock(h, mm, pte);
- absent = !pte || huge_pte_none(huge_ptep_get(pte));
-
- /*
- * When coredumping, it suits get_dump_page if we just return
- * an error where there's an empty slot with no huge pagecache
- * to back it. This way, we avoid allocating a hugepage, and
- * the sparse dumpfile avoids allocating disk blocks, but its
- * huge holes still show up with zeroes where they need to be.
- */
- if (absent && (flags & FOLL_DUMP) &&
- !hugetlbfs_pagecache_present(h, vma, vaddr)) {
- if (pte)
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- remainder = 0;
- break;
- }
-
- /*
- * We need call hugetlb_fault for both hugepages under migration
- * (in which case hugetlb_fault waits for the migration,) and
- * hwpoisoned hugepages (in which case we need to prevent the
- * caller from accessing to them.) In order to do this, we use
- * here is_swap_pte instead of is_hugetlb_entry_migration and
- * is_hugetlb_entry_hwpoisoned. This is because it simply covers
- * both cases, and because we can't follow correct pages
- * directly from any kind of swap entries.
- */
- if (absent ||
- __follow_hugetlb_must_fault(vma, flags, pte, &unshare)) {
- vm_fault_t ret;
- unsigned int fault_flags = 0;
-
- if (pte)
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
-
- if (flags & FOLL_WRITE)
- fault_flags |= FAULT_FLAG_WRITE;
- else if (unshare)
- fault_flags |= FAULT_FLAG_UNSHARE;
- if (locked) {
- fault_flags |= FAULT_FLAG_ALLOW_RETRY |
- FAULT_FLAG_KILLABLE;
- if (flags & FOLL_INTERRUPTIBLE)
- fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
- }
- if (flags & FOLL_NOWAIT)
- fault_flags |= FAULT_FLAG_ALLOW_RETRY |
- FAULT_FLAG_RETRY_NOWAIT;
- if (flags & FOLL_TRIED) {
- /*
- * Note: FAULT_FLAG_ALLOW_RETRY and
- * FAULT_FLAG_TRIED can co-exist
- */
- fault_flags |= FAULT_FLAG_TRIED;
- }
- ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
- if (ret & VM_FAULT_ERROR) {
- err = vm_fault_to_errno(ret, flags);
- remainder = 0;
- break;
- }
- if (ret & VM_FAULT_RETRY) {
- if (locked &&
- !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
- *locked = 0;
- *nr_pages = 0;
- /*
- * VM_FAULT_RETRY must not return an
- * error, it will return zero
- * instead.
- *
- * No need to update "position" as the
- * caller will not check it after
- * *nr_pages is set to 0.
- */
- return i;
- }
- continue;
- }
-
- pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
- page = pte_page(huge_ptep_get(pte));
-
- VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
- !PageAnonExclusive(page), page);
-
- /*
- * If subpage information not requested, update counters
- * and skip the same_page loop below.
- */
- if (!pages && !pfn_offset &&
- (vaddr + huge_page_size(h) < vma->vm_end) &&
- (remainder >= pages_per_huge_page(h))) {
- vaddr += huge_page_size(h);
- remainder -= pages_per_huge_page(h);
- i += pages_per_huge_page(h);
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- continue;
- }
-
- /* vaddr may not be aligned to PAGE_SIZE */
- refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
- (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
-
- if (pages)
- record_subpages(nth_page(page, pfn_offset),
- vma, refs,
- likely(pages) ? pages + i : NULL);
-
- if (pages) {
- /*
- * try_grab_folio() should always succeed here,
- * because: a) we hold the ptl lock, and b) we've just
- * checked that the huge page is present in the page
- * tables. If the huge page is present, then the tail
- * pages must also be present. The ptl prevents the
- * head page and tail pages from being rearranged in
- * any way. As this is hugetlb, the pages will never
- * be p2pdma or not longterm pinable. So this page
- * must be available at this point, unless the page
- * refcount overflowed:
- */
- if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
- flags))) {
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- remainder = 0;
- err = -ENOMEM;
- break;
- }
- }
-
- vaddr += (refs << PAGE_SHIFT);
- remainder -= refs;
- i += refs;
-
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- }
- *nr_pages = remainder;
- /*
- * setting position is actually required only if remainder is
- * not zero but it's faster not to add a "if (remainder)"
- * branch.
- */
- *position = vaddr;
-
- return i ? i : err;
-}
-
long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
--
2.40.1


2023-06-13 22:10:36

by Peter Xu

Subject: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"

The acceleration of THP was done with ctx.page_mask; however, it is
ignored if **pages is non-NULL.

The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages"). It didn't explain why
the **pages non-NULL case can't be optimized. It's possible that at the
time the major goal was mm_populate(), for which this was enough.

Optimize thp for all cases by properly looping over each subpage, doing
cache flushes, and boosting refcounts/pincounts where needed, in one go.

This can be verified using gup_test below:

# chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10

Before: 13992.50 ( +-8.75%)
After: 378.50 (+-69.62%)

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 36 +++++++++++++++++++++++++++++-------
1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a2d1b3c4b104..cdabc8ea783b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
goto out;
}
next_page:
- if (pages) {
- pages[i] = page;
- flush_anon_page(vma, page, start);
- flush_dcache_page(page);
- ctx.page_mask = 0;
- }
-
page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
+
+ if (pages) {
+ struct page *subpage;
+ unsigned int j;
+
+ /*
+ * This must be a large folio (and doesn't need to
+ * be the whole folio; it can be part of it), do
+ * the refcount work for all the subpages too.
+ * Since we already hold refcount on the head page,
+ * it should never fail.
+ *
+ * NOTE: here the page may not be the head page
+ * e.g. when start addr is not thp-size aligned.
+ */
+ if (page_increm > 1)
+ WARN_ON_ONCE(
+ try_grab_folio(compound_head(page),
+ page_increm - 1,
+ foll_flags) == NULL);
+
+ for (j = 0; j < page_increm; j++) {
+ subpage = nth_page(page, j);
+ pages[i+j] = subpage;
+ flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
+ flush_dcache_page(subpage);
+ }
+ }
+
i += page_increm;
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
--
2.40.1


2023-06-14 15:42:23

by Peter Xu

Subject: Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"

On Wed, Jun 14, 2023 at 03:58:34PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> > + if (page_increm > 1)
> > + WARN_ON_ONCE(
> > + try_grab_folio(compound_head(page),
>
> You don't need to call compound_head() here; try_grab_folio() works
> on tail pages just fine.

I did it with caution because of two things I'm not sure about: both
is_pci_p2pdma_page() and is_longterm_pinnable_page() inside it call
is_zone_device_page() on the page*.

But I just noticed try_grab_folio() is also used in gup_pte_range(), where
a thp can be pte-mapped, so I assume it at least needs to handle tail pages
well.

Do we perhaps need the compound_head() inside try_grab_folio(), as a
separate patch? Or maybe I was wrong about is_zone_device_page()?

>
> > + page_increm - 1,
> > + foll_flags) == NULL);
> > +
> > + for (j = 0; j < page_increm; j++) {
> > + subpage = nth_page(page, j);
> > + pages[i+j] = subpage;
> > + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > + flush_dcache_page(subpage);
>
> You're better off calling flush_dcache_folio() right at the end.

Will do.

Thanks,

--
Peter Xu


2023-06-14 15:44:53

by Jason Gunthorpe

Subject: Re: [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()

On Tue, Jun 13, 2023 at 05:53:46PM -0400, Peter Xu wrote:
> Now __get_user_pages() should be well prepared to handle thp completely,
> as long as hugetlb gup requests even without the hugetlb's special path.
>
> Time to retire follow_hugetlb_page().
>
> Tweak the comments in follow_page_mask() to reflect reality, by dropping
> the "follow_page()" description.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/hugetlb.h | 12 ---
> mm/gup.c | 19 ----
> mm/hugetlb.c | 223 ----------------------------------------
> 3 files changed, 254 deletions(-)

It is like a dream come true :)

I don't know enough about the hugetlb details, but this series broadly made
sense to me.

Thanks,
Jason

2023-06-14 15:44:57

by Matthew Wilcox

Subject: Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"

On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> + if (page_increm > 1)
> + WARN_ON_ONCE(
> + try_grab_folio(compound_head(page),

You don't need to call compound_head() here; try_grab_folio() works
on tail pages just fine.

> + page_increm - 1,
> + foll_flags) == NULL);
> +
> + for (j = 0; j < page_increm; j++) {
> + subpage = nth_page(page, j);
> + pages[i+j] = subpage;
> + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> + flush_dcache_page(subpage);

You're better off calling flush_dcache_folio() right at the end.


2023-06-14 16:21:21

by Peter Xu

Subject: Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"

On Wed, Jun 14, 2023 at 11:19:48AM -0400, Peter Xu wrote:
> > > + for (j = 0; j < page_increm; j++) {
> > > + subpage = nth_page(page, j);
> > > + pages[i+j] = subpage;
> > > + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > > + flush_dcache_page(subpage);
> >
> > You're better off calling flush_dcache_folio() right at the end.
>
> Will do.

Ah, when I started to modify it I noticed it's a double-edged sword: we'd
then also flush the dcache over the whole folio even if we gup only one
page.

We'd only start to benefit once some arch actually implements
flush_dcache_folio() (which none seem to do right now..), while we'd
immediately start to lose by amplifying the flush when gup covers only part
of a folio.

Perhaps I should keep it as-is, then? It's still accurate, always faster
than the old code, and definitely not a regression in any form.
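
To make the trade-off concrete, here's a rough sketch (not compile-tested,
and with the flush_anon_page() handling omitted just to keep it short) of
what I understood the suggestion to be, i.e. one folio-wide flush instead
of the per-subpage ones:

===8<===
        for (j = 0; j < page_increm; j++) {
                subpage = nth_page(page, j);
                pages[i + j] = subpage;
        }
        /*
         * One flush for the whole folio, even when only page_increm
         * subpages of it were requested; that's the amplification
         * mentioned above.
         */
        flush_dcache_folio(page_folio(page));
===8<===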

--
Peter Xu


2023-06-17 20:35:43

by Lorenzo Stoakes

Subject: Re: [PATCH 5/7] mm/gup: Cleanup next_page handling

On Tue, Jun 13, 2023 at 05:53:44PM -0400, Peter Xu wrote:
> The only path that doesn't use generic "**pages" handling is the gate vma.
> Make it use the same path, meanwhile tune the next_page label upper to
> cover "**pages" handling. This prepares for THP handling for "**pages".
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/gup.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 8d59ae4554e7..a2d1b3c4b104 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
> if (!vma && in_gate_area(mm, start)) {
> ret = get_gate_page(mm, start & PAGE_MASK,
> gup_flags, &vma,
> - pages ? &pages[i] : NULL);
> + pages ? &page : NULL);

Good spot... ugh, it's unfortunate that we handled this differently.

> if (ret)
> goto out;
> ctx.page_mask = 0;

We can drop this line now, right? The new next_page block will duplicate
it.

> @@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
> ret = PTR_ERR(page);
> goto out;
> }
> -
> - goto next_page;

This is neat: we've already checked that pages != NULL, so the if (pages)
block at the new next_page label will not be run.

> } else if (IS_ERR(page)) {
> ret = PTR_ERR(page);
> goto out;
> }
> +next_page:
> if (pages) {
> pages[i] = page;
> flush_anon_page(vma, page, start);
> flush_dcache_page(page);

I guess there's no harm in flushing here now, though it seems superfluous
to me; not a big deal, I don't think.

> ctx.page_mask = 0;
> }
> -next_page:
> +
> page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> if (page_increm > nr_pages)
> page_increm = nr_pages;
> --
> 2.40.1
>

Other than that, LGTM,

Reviewed-by: Lorenzo Stoakes <[email protected]>

2023-06-17 20:37:18

by Lorenzo Stoakes

Subject: Re: [PATCH 5/7] mm/gup: Cleanup next_page handling

On Sat, Jun 17, 2023 at 08:48:38PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 13, 2023 at 05:53:44PM -0400, Peter Xu wrote:
> > The only path that doesn't use generic "**pages" handling is the gate vma.
> > Make it use the same path, meanwhile tune the next_page label upper to
> > cover "**pages" handling. This prepares for THP handling for "**pages".
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/gup.c | 7 +++----
> > 1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 8d59ae4554e7..a2d1b3c4b104 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
> > if (!vma && in_gate_area(mm, start)) {
> > ret = get_gate_page(mm, start & PAGE_MASK,
> > gup_flags, &vma,
> > - pages ? &pages[i] : NULL);
> > + pages ? &page : NULL);
>
> Good spot... ugh that we handled this differently.
>
> > if (ret)
> > goto out;
> > ctx.page_mask = 0;
>
> We can drop this line now right? As the new next_page block will duplicate
> this.

OK, I can see why you left this in, given the last patch in the series :)
Please disregard.

>
> > @@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
> > ret = PTR_ERR(page);
> > goto out;
> > }
> > -
> > - goto next_page;
>
> This is neat, we've already checked if pages != NULL so the if (pages)
> block at the new next_page label will not be run.
>
> > } else if (IS_ERR(page)) {
> > ret = PTR_ERR(page);
> > goto out;
> > }
> > +next_page:
> > if (pages) {
> > pages[i] = page;
> > flush_anon_page(vma, page, start);
> > flush_dcache_page(page);
>
> I guess there's no harm that we now flush here, though it seems to me to be
> superfluous, it's not a big deal I don't think.
>
> > ctx.page_mask = 0;
> > }
> > -next_page:
> > +
> > page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> > if (page_increm > nr_pages)
> > page_increm = nr_pages;
> > --
> > 2.40.1
> >
>
> Other than that, LGTM,
>
> Reviewed-by: Lorenzo Stoakes <[email protected]>

2023-06-17 20:44:42

by Lorenzo Stoakes

Subject: Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"

On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> The acceleration of THP was done with ctx.page_mask, however it'll be
> ignored if **pages is non-NULL.
>
> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> accelerate mm_populate() treatment of THP pages"). It didn't explain why
> we can't optimize the **pages non-NULL case. It's possible that at that
> time the major goal was for mm_populate() which should be enough back then.
>
> Optimize thp for all cases, by properly looping over each subpage, doing
> cache flushes, and boost refcounts / pincounts where needed in one go.
>
> This can be verified using gup_test below:
>
> # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>
> Before: 13992.50 ( +-8.75%)
> After: 378.50 (+-69.62%)
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/gup.c | 36 +++++++++++++++++++++++++++++-------
> 1 file changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index a2d1b3c4b104..cdabc8ea783b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
> goto out;
> }
> next_page:
> - if (pages) {
> - pages[i] = page;
> - flush_anon_page(vma, page, start);
> - flush_dcache_page(page);
> - ctx.page_mask = 0;
> - }
> -
> page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> if (page_increm > nr_pages)
> page_increm = nr_pages;
> +
> + if (pages) {
> + struct page *subpage;
> + unsigned int j;
> +
> + /*
> + * This must be a large folio (and doesn't need to
> + * be the whole folio; it can be part of it), do
> + * the refcount work for all the subpages too.
> + * Since we already hold refcount on the head page,
> + * it should never fail.
> + *
> + * NOTE: here the page may not be the head page
> + * e.g. when start addr is not thp-size aligned.
> + */
> + if (page_increm > 1)
> + WARN_ON_ONCE(
> + try_grab_folio(compound_head(page),
> + page_increm - 1,
> + foll_flags) == NULL);

I'm not sure we should just warn but otherwise ignore this returning NULL?
This feels like a case that could come up in reality, e.g.
folio_ref_try_add_rcu() fails, or !folio_is_longterm_pinnable().

Side-note: I _hate_ the semantics of GUP such that try_grab_folio()
(invoked, other than for huge page cases, by the GUP-fast logic) will
explicitly fail if neither FOLL_GET nor FOLL_PIN is specified,
differentiating it from try_grab_page() in this respect.

This is a side-note and not relevant here, as all callers of
__get_user_pages() either explicitly set FOLL_GET if not set by the user
(in __get_user_pages_locked()) or don't set pages (e.g. in
faultin_vma_page_range()).

> +
> + for (j = 0; j < page_increm; j++) {
> + subpage = nth_page(page, j);
> + pages[i+j] = subpage;
> + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> + flush_dcache_page(subpage);
> + }
> + }
> +
> i += page_increm;
> start += page_increm * PAGE_SIZE;
> nr_pages -= page_increm;
> --
> 2.40.1
>

2023-06-17 20:57:48

by Lorenzo Stoakes

Subject: Re: [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()

On Tue, Jun 13, 2023 at 05:53:46PM -0400, Peter Xu wrote:
> Now __get_user_pages() should be well prepared to handle thp completely,
> as long as hugetlb gup requests even without the hugetlb's special path.
>
> Time to retire follow_hugetlb_page().

Nit, but there are a couple of leftover references to this function in the
comments in fs/userfaultfd.c and mm/hugetlb.c.

>
> Tweak the comments in follow_page_mask() to reflect reality, by dropping
> the "follow_page()" description.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/hugetlb.h | 12 ---
> mm/gup.c | 19 ----
> mm/hugetlb.c | 223 ----------------------------------------
> 3 files changed, 254 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 0d6f389d98de..44e5836eed15 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -133,9 +133,6 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
> struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> unsigned long address, unsigned int flags,
> unsigned int *page_mask);
> -long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> - struct page **, unsigned long *, unsigned long *,
> - long, unsigned int, int *);
> void unmap_hugepage_range(struct vm_area_struct *,
> unsigned long, unsigned long, struct page *,
> zap_flags_t);
> @@ -305,15 +302,6 @@ static inline struct page *hugetlb_follow_page_mask(
> BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
> }
>
> -static inline long follow_hugetlb_page(struct mm_struct *mm,
> - struct vm_area_struct *vma, struct page **pages,
> - unsigned long *position, unsigned long *nr_pages,
> - long i, unsigned int flags, int *nonblocking)
> -{
> - BUG();
> - return 0;
> -}
> -
> static inline int copy_hugetlb_page_range(struct mm_struct *dst,
> struct mm_struct *src,
> struct vm_area_struct *dst_vma,
> diff --git a/mm/gup.c b/mm/gup.c
> index cdabc8ea783b..a65b80953b7a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -789,9 +789,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
> * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
> * special hugetlb page table walking code. This eliminates the
> * need to check for hugetlb entries in the general walking code.
> - *
> - * hugetlb_follow_page_mask is only for follow_page() handling here.
> - * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
> */
> if (is_vm_hugetlb_page(vma))
> return hugetlb_follow_page_mask(vma, address, flags,
> @@ -1149,22 +1146,6 @@ static long __get_user_pages(struct mm_struct *mm,
> ret = check_vma_flags(vma, gup_flags);
> if (ret)
> goto out;
> -
> - if (is_vm_hugetlb_page(vma)) {
> - i = follow_hugetlb_page(mm, vma, pages,
> - &start, &nr_pages, i,
> - gup_flags, locked);
> - if (!*locked) {
> - /*
> - * We've got a VM_FAULT_RETRY
> - * and we've lost mmap_lock.
> - * We must stop here.
> - */
> - BUG_ON(gup_flags & FOLL_NOWAIT);
> - goto out;
> - }
> - continue;
> - }
> }
> retry:
> /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31d8f18bc2e4..b7ff413ff68b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6425,37 +6425,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> }
> #endif /* CONFIG_USERFAULTFD */
>
> -static void record_subpages(struct page *page, struct vm_area_struct *vma,
> - int refs, struct page **pages)
> -{
> - int nr;
> -
> - for (nr = 0; nr < refs; nr++) {
> - if (likely(pages))
> - pages[nr] = nth_page(page, nr);
> - }
> -}
> -
> -static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
> - unsigned int flags, pte_t *pte,
> - bool *unshare)
> -{
> - pte_t pteval = huge_ptep_get(pte);
> -
> - *unshare = false;
> - if (is_swap_pte(pteval))
> - return true;
> - if (huge_pte_write(pteval))
> - return false;
> - if (flags & FOLL_WRITE)
> - return true;
> - if (gup_must_unshare(vma, flags, pte_page(pteval))) {
> - *unshare = true;
> - return true;
> - }
> - return false;
> -}
> -
> struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> unsigned long address, unsigned int flags,
> unsigned int *page_mask)
> @@ -6518,198 +6487,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> return page;
> }
>
> -long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> - struct page **pages, unsigned long *position,
> - unsigned long *nr_pages, long i, unsigned int flags,
> - int *locked)
> -{
> - unsigned long pfn_offset;
> - unsigned long vaddr = *position;
> - unsigned long remainder = *nr_pages;
> - struct hstate *h = hstate_vma(vma);
> - int err = -EFAULT, refs;
> -
> - while (vaddr < vma->vm_end && remainder) {
> - pte_t *pte;
> - spinlock_t *ptl = NULL;
> - bool unshare = false;
> - int absent;
> - struct page *page;
> -
> - /*
> - * If we have a pending SIGKILL, don't keep faulting pages and
> - * potentially allocating memory.
> - */
> - if (fatal_signal_pending(current)) {
> - remainder = 0;
> - break;
> - }
> -
> - hugetlb_vma_lock_read(vma);
> - /*
> - * Some archs (sparc64, sh*) have multiple pte_ts to
> - * each hugepage. We have to make sure we get the
> - * first, for the page indexing below to work.
> - *
> - * Note that page table lock is not held when pte is null.
> - */
> - pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
> - huge_page_size(h));
> - if (pte)
> - ptl = huge_pte_lock(h, mm, pte);
> - absent = !pte || huge_pte_none(huge_ptep_get(pte));
> -
> - /*
> - * When coredumping, it suits get_dump_page if we just return
> - * an error where there's an empty slot with no huge pagecache
> - * to back it. This way, we avoid allocating a hugepage, and
> - * the sparse dumpfile avoids allocating disk blocks, but its
> - * huge holes still show up with zeroes where they need to be.
> - */
> - if (absent && (flags & FOLL_DUMP) &&
> - !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> - if (pte)
> - spin_unlock(ptl);
> - hugetlb_vma_unlock_read(vma);
> - remainder = 0;
> - break;
> - }
> -
> - /*
> - * We need call hugetlb_fault for both hugepages under migration
> - * (in which case hugetlb_fault waits for the migration,) and
> - * hwpoisoned hugepages (in which case we need to prevent the
> - * caller from accessing to them.) In order to do this, we use
> - * here is_swap_pte instead of is_hugetlb_entry_migration and
> - * is_hugetlb_entry_hwpoisoned. This is because it simply covers
> - * both cases, and because we can't follow correct pages
> - * directly from any kind of swap entries.
> - */
> - if (absent ||
> - __follow_hugetlb_must_fault(vma, flags, pte, &unshare)) {
> - vm_fault_t ret;
> - unsigned int fault_flags = 0;
> -
> - if (pte)
> - spin_unlock(ptl);
> - hugetlb_vma_unlock_read(vma);
> -
> - if (flags & FOLL_WRITE)
> - fault_flags |= FAULT_FLAG_WRITE;
> - else if (unshare)
> - fault_flags |= FAULT_FLAG_UNSHARE;
> - if (locked) {
> - fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> - FAULT_FLAG_KILLABLE;
> - if (flags & FOLL_INTERRUPTIBLE)
> - fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
> - }
> - if (flags & FOLL_NOWAIT)
> - fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> - FAULT_FLAG_RETRY_NOWAIT;
> - if (flags & FOLL_TRIED) {
> - /*
> - * Note: FAULT_FLAG_ALLOW_RETRY and
> - * FAULT_FLAG_TRIED can co-exist
> - */
> - fault_flags |= FAULT_FLAG_TRIED;
> - }
> - ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> - if (ret & VM_FAULT_ERROR) {
> - err = vm_fault_to_errno(ret, flags);
> - remainder = 0;
> - break;
> - }
> - if (ret & VM_FAULT_RETRY) {
> - if (locked &&
> - !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
> - *locked = 0;
> - *nr_pages = 0;
> - /*
> - * VM_FAULT_RETRY must not return an
> - * error, it will return zero
> - * instead.
> - *
> - * No need to update "position" as the
> - * caller will not check it after
> - * *nr_pages is set to 0.
> - */
> - return i;
> - }
> - continue;
> - }
> -
> - pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> - page = pte_page(huge_ptep_get(pte));
> -
> - VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> - !PageAnonExclusive(page), page);
> -
> - /*
> - * If subpage information not requested, update counters
> - * and skip the same_page loop below.
> - */
> - if (!pages && !pfn_offset &&
> - (vaddr + huge_page_size(h) < vma->vm_end) &&
> - (remainder >= pages_per_huge_page(h))) {
> - vaddr += huge_page_size(h);
> - remainder -= pages_per_huge_page(h);
> - i += pages_per_huge_page(h);
> - spin_unlock(ptl);
> - hugetlb_vma_unlock_read(vma);
> - continue;
> - }
> -
> - /* vaddr may not be aligned to PAGE_SIZE */
> - refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
> - (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
> -
> - if (pages)
> - record_subpages(nth_page(page, pfn_offset),
> - vma, refs,
> - likely(pages) ? pages + i : NULL);
> -
> - if (pages) {
> - /*
> - * try_grab_folio() should always succeed here,
> - * because: a) we hold the ptl lock, and b) we've just
> - * checked that the huge page is present in the page
> - * tables. If the huge page is present, then the tail
> - * pages must also be present. The ptl prevents the
> - * head page and tail pages from being rearranged in
> - * any way. As this is hugetlb, the pages will never
> - * be p2pdma or not longterm pinable. So this page
> - * must be available at this point, unless the page
> - * refcount overflowed:
> - */
> - if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
> - flags))) {
> - spin_unlock(ptl);
> - hugetlb_vma_unlock_read(vma);
> - remainder = 0;
> - err = -ENOMEM;
> - break;
> - }
> - }
> -
> - vaddr += (refs << PAGE_SHIFT);
> - remainder -= refs;
> - i += refs;
> -
> - spin_unlock(ptl);
> - hugetlb_vma_unlock_read(vma);
> - }
> - *nr_pages = remainder;
> - /*
> - * setting position is actually required only if remainder is
> - * not zero but it's faster not to add a "if (remainder)"
> - * branch.
> - */
> - *position = vaddr;
> -
> - return i ? i : err;
> -}
> -
> long hugetlb_change_protection(struct vm_area_struct *vma,
> unsigned long address, unsigned long end,
> pgprot_t newprot, unsigned long cp_flags)
> --
> 2.40.1
>

Absolutely wonderful to see such delightful code deletion :)

I haven't really dug into the huge page side of this in great detail, so I
can't give a meaningful tag, but I can certainly appreciate the code you're
removing here!

2023-06-19 19:37:58

by Peter Xu

Subject: Re: [PATCH 5/7] mm/gup: Cleanup next_page handling

On Sat, Jun 17, 2023 at 09:00:34PM +0100, Lorenzo Stoakes wrote:
> On Sat, Jun 17, 2023 at 08:48:38PM +0100, Lorenzo Stoakes wrote:
> > On Tue, Jun 13, 2023 at 05:53:44PM -0400, Peter Xu wrote:
> > > The only path that doesn't use generic "**pages" handling is the gate vma.
> > > Make it use the same path, meanwhile tune the next_page label upper to
> > > cover "**pages" handling. This prepares for THP handling for "**pages".
> > >
> > > Signed-off-by: Peter Xu <[email protected]>
> > > ---
> > > mm/gup.c | 7 +++----
> > > 1 file changed, 3 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 8d59ae4554e7..a2d1b3c4b104 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
> > > if (!vma && in_gate_area(mm, start)) {
> > > ret = get_gate_page(mm, start & PAGE_MASK,
> > > gup_flags, &vma,
> > > - pages ? &pages[i] : NULL);
> > > + pages ? &page : NULL);
> >
> > Good spot... ugh that we handled this differently.
> >
> > > if (ret)
> > > goto out;
> > > ctx.page_mask = 0;
> >
> > We can drop this line now right? As the new next_page block will duplicate
> > this.
>
> OK I can see why you left this in given the last patch in the series :)
> Please disregard.

Yes, the other "page_mask = 0" will be removed in the next (not the last)
patch.

>
> >
> > > @@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
> > > ret = PTR_ERR(page);
> > > goto out;
> > > }
> > > -
> > > - goto next_page;
> >
> > This is neat, we've already checked if pages != NULL so the if (pages)
> > block at the new next_page label will not be run.

Yes.

> >
> > > } else if (IS_ERR(page)) {
> > > ret = PTR_ERR(page);
> > > goto out;
> > > }
> > > +next_page:
> > > if (pages) {
> > > pages[i] = page;
> > > flush_anon_page(vma, page, start);
> > > flush_dcache_page(page);
> >
> > I guess there's no harm that we now flush here, though it seems to me to be
> > superfluous, it's not a big deal I don't think.

I'd say GUP on a gate vma page should be rare enough that it shouldn't be a
big deal. Also, iiuc vsyscall=xonly should be the default, so gup may well
already fail on a gate vma page even when only trying to read it..

> >
> > > ctx.page_mask = 0;
> > > }
> > > -next_page:
> > > +
> > > page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> > > if (page_increm > nr_pages)
> > > page_increm = nr_pages;
> > > --
> > > 2.40.1
> > >
> >
> > Other than that, LGTM,
> >
> > Reviewed-by: Lorenzo Stoakes <[email protected]>

Thanks for looking!

--
Peter Xu


2023-06-19 19:49:06

by Peter Xu

Subject: Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"

On Sat, Jun 17, 2023 at 09:27:22PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> > The acceleration of THP was done with ctx.page_mask, however it'll be
> > ignored if **pages is non-NULL.
> >
> > The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> > accelerate mm_populate() treatment of THP pages"). It didn't explain why
> > we can't optimize the **pages non-NULL case. It's possible that at that
> > time the major goal was for mm_populate() which should be enough back then.
> >
> > Optimize thp for all cases, by properly looping over each subpage, doing
> > cache flushes, and boost refcounts / pincounts where needed in one go.
> >
> > This can be verified using gup_test below:
> >
> > # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
> >
> > Before: 13992.50 ( +-8.75%)
> > After: 378.50 (+-69.62%)
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/gup.c | 36 +++++++++++++++++++++++++++++-------
> > 1 file changed, 29 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index a2d1b3c4b104..cdabc8ea783b 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
> > goto out;
> > }
> > next_page:
> > - if (pages) {
> > - pages[i] = page;
> > - flush_anon_page(vma, page, start);
> > - flush_dcache_page(page);
> > - ctx.page_mask = 0;
> > - }
> > -
> > page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> > if (page_increm > nr_pages)
> > page_increm = nr_pages;
> > +
> > + if (pages) {
> > + struct page *subpage;
> > + unsigned int j;
> > +
> > + /*
> > + * This must be a large folio (and doesn't need to
> > + * be the whole folio; it can be part of it), do
> > + * the refcount work for all the subpages too.
> > + * Since we already hold refcount on the head page,
> > + * it should never fail.
> > + *
> > + * NOTE: here the page may not be the head page
> > + * e.g. when start addr is not thp-size aligned.
> > + */
> > + if (page_increm > 1)
> > + WARN_ON_ONCE(
> > + try_grab_folio(compound_head(page),
> > + page_increm - 1,
> > + foll_flags) == NULL);
>
> I'm not sure this should be warning but otherwise ignoring this returning
> NULL? This feels like a case that could come up in realtiy,
> e.g. folio_ref_try_add_rcu() fails, or !folio_is_longterm_pinnable().

Note that we already hold at least one refcount on the folio (also
mentioned in the comment above this chunk of code), so both
folio_ref_try_add_rcu() and folio_is_longterm_pinnable() should already
have been called on the same folio and passed. If it were going to fail, it
should have failed already, afaict.

I still don't see how this could trigger unless the refcount overflows.

What I can do here is still guard this try_grab_folio() and fail the GUP if
for any reason it fails. It probably also means I'll keep the equivalent
one in hugetlb_follow_page_mask() untouched. But I suppose keeping the
WARN_ON_ONCE() is still proper.

Thanks,

--
Peter Xu


2023-06-19 19:59:04

by Peter Xu

Subject: Re: [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()

On Sat, Jun 17, 2023 at 09:40:41PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 13, 2023 at 05:53:46PM -0400, Peter Xu wrote:
> > Now __get_user_pages() should be well prepared to handle thp completely,
> > as long as hugetlb gup requests even without the hugetlb's special path.
> >
> > Time to retire follow_hugetlb_page().
>
> Nit, but there's a couple left over references to this function in comments
> in fs/userfaultfd.c and mm/hugetlb.c.

Indeed, let me touch those up too when I respin.

> Absolutely wonderful to see such delightful code deletion :)
>
> I haven't really dug into the huge page side of this in great detail, so I
> can't give a meaningful tag, but I can certainly appreciate the code you're
> removing here!

Yeah, hopefully it'll also make the code cleaner. Thanks a lot,

--
Peter Xu


2023-06-19 20:34:52

by Peter Xu

Subject: Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"

On Mon, Jun 19, 2023 at 03:37:30PM -0400, Peter Xu wrote:
> Here what I can do is still guard this try_grab_folio() and fail the GUP if
> for any reason it failed. Perhaps then it means I'll also keep that one
> untouched in hugetlb_follow_page_mask() too. But I suppose keeping the
> WARN_ON_ONCE() seems still proper.

Here's the outcome that I plan to post in the new version, taking care of
try_grab_folio() failures if they ever happen, and meanwhile removing the
redundant compound_head() on the page.

__get_user_pages():
...
===8<===
                        /*
                         * This must be a large folio (and doesn't need to
                         * be the whole folio; it can be part of it), do
                         * the refcount work for all the subpages too.
                         *
                         * NOTE: here the page may not be the head page
                         * e.g. when start addr is not thp-size aligned.
                         * try_grab_folio() should have taken care of tail
                         * pages.
                         */
                        if (page_increm > 1) {
                                struct folio *folio;

                                /*
                                 * Since we already hold refcount on the
                                 * large folio, this should never fail.
                                 */
                                folio = try_grab_folio(page, page_increm - 1,
                                                       foll_flags);
                                if (WARN_ON_ONCE(!folio)) {
                                        /*
                                         * Release the 1st page ref if the
                                         * folio is problematic, fail hard.
                                         */
                                        gup_put_folio(page_folio(page), 1,
                                                      foll_flags);
                                        ret = -EFAULT;
                                        goto out;
                                }
                        }
===8<===

Thanks,

--
Peter Xu