v1: https://lore.kernel.org/r/[email protected]
v2: https://lore.kernel.org/r/[email protected]
v3: https://lore.kernel.org/r/[email protected]
v4:
- Patch 2: check pte write for unsharing [David]
- Added more tags, rebased to akpm/mm-unstable
Hugetlb has a special path for slow gup in which follow_page_mask() is
skipped completely along with faultin_page().  It's not only confusing, it
also duplicates a lot of logic that generic gup already has, making hugetlb
slightly special.
This patchset tries to dedup the logic: first touch up the slow gup code so
it can handle hugetlb pages correctly with the current follow page and
faultin routines (we're mostly there already; ten years ago we tried to
optimize thp there but only got half way, more below), then drop the
special path in the last patch, after which hugetlb gup always goes through
the generic routines too, faulting via faultin_page().
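For a picture of where this ends up, here is a rough sketch (not
necessarily the exact code of patch 6; follow_p4d_mask() and
no_page_table() are the existing mm/gup.c helpers assumed here) of
follow_page_mask() dispatching hugetlb vmas to the hugetlb walker, with
everything else, including faulting via faultin_page(), shared with
generic gup:

static struct page *follow_page_mask(struct vm_area_struct *vma,
				     unsigned long address, unsigned int flags,
				     struct follow_page_context *ctx)
{
	struct mm_struct *mm = vma->vm_mm;
	pgd_t *pgd;

	ctx->page_mask = 0;

	/*
	 * Sketch only: hugetlb keeps its own per-arch pgtable walk
	 * (hugetlb_walk()) behind hugetlb_follow_page_mask(), but the
	 * rest of slow gup, including the faultin_page() retry loop,
	 * is now shared.
	 */
	if (is_vm_hugetlb_page(vma))
		return hugetlb_follow_page_mask(vma, address, flags,
						&ctx->page_mask);

	pgd = pgd_offset(mm, address);
	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
		return no_page_table(vma, flags);

	/* Generic p4d/pud/pmd/pte walk, unchanged */
	return follow_p4d_mask(vma, address, pgd, flags, ctx);
}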
Note that hugetlb is still special for gup, mostly because of the pgtable
walking (hugetlb_walk()) that we rely on, which is currently per-arch.
Still, this is one small step forward, and the diffstat may be proof enough
that it's worthwhile.
Then for the "speed up thp" side: as a side effect, when I'm looking at the
chunk of code, I found that thp support is actually partially done. It
doesn't mean that thp won't work for gup, but as long as **pages pointer
passed over, the optimization will be skipped too. Patch 6 should address
that, so for thp we now get full speed gup.
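As an illustration of the idea (a rough sketch, not the actual patch; the
extra folio references that the additional subpages need are left out):
once follow_page_mask() reports a large mapping via ctx.page_mask,
__get_user_pages() can fill all the covered entries of **pages at once
instead of re-walking the pgtable for every PAGE_SIZE step:

		/* How many base pages this mapping covers from "start" */
		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
		if (page_increm > nr_pages)
			page_increm = nr_pages;

		if (pages) {
			unsigned int j;

			/* Record all subpages in one go (refcounting elided) */
			for (j = 0; j < page_increm; j++) {
				struct page *subpage = nth_page(page, j);

				pages[i + j] = subpage;
				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
				flush_dcache_page(subpage);
			}
		}

		i += page_increm;
		start += page_increm * PAGE_SIZE;
		nr_pages -= page_increm;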
For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" goes
from 13992.50us to 378.50us for me, roughly a 37x speedup.  gup_test is an
extreme case, but it shows how much this affects thp gups.
Patch 1-5: prepare for the switch
Patch 6: switch over to the new code and remove the old
Patch 7-8: add a gup test matrix to run_vmtests.sh
Please review, thanks.
Peter Xu (8):
mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
mm/gup: Cleanup next_page handling
mm/gup: Accelerate thp gup even for "pages != NULL"
mm/gup: Retire follow_hugetlb_page()
selftests/mm: Add -a to run_vmtests.sh
selftests/mm: Add gup test matrix in run_vmtests.sh
fs/userfaultfd.c | 2 +-
include/linux/hugetlb.h | 20 +-
mm/gup.c | 83 ++++---
mm/hugetlb.c | 265 +++-------------------
tools/testing/selftests/mm/run_vmtests.sh | 48 +++-
5 files changed, 126 insertions(+), 292 deletions(-)
--
2.41.0
follow_page() doesn't use FOLL_PIN, and hugetlb doesn't seem to be a target
of FOLL_WRITE either.  However, add the checks anyway.
Namely: either the need to CoW due to a missing write bit, or proper
unsharing of !AnonExclusive pages under R/O pins, in which case the follow
page is rejected.  That brings this function closer to
follow_hugetlb_page().
So we didn't care before, and we still don't for now.  But we will care
once slow gup is switched over to use hugetlb_follow_page_mask().  We'll
also care about returning -EMLINK properly, as that's the gup-internal API
for "we should unshare".  It's not really needed for the follow page path,
though.
While at it, switch try_grab_page() to use WARN_ON_ONCE(), to make it clear
that it should simply never fail here.  When an error does happen, capture
the errno instead of setting page==NULL.
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 33 ++++++++++++++++++++++-----------
1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d04ba5782fdd..4410139cf890 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6462,13 +6462,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
pte_t *pte, entry;
-
- /*
- * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
- * follow_hugetlb_page().
- */
- if (WARN_ON_ONCE(flags & FOLL_PIN))
- return NULL;
+ int ret;
hugetlb_vma_lock_read(vma);
pte = hugetlb_walk(vma, haddr, huge_page_size(h));
@@ -6478,8 +6472,23 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
ptl = huge_pte_lock(h, mm, pte);
entry = huge_ptep_get(pte);
if (pte_present(entry)) {
- page = pte_page(entry) +
- ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+ page = pte_page(entry);
+
+ if (!huge_pte_write(entry)) {
+ if (flags & FOLL_WRITE) {
+ page = NULL;
+ goto out;
+ }
+
+ if (gup_must_unshare(vma, flags, page)) {
+ /* Tell the caller to do unsharing */
+ page = ERR_PTR(-EMLINK);
+ goto out;
+ }
+ }
+
+ page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+
/*
* Note that page may be a sub-page, and with vmemmap
* optimizations the page struct may be read only.
@@ -6489,8 +6498,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
* try_grab_page() should always be able to get the page here,
* because we hold the ptl lock and have verified pte_present().
*/
- if (try_grab_page(page, flags)) {
- page = NULL;
+ ret = try_grab_page(page, flags);
+
+ if (WARN_ON_ONCE(ret)) {
+ page = ERR_PTR(ret);
goto out;
}
}
--
2.41.0
The only path that doesn't use the generic "**pages" handling is the gate
vma.  Make it use the same path, and move the next_page label up so that it
also covers the "**pages" handling.  This prepares for THP handling of
"**pages".
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 1e2e23084f3c..5af1be81390b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1124,7 +1124,7 @@ static long __get_user_pages(struct mm_struct *mm,
if (!vma && in_gate_area(mm, start)) {
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
- pages ? &pages[i] : NULL);
+ pages ? &page : NULL);
if (ret)
goto out;
ctx.page_mask = 0;
@@ -1194,19 +1194,18 @@ static long __get_user_pages(struct mm_struct *mm,
ret = PTR_ERR(page);
goto out;
}
-
- goto next_page;
} else if (IS_ERR(page)) {
ret = PTR_ERR(page);
goto out;
}
+next_page:
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
flush_dcache_page(page);
ctx.page_mask = 0;
}
-next_page:
+
page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
--
2.41.0