v2:
- Added R-bs
- Merged patch 2 & 4 into a single patch [David, Mike]
- Remove redundant compound_head() [Matthew]
- Still handle failure cases of try_get_folio/page() [Lorenzo]
- Modified more comments in the last removal patch [Lorenzo]
- Fixed hugetlb npages>1 longterm pin issue
- Added two testcase patches at the end
Hugetlb has a special path for slow gup in which follow_page_mask() is
skipped completely along with faultin_page(). It's not only confusing, but
it also duplicates a lot of logic that generic gup already has, keeping
hugetlb slightly special.
This patchset tries to deduplicate that logic: first touch up the slow gup
code so it can handle hugetlb pages correctly with the current follow-page
and faultin routines (we're mostly there already; ten years ago we did try
to optimize thp, but only got halfway; more below), then in the last patch
drop the special path, so hugetlb gup always goes through the generic
routine via faultin_page() too.
Note that hugetlb is still special for gup, mostly due to the pgtable
walking (hugetlb_walk()) that we rely on, which is currently per-arch. But
this is still one small step forward, and the diffstat is perhaps proof
enough that it's worthwhile.
Then for the "speed up thp" side: as a side effect, while looking at this
chunk of code I found that thp support is actually only partially done. It
doesn't mean that thp won't work for gup, but as long as a **pages pointer
is passed in, the optimization is skipped. Patch 5 should address that, so
for thp we now get full-speed gup.
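To illustrate the idea (a rough sketch only, not the actual diff of patch 5,
and it leaves out the extra folio reference handling that the real change
needs): once follow_page_mask() reports a huge mapping via ctx.page_mask,
the __get_user_pages() loop can fill every covered pages[] slot in one go:

	page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
	if (page_increm > nr_pages)
		page_increm = nr_pages;

	if (pages) {
		unsigned int j;

		/* Fill all subpages covered by this huge mapping at once */
		for (j = 0; j < page_increm; j++) {
			struct page *subpage = nth_page(page, j);

			pages[i + j] = subpage;
			flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
			flush_dcache_page(subpage);
		}
	}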
For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" goes
from 13992.50us to 378.50us for me. gup_test is an extreme case, but it
shows how much this affects thp gups.
Patches 1-5: prepare for the switch
Patch 6: switch over to the new code and remove the old
Patches 7-8: add a gup test matrix to run_vmtests.sh
Please review, thanks.
Peter Xu (8):
mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
mm/gup: Cleanup next_page handling
mm/gup: Accelerate thp gup even for "pages != NULL"
mm/gup: Retire follow_hugetlb_page()
selftests/mm: Add -a to run_vmtests.sh
selftests/mm: Add gup test matrix in run_vmtests.sh
fs/userfaultfd.c | 2 +-
include/linux/hugetlb.h | 20 +-
mm/gup.c | 83 ++++---
mm/hugetlb.c | 256 +++-------------------
tools/testing/selftests/mm/run_vmtests.sh | 48 +++-
5 files changed, 119 insertions(+), 290 deletions(-)
--
2.40.1
Allow specifying optional tests in run_vmtests.sh, so that the time-consuming
test matrix only runs when the user specifies "-a".
Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/mm/run_vmtests.sh | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 3f26f6e15b2a..824e651f62f4 100644
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -12,11 +12,14 @@ exitcode=0
usage() {
cat <<EOF
-usage: ${BASH_SOURCE[0]:-$0} [ -h | -t "<categories>"]
+usage: ${BASH_SOURCE[0]:-$0} [ options ]
+
+ -a: run all tests, including extra ones
-t: specify specific categories to tests to run
-h: display this message
-The default behavior is to run all tests.
+The default behavior is to run required tests only. If -a is specified,
+will run all tests.
Alternatively, specific groups tests can be run by passing a string
to the -t argument containing one or more of the following categories
@@ -60,9 +63,11 @@ EOF
exit 0
}
+RUN_ALL=false
-while getopts "ht:" OPT; do
+while getopts "aht:" OPT; do
case ${OPT} in
+ "a") RUN_ALL=true ;;
"h") usage ;;
"t") VM_SELFTEST_ITEMS=${OPTARG} ;;
esac
--
2.40.1
follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
target of FOLL_WRITE either. However add the checks.
Namely, either the need to CoW due to missing write bit, or proper CoR on
!AnonExclusive pages over R/O pins to reject the follow page. That brings
this function closer to follow_hugetlb_page().
So we don't care before, and also for now. But we'll care if we switch
over slow-gup to use hugetlb_follow_page_mask(). We'll also care when to
return -EMLINK properly, as that's the gup internal api to mean "we should
do CoR". Not really needed for follow page path, though.
When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
clear that it just should never fail.
Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f75f5e78ff0b..9a6918c4250a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6463,13 +6463,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
spinlock_t *ptl;
pte_t *pte, entry;
- /*
- * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
- * follow_hugetlb_page().
- */
- if (WARN_ON_ONCE(flags & FOLL_PIN))
- return NULL;
-
hugetlb_vma_lock_read(vma);
pte = hugetlb_walk(vma, haddr, huge_page_size(h));
if (!pte)
@@ -6478,8 +6471,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
ptl = huge_pte_lock(h, mm, pte);
entry = huge_ptep_get(pte);
if (pte_present(entry)) {
- page = pte_page(entry) +
- ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+ page = pte_page(entry);
+
+ if (gup_must_unshare(vma, flags, page)) {
+ /* Tell the caller to do Copy-On-Read */
+ page = ERR_PTR(-EMLINK);
+ goto out;
+ }
+
+ if ((flags & FOLL_WRITE) && !pte_write(entry)) {
+ page = NULL;
+ goto out;
+ }
+
+ page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+
/*
* Note that page may be a sub-page, and with vmemmap
* optimizations the page struct may be read only.
--
2.40.1
Firstly, no_page_table() is meaningless for hugetlb, being a no-op there,
because a hugetlb vma always satisfies:
- vma_is_anonymous() == false
- vma->vm_ops->fault != NULL
So we can already safely remove it in hugetlb_follow_page_mask(), along
with the page* variable.
Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
dump: we try to fault in the page only if the page cache is already
allocated. Let's do the same here for follow_page_mask() on hugetlb.
It should have zero effect on real dumps so far, because those still go
through follow_hugetlb_page(). But it may start to influence follow_page()
users that mimic a "dump page" scenario, hopefully in a good way. This also
paves the way for unifying the hugetlb gup-slow path.
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 9 ++-------
mm/hugetlb.c | 9 +++++++++
2 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index ce14d4d28503..abcd841d94b7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -767,7 +767,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
struct follow_page_context *ctx)
{
pgd_t *pgd;
- struct page *page;
struct mm_struct *mm = vma->vm_mm;
ctx->page_mask = 0;
@@ -780,12 +779,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
* hugetlb_follow_page_mask is only for follow_page() handling here.
* Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
*/
- if (is_vm_hugetlb_page(vma)) {
- page = hugetlb_follow_page_mask(vma, address, flags);
- if (!page)
- page = no_page_table(vma, flags);
- return page;
- }
+ if (is_vm_hugetlb_page(vma))
+ return hugetlb_follow_page_mask(vma, address, flags);
pgd = pgd_offset(mm, address);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d76574425da3..f75f5e78ff0b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6498,6 +6498,15 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
spin_unlock(ptl);
out_unlock:
hugetlb_vma_unlock_read(vma);
+
+ /*
+ * Fixup retval for dump requests: if pagecache doesn't exist,
+ * don't try to allocate a new page but just skip it.
+ */
+ if (!page && (flags & FOLL_DUMP) &&
+ !hugetlbfs_pagecache_present(h, vma, address))
+ page = ERR_PTR(-EFAULT);
+
return page;
}
--
2.40.1
Add a matrix for testing gup based on the current gup_test. Only run the
matrix when -a is specified, because it's a bit slow.
It covers:
- Different types of huge pages: thp, hugetlb, or no huge page
- Permissions: write / read-only
- Fast-gup: with / without
- Types of GUP: pin / gup / longterm pin
- Shared / private memory
- GUP size: 1 / 512 / 123 pages (the last exercises partial gup over a huge mapping)
Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/mm/run_vmtests.sh | 37 ++++++++++++++++++++---
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 824e651f62f4..9666c0c171ab 100644
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -90,6 +90,30 @@ test_selected() {
fi
}
+run_gup_matrix() {
+ # -t: thp=on, -T: thp=off, -H: hugetlb=on
+ local hugetlb_mb=$(( needmem_KB / 1024 ))
+
+ for huge in -t -T "-H -m $hugetlb_mb"; do
+ # -u: gup-fast, -U: gup-basic, -a: pin-fast, -b: pin-basic, -L: pin-longterm
+ for test_cmd in -u -U -a -b -L; do
+ # -w: write=1, -W: write=0
+ for write in -w -W; do
+ # -S: shared
+ for share in -S " "; do
+ # -n: How many pages to fetch together? 512 is special
+ # because it's default thp size (or 2M on x86), 123 to
+ # just test partial gup when hit a huge in whatever form
+ for num in "-n 1" "-n 512" "-n 123"; do
+ CATEGORY="gup_test" run_test ./gup_test \
+ $huge $test_cmd $write $share $num
+ done
+ done
+ done
+ done
+ done
+}
+
# get huge pagesize and freepages from /proc/meminfo
while read -r name size unit; do
if [ "$name" = "HugePages_Free:" ]; then
@@ -194,13 +218,16 @@ fi
CATEGORY="mmap" run_test ./map_fixed_noreplace
-# get_user_pages_fast() benchmark
-CATEGORY="gup_test" run_test ./gup_test -u
-# pin_user_pages_fast() benchmark
-CATEGORY="gup_test" run_test ./gup_test -a
+if $RUN_ALL; then
+ run_gup_matrix
+else
+ # get_user_pages_fast() benchmark
+ CATEGORY="gup_test" run_test ./gup_test -u
+ # pin_user_pages_fast() benchmark
+ CATEGORY="gup_test" run_test ./gup_test -a
+fi
# Dump pages 0, 19, and 4096, using pin_user_pages:
CATEGORY="gup_test" run_test ./gup_test -ct -F 0x1 0 19 0x1000
-
CATEGORY="gup_test" run_test ./gup_longterm
CATEGORY="userfaultfd" run_test ./uffd-unit-tests
--
2.40.1
Now __get_user_pages() should be well prepared to handle thp completely, as
well as hugetlb gup requests, even without hugetlb's special path.
Time to retire follow_hugetlb_page().
Tweak misc comments to reflect the reality of follow_hugetlb_page()'s removal.
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 2 +-
include/linux/hugetlb.h | 12 ---
mm/gup.c | 19 ----
mm/hugetlb.c | 224 ----------------------------------------
4 files changed, 1 insertion(+), 256 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..ae711f1d7a83 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -427,7 +427,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
*
* We also don't do userfault handling during
* coredumping. hugetlbfs has the special
- * follow_hugetlb_page() to skip missing pages in the
+ * hugetlb_follow_page_mask() to skip missing pages in the
* FOLL_DUMP case, anon memory also checks for FOLL_DUMP with
* the no_page_table() helper in follow_page_mask(), but the
* shmem_vm_ops->fault method is invoked even during
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2e2d89e79d6c..bb5024718fc1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -133,9 +133,6 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
unsigned int *page_mask);
-long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
- struct page **, unsigned long *, unsigned long *,
- long, unsigned int, int *);
void unmap_hugepage_range(struct vm_area_struct *,
unsigned long, unsigned long, struct page *,
zap_flags_t);
@@ -305,15 +302,6 @@ static inline struct page *hugetlb_follow_page_mask(
BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
}
-static inline long follow_hugetlb_page(struct mm_struct *mm,
- struct vm_area_struct *vma, struct page **pages,
- unsigned long *position, unsigned long *nr_pages,
- long i, unsigned int flags, int *nonblocking)
-{
- BUG();
- return 0;
-}
-
static inline int copy_hugetlb_page_range(struct mm_struct *dst,
struct mm_struct *src,
struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index b50272012e49..e6c1e524bd6b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -775,9 +775,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
* Call hugetlb_follow_page_mask for hugetlb vmas as it will use
* special hugetlb page table walking code. This eliminates the
* need to check for hugetlb entries in the general walking code.
- *
- * hugetlb_follow_page_mask is only for follow_page() handling here.
- * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
*/
if (is_vm_hugetlb_page(vma))
return hugetlb_follow_page_mask(vma, address, flags,
@@ -1138,22 +1135,6 @@ static long __get_user_pages(struct mm_struct *mm,
ret = check_vma_flags(vma, gup_flags);
if (ret)
goto out;
-
- if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages,
- &start, &nr_pages, i,
- gup_flags, locked);
- if (!*locked) {
- /*
- * We've got a VM_FAULT_RETRY
- * and we've lost mmap_lock.
- * We must stop here.
- */
- BUG_ON(gup_flags & FOLL_NOWAIT);
- goto out;
- }
- continue;
- }
}
retry:
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fbf6a09c0ec4..da4c76bee01f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5721,7 +5721,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Return whether there is a pagecache page to back given address within VMA.
- * Caller follow_hugetlb_page() holds page_table_lock so we cannot lock_page.
*/
static bool hugetlbfs_pagecache_present(struct hstate *h,
struct vm_area_struct *vma, unsigned long address)
@@ -6422,37 +6421,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
}
#endif /* CONFIG_USERFAULTFD */
-static void record_subpages(struct page *page, struct vm_area_struct *vma,
- int refs, struct page **pages)
-{
- int nr;
-
- for (nr = 0; nr < refs; nr++) {
- if (likely(pages))
- pages[nr] = nth_page(page, nr);
- }
-}
-
-static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
- unsigned int flags, pte_t *pte,
- bool *unshare)
-{
- pte_t pteval = huge_ptep_get(pte);
-
- *unshare = false;
- if (is_swap_pte(pteval))
- return true;
- if (huge_pte_write(pteval))
- return false;
- if (flags & FOLL_WRITE)
- return true;
- if (gup_must_unshare(vma, flags, pte_page(pteval))) {
- *unshare = true;
- return true;
- }
- return false;
-}
-
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
unsigned int *page_mask)
@@ -6519,198 +6487,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
return page;
}
-long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page **pages, unsigned long *position,
- unsigned long *nr_pages, long i, unsigned int flags,
- int *locked)
-{
- unsigned long pfn_offset;
- unsigned long vaddr = *position;
- unsigned long remainder = *nr_pages;
- struct hstate *h = hstate_vma(vma);
- int err = -EFAULT, refs;
-
- while (vaddr < vma->vm_end && remainder) {
- pte_t *pte;
- spinlock_t *ptl = NULL;
- bool unshare = false;
- int absent;
- struct page *page;
-
- /*
- * If we have a pending SIGKILL, don't keep faulting pages and
- * potentially allocating memory.
- */
- if (fatal_signal_pending(current)) {
- remainder = 0;
- break;
- }
-
- hugetlb_vma_lock_read(vma);
- /*
- * Some archs (sparc64, sh*) have multiple pte_ts to
- * each hugepage. We have to make sure we get the
- * first, for the page indexing below to work.
- *
- * Note that page table lock is not held when pte is null.
- */
- pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
- huge_page_size(h));
- if (pte)
- ptl = huge_pte_lock(h, mm, pte);
- absent = !pte || huge_pte_none(huge_ptep_get(pte));
-
- /*
- * When coredumping, it suits get_dump_page if we just return
- * an error where there's an empty slot with no huge pagecache
- * to back it. This way, we avoid allocating a hugepage, and
- * the sparse dumpfile avoids allocating disk blocks, but its
- * huge holes still show up with zeroes where they need to be.
- */
- if (absent && (flags & FOLL_DUMP) &&
- !hugetlbfs_pagecache_present(h, vma, vaddr)) {
- if (pte)
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- remainder = 0;
- break;
- }
-
- /*
- * We need call hugetlb_fault for both hugepages under migration
- * (in which case hugetlb_fault waits for the migration,) and
- * hwpoisoned hugepages (in which case we need to prevent the
- * caller from accessing to them.) In order to do this, we use
- * here is_swap_pte instead of is_hugetlb_entry_migration and
- * is_hugetlb_entry_hwpoisoned. This is because it simply covers
- * both cases, and because we can't follow correct pages
- * directly from any kind of swap entries.
- */
- if (absent ||
- __follow_hugetlb_must_fault(vma, flags, pte, &unshare)) {
- vm_fault_t ret;
- unsigned int fault_flags = 0;
-
- if (pte)
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
-
- if (flags & FOLL_WRITE)
- fault_flags |= FAULT_FLAG_WRITE;
- else if (unshare)
- fault_flags |= FAULT_FLAG_UNSHARE;
- if (locked) {
- fault_flags |= FAULT_FLAG_ALLOW_RETRY |
- FAULT_FLAG_KILLABLE;
- if (flags & FOLL_INTERRUPTIBLE)
- fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
- }
- if (flags & FOLL_NOWAIT)
- fault_flags |= FAULT_FLAG_ALLOW_RETRY |
- FAULT_FLAG_RETRY_NOWAIT;
- if (flags & FOLL_TRIED) {
- /*
- * Note: FAULT_FLAG_ALLOW_RETRY and
- * FAULT_FLAG_TRIED can co-exist
- */
- fault_flags |= FAULT_FLAG_TRIED;
- }
- ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
- if (ret & VM_FAULT_ERROR) {
- err = vm_fault_to_errno(ret, flags);
- remainder = 0;
- break;
- }
- if (ret & VM_FAULT_RETRY) {
- if (locked &&
- !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
- *locked = 0;
- *nr_pages = 0;
- /*
- * VM_FAULT_RETRY must not return an
- * error, it will return zero
- * instead.
- *
- * No need to update "position" as the
- * caller will not check it after
- * *nr_pages is set to 0.
- */
- return i;
- }
- continue;
- }
-
- pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
- page = pte_page(huge_ptep_get(pte));
-
- VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
- !PageAnonExclusive(page), page);
-
- /*
- * If subpage information not requested, update counters
- * and skip the same_page loop below.
- */
- if (!pages && !pfn_offset &&
- (vaddr + huge_page_size(h) < vma->vm_end) &&
- (remainder >= pages_per_huge_page(h))) {
- vaddr += huge_page_size(h);
- remainder -= pages_per_huge_page(h);
- i += pages_per_huge_page(h);
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- continue;
- }
-
- /* vaddr may not be aligned to PAGE_SIZE */
- refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
- (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
-
- if (pages)
- record_subpages(nth_page(page, pfn_offset),
- vma, refs,
- likely(pages) ? pages + i : NULL);
-
- if (pages) {
- /*
- * try_grab_folio() should always succeed here,
- * because: a) we hold the ptl lock, and b) we've just
- * checked that the huge page is present in the page
- * tables. If the huge page is present, then the tail
- * pages must also be present. The ptl prevents the
- * head page and tail pages from being rearranged in
- * any way. As this is hugetlb, the pages will never
- * be p2pdma or not longterm pinable. So this page
- * must be available at this point, unless the page
- * refcount overflowed:
- */
- if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
- flags))) {
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- remainder = 0;
- err = -ENOMEM;
- break;
- }
- }
-
- vaddr += (refs << PAGE_SHIFT);
- remainder -= refs;
- i += refs;
-
- spin_unlock(ptl);
- hugetlb_vma_unlock_read(vma);
- }
- *nr_pages = remainder;
- /*
- * setting position is actually required only if remainder is
- * not zero but it's faster not to add a "if (remainder)"
- * branch.
- */
- *position = vaddr;
-
- return i ? i : err;
-}
-
long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
--
2.40.1
follow_page() doesn't need it, but we'll start to need it when unifying gup
for hugetlb.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 8 +++++---
mm/gup.c | 3 ++-
mm/hugetlb.c | 5 ++++-
3 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index beb7c63d2871..2e2d89e79d6c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
struct vm_area_struct *, struct vm_area_struct *);
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags);
+ unsigned long address, unsigned int flags,
+ unsigned int *page_mask);
long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
struct page **, unsigned long *, unsigned long *,
long, unsigned int, int *);
@@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
{
}
-static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static inline struct page *hugetlb_follow_page_mask(
+ struct vm_area_struct *vma, unsigned long address, unsigned int flags,
+ unsigned int *page_mask)
{
BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
}
diff --git a/mm/gup.c b/mm/gup.c
index abcd841d94b7..9fc9271cba8d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
* Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
*/
if (is_vm_hugetlb_page(vma))
- return hugetlb_follow_page_mask(vma, address, flags);
+ return hugetlb_follow_page_mask(vma, address, flags,
+ &ctx->page_mask);
pgd = pgd_offset(mm, address);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a6918c4250a..fbf6a09c0ec4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
}
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+ unsigned long address, unsigned int flags,
+ unsigned int *page_mask)
{
struct hstate *h = hstate_vma(vma);
struct mm_struct *mm = vma->vm_mm;
@@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
page = NULL;
goto out;
}
+
+ *page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
}
out:
spin_unlock(ptl);
--
2.40.1
On 20.06.23 01:10, Peter Xu wrote:
> follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> target of FOLL_WRITE either. However add the checks.
>
> Namely, either the need to CoW due to missing write bit, or proper CoR on
s/CoR/unsharing/
> !AnonExclusive pages over R/O pins to reject the follow page. That brings
> this function closer to follow_hugetlb_page().
>
> So we don't care before, and also for now. But we'll care if we switch
> over slow-gup to use hugetlb_follow_page_mask(). We'll also care when to
> return -EMLINK properly, as that's the gup internal api to mean "we should
> do CoR". Not really needed for follow page path, though.
"we should unshare".
>
> When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> clear that it just should never fail.
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f75f5e78ff0b..9a6918c4250a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6463,13 +6463,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> spinlock_t *ptl;
> pte_t *pte, entry;
>
> - /*
> - * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> - * follow_hugetlb_page().
> - */
> - if (WARN_ON_ONCE(flags & FOLL_PIN))
> - return NULL;
> -
> hugetlb_vma_lock_read(vma);
> pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> if (!pte)
> @@ -6478,8 +6471,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> ptl = huge_pte_lock(h, mm, pte);
> entry = huge_ptep_get(pte);
> if (pte_present(entry)) {
> - page = pte_page(entry) +
> - ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> + page = pte_page(entry);
> +
> + if (gup_must_unshare(vma, flags, page)) {
All other callers (like follow_page_pte(), including
__follow_hugetlb_must_fault())
(a) check for write permissions first.
(b) check for gup_must_unshare() only if !pte_write(entry)
I'd vote to keep these checks as similar as possible to the other GUP code.
> + /* Tell the caller to do Copy-On-Read */
"Tell the caller to unshare".
> + page = ERR_PTR(-EMLINK);
> + goto out;
> + }
> +
> + if ((flags & FOLL_WRITE) && !pte_write(entry)) {
> + page = NULL;
> + goto out;
> + }
I'm confused about pte_write() vs. huge_pte_write(), and I don't know
what's right or wrong here.
--
Cheers,
David / dhildenb
On 20.06.23 01:10, Peter Xu wrote:
> follow_page() doesn't need it, but we'll start to need it when unifying gup
> for hugetlb.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/hugetlb.h | 8 +++++---
> mm/gup.c | 3 ++-
> mm/hugetlb.c | 5 ++++-
> 3 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index beb7c63d2871..2e2d89e79d6c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
> int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
> struct vm_area_struct *, struct vm_area_struct *);
> struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> - unsigned long address, unsigned int flags);
> + unsigned long address, unsigned int flags,
> + unsigned int *page_mask);
> long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> struct page **, unsigned long *, unsigned long *,
> long, unsigned int, int *);
> @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
> {
> }
>
> -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> - unsigned long address, unsigned int flags)
> +static inline struct page *hugetlb_follow_page_mask(
> + struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> + unsigned int *page_mask)
> {
> BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
> }
> diff --git a/mm/gup.c b/mm/gup.c
> index abcd841d94b7..9fc9271cba8d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
> * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
> */
> if (is_vm_hugetlb_page(vma))
> - return hugetlb_follow_page_mask(vma, address, flags);
> + return hugetlb_follow_page_mask(vma, address, flags,
> + &ctx->page_mask);
>
> pgd = pgd_offset(mm, address);
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9a6918c4250a..fbf6a09c0ec4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
> }
>
> struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> - unsigned long address, unsigned int flags)
> + unsigned long address, unsigned int flags,
> + unsigned int *page_mask)
> {
> struct hstate *h = hstate_vma(vma);
> struct mm_struct *mm = vma->vm_mm;
> @@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> page = NULL;
> goto out;
> }
> +
> + *page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
As discussed, can be simplified. But can be done on top (or not at all,
but it is confusing code).
Reviewed-by: David Hildenbrand <[email protected]>
--
Cheers,
David / dhildenb
On 20.06.23 01:10, Peter Xu wrote:
> follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> target of FOLL_WRITE either. However add the checks.
>
> Namely, either the need to CoW due to missing write bit, or proper CoR on
> !AnonExclusive pages over R/O pins to reject the follow page. That brings
> this function closer to follow_hugetlb_page().
>
> So we don't care before, and also for now. But we'll care if we switch
> over slow-gup to use hugetlb_follow_page_mask(). We'll also care when to
> return -EMLINK properly, as that's the gup internal api to mean "we should
> do CoR". Not really needed for follow page path, though.
>
> When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> clear that it just should never fail.
Oh, and does this comment really belong into this patch or am I confused?
--
Cheers,
David / dhildenb
On Tue, Jun 20, 2023 at 05:22:02PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> > target of FOLL_WRITE either. However add the checks.
> >
> > Namely, either the need to CoW due to missing write bit, or proper CoR on
>
> s/CoR/unsharing/
>
> > !AnonExclusive pages over R/O pins to reject the follow page. That brings
> > this function closer to follow_hugetlb_page().
> >
> > So we don't care before, and also for now. But we'll care if we switch
> > over slow-gup to use hugetlb_follow_page_mask(). We'll also care when to
> > return -EMLINK properly, as that's the gup internal api to mean "we should
> > do CoR". Not really needed for follow page path, though.
>
> "we should unshare".
>
> >
> > When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> > clear that it just should never fail.
> >
> > Reviewed-by: Mike Kravetz <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/hugetlb.c | 24 +++++++++++++++---------
> > 1 file changed, 15 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f75f5e78ff0b..9a6918c4250a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6463,13 +6463,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > spinlock_t *ptl;
> > pte_t *pte, entry;
> > - /*
> > - * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > - * follow_hugetlb_page().
> > - */
> > - if (WARN_ON_ONCE(flags & FOLL_PIN))
> > - return NULL;
> > -
> > hugetlb_vma_lock_read(vma);
> > pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> > if (!pte)
> > @@ -6478,8 +6471,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > ptl = huge_pte_lock(h, mm, pte);
> > entry = huge_ptep_get(pte);
> > if (pte_present(entry)) {
> > - page = pte_page(entry) +
> > - ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> > + page = pte_page(entry);
> > +
> > + if (gup_must_unshare(vma, flags, page)) {
>
> All other callers (like follow_page_pte(), including
> __follow_hugetlb_must_fault())
>
> (a) check for write permissions first.
>
> (b) check for gup_must_unshare() only if !pte_write(entry)
>
> I'd vote to keep these checks as similar as possible to the other GUP code.
I'm pretty sure the order doesn't matter here since one for read and one
for write.. but sure I can switch the order.
>
> > + /* Tell the caller to do Copy-On-Read */
>
> "Tell the caller to unshare".
>
> > + page = ERR_PTR(-EMLINK);
> > + goto out;
> > + }
> > +
> > + if ((flags & FOLL_WRITE) && !pte_write(entry)) {
> > + page = NULL;
> > + goto out;
> > + }
>
>
> I'm confused about pte_write() vs. huge_pte_write(), and I don't know what's
> right or wrong here.
AFAICT, they should always be identical in code. But yeah.. I should just
use the huge_ version.
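Putting the two tweaks together, the present-pte path would then look
roughly like below (just a sketch of what I plan for v3, not posted code):

	page = pte_page(entry);

	if ((flags & FOLL_WRITE) && !huge_pte_write(entry)) {
		page = NULL;
		goto out;
	}

	if (gup_must_unshare(vma, flags, page)) {
		/* Tell the caller to unshare */
		page = ERR_PTR(-EMLINK);
		goto out;
	}

	page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);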
Thanks,
--
Peter Xu
On Tue, Jun 20, 2023 at 05:28:28PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> > target of FOLL_WRITE either. However add the checks.
> >
> > Namely, either the need to CoW due to missing write bit, or proper CoR on
> > !AnonExclusive pages over R/O pins to reject the follow page. That brings
> > this function closer to follow_hugetlb_page().
> >
> > So we don't care before, and also for now. But we'll care if we switch
> > over slow-gup to use hugetlb_follow_page_mask(). We'll also care when to
> > return -EMLINK properly, as that's the gup internal api to mean "we should
> > do CoR". Not really needed for follow page path, though.
> >
> > When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> > clear that it just should never fail.
>
> Oh, and does this comment really belong into this patch or am I confused?
Ah yeh, thanks for spotting. I used to have it in v1 but I kept the old
failure path here to partly address Lorenzo's worry; but then I forgot to
add the WARN_ON_ONCE back to guard. I'll remember to add that in v3.
--
Peter Xu
On Tue, Jun 20, 2023 at 05:23:09PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > follow_page() doesn't need it, but we'll start to need it when unifying gup
> > for hugetlb.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/hugetlb.h | 8 +++++---
> > mm/gup.c | 3 ++-
> > mm/hugetlb.c | 5 ++++-
> > 3 files changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index beb7c63d2871..2e2d89e79d6c 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
> > int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
> > struct vm_area_struct *, struct vm_area_struct *);
> > struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > - unsigned long address, unsigned int flags);
> > + unsigned long address, unsigned int flags,
> > + unsigned int *page_mask);
> > long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> > struct page **, unsigned long *, unsigned long *,
> > long, unsigned int, int *);
> > @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
> > {
> > }
> > -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > - unsigned long address, unsigned int flags)
> > +static inline struct page *hugetlb_follow_page_mask(
> > + struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> > + unsigned int *page_mask)
> > {
> > BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
> > }
> > diff --git a/mm/gup.c b/mm/gup.c
> > index abcd841d94b7..9fc9271cba8d 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
> > * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
> > */
> > if (is_vm_hugetlb_page(vma))
> > - return hugetlb_follow_page_mask(vma, address, flags);
> > + return hugetlb_follow_page_mask(vma, address, flags,
> > + &ctx->page_mask);
> > pgd = pgd_offset(mm, address);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 9a6918c4250a..fbf6a09c0ec4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
> > }
> > struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > - unsigned long address, unsigned int flags)
> > + unsigned long address, unsigned int flags,
> > + unsigned int *page_mask)
> > {
> > struct hstate *h = hstate_vma(vma);
> > struct mm_struct *mm = vma->vm_mm;
> > @@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > page = NULL;
> > goto out;
> > }
> > +
> > + *page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
>
> As discussed, can be simplified. But can be done on top (or not at all, but
> it is confusing code).
Since we decided to make this prettier.. At last I decided to go with this:
*page_mask = (1U << huge_page_order(h)) - 1;
The previous suggestion of PHYS_PFN() will do two shifts over PAGE_SIZE
(the other one in huge_page_size()) which might be unnecessary, also, PHYS_
can be slightly misleading too as prefix.
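To spell out the equivalence (just a side note, not part of the patch):
since huge_page_size(h) == PAGE_SIZE << huge_page_order(h), all three forms
compute the same mask:

	~huge_page_mask(h) >> PAGE_SHIFT	/* current v2 code */
	PHYS_PFN(huge_page_size(h)) - 1		/* earlier suggestion */
	(1U << huge_page_order(h)) - 1		/* what v3 will use */

The last one avoids computing the full size only to shift it back down by
PAGE_SHIFT.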
>
> Reviewed-by: David Hildenbrand <[email protected]>
I'll take this with above change, please shoot if not applicable. Thanks,
--
Peter Xu
On 20.06.23 18:28, Peter Xu wrote:
> On Tue, Jun 20, 2023 at 05:23:09PM +0200, David Hildenbrand wrote:
>> On 20.06.23 01:10, Peter Xu wrote:
>>> follow_page() doesn't need it, but we'll start to need it when unifying gup
>>> for hugetlb.
>>>
>>> Signed-off-by: Peter Xu <[email protected]>
>>> ---
>>> include/linux/hugetlb.h | 8 +++++---
>>> mm/gup.c | 3 ++-
>>> mm/hugetlb.c | 5 ++++-
>>> 3 files changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index beb7c63d2871..2e2d89e79d6c 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>>> int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
>>> struct vm_area_struct *, struct vm_area_struct *);
>>> struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>> - unsigned long address, unsigned int flags);
>>> + unsigned long address, unsigned int flags,
>>> + unsigned int *page_mask);
>>> long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>>> struct page **, unsigned long *, unsigned long *,
>>> long, unsigned int, int *);
>>> @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
>>> {
>>> }
>>> -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>> - unsigned long address, unsigned int flags)
>>> +static inline struct page *hugetlb_follow_page_mask(
>>> + struct vm_area_struct *vma, unsigned long address, unsigned int flags,
>>> + unsigned int *page_mask)
>>> {
>>> BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
>>> }
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index abcd841d94b7..9fc9271cba8d 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>>> * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
>>> */
>>> if (is_vm_hugetlb_page(vma))
>>> - return hugetlb_follow_page_mask(vma, address, flags);
>>> + return hugetlb_follow_page_mask(vma, address, flags,
>>> + &ctx->page_mask);
>>> pgd = pgd_offset(mm, address);
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9a6918c4250a..fbf6a09c0ec4 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
>>> }
>>> struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>> - unsigned long address, unsigned int flags)
>>> + unsigned long address, unsigned int flags,
>>> + unsigned int *page_mask)
>>> {
>>> struct hstate *h = hstate_vma(vma);
>>> struct mm_struct *mm = vma->vm_mm;
>>> @@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>> page = NULL;
>>> goto out;
>>> }
>>> +
>>> + *page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
>>
>> As discussed, can be simplified. But can be done on top (or not at all, but
>> it is confusing code).
>
> Since we decided to make this prettier.. At last I decided to go with this:
>
> *page_mask = (1U << huge_page_order(h)) - 1;
>
> The previous suggestion of PHYS_PFN() will do two shifts over PAGE_SIZE
> (the other one in huge_page_size()) which might be unnecessary, also, PHYS_
> can be slightly misleading too as prefix.
>
>>
>> Reviewed-by: David Hildenbrand <[email protected]>
>
> I'll take this with above change, please shoot if not applicable. Thanks,
>
Perfectly fine :)
--
Cheers,
David / dhildenb