2023-06-23 14:52:06

by Peter Xu

Subject: [PATCH v3 0/8] mm/gup: Unify hugetlb, speed up thp

v1: https://lore.kernel.org/r/[email protected]
v2: https://lore.kernel.org/r/[email protected]

v3:
- Added R-bs and A-bs
- Patch 2:
  - s/huge_pte_write/pte_write/ [David]
  - Add back the WARN_ON_ONCE I used to have in v1; change retval to
    the real errno from try_grab_page(), rather than NULL
- Patch 3: make the page_mask calculation slightly prettier [David]
- Other small cosmetic changes

Hugetlb has a special path for slow gup in which follow_page_mask() is
skipped entirely, along with faultin_page(). It's not only confusing, but
it also duplicates a lot of logic that generic gup already has, keeping
hugetlb slightly special.
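
For reference, the generic slow-gup loop has roughly the shape modeled
below. This is a standalone userspace sketch only: the real logic lives
in __get_user_pages() in mm/gup.c, the *_model() helpers are editorial
stand-ins for follow_page_mask() and faultin_page(), and vma handling,
gup flags, locking, and retry limits are all elided. The hugetlb special
path used to branch off before any of this ran:

#include <stdio.h>

static int present[8];          /* toy "page table": 1 == populated */

/* stand-in for follow_page_mask(): look up a page, 0 if absent */
static long follow_page_mask_model(unsigned long pfn)
{
        return present[pfn] ? (long)(0x1000 + pfn) : 0;
}

/* stand-in for faultin_page(): populate so the next lookup succeeds */
static void faultin_page_model(unsigned long pfn)
{
        present[pfn] = 1;
}

int main(void)
{
        unsigned long pfn;

        present[2] = 1;         /* one page already populated */
        for (pfn = 0; pfn < 8; pfn++) {
                long page = follow_page_mask_model(pfn);

                if (!page) {
                        /* lookup failed: fault the page in, then retry */
                        faultin_page_model(pfn);
                        page = follow_page_mask_model(pfn);
                }
                printf("pfn %lu -> page token %ld\n", pfn, page);
        }
        return 0;
}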

This patchset tries to deduplicate the logic: first touch up the slow gup
code so it can handle hugetlb pages correctly with the current follow-page
and faultin routines (we're mostly there already; 10 years ago we tried to
optimize thp but only got halfway, more below), then in patch 6 drop the
special path, so that hugetlb gup always goes through the generic routine
too, via faultin_page().

Note that hugetlb is still special for gup, mostly because of the pgtable
walking (hugetlb_walk()) we rely on, which is currently per-arch. But this
is still one small step forward, and the diffstat is some evidence that it
is worthwhile.

Then for the "speed up thp" side: as a side effect, while looking at this
chunk of code I found that thp support was only partially done. That
doesn't mean thp won't work for gup, but as long as the **pages pointer is
passed in, the optimization is skipped. Patch 5 addresses that, so for thp
we now get full-speed gup.
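
Concretely, the pre-series code in __get_user_pages() behaved like the
fragment below (simplified from the removed lines visible in patches 4
and 5; the inline comment is editorial): whenever the caller passed a
pages array, ctx.page_mask was reset to zero, forcing the loop to advance
one base page at a time even inside a thp.

        if (pages) {
                pages[i] = page;
                flush_anon_page(vma, page, start);
                flush_dcache_page(page);
                ctx.page_mask = 0;      /* defeats the huge-page stride */
        }
next_page:
        page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);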

For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" goes
from 13992.50us to 378.50us for me. gup_test is an extreme case, but it
shows how much this can affect thp gups.

Patches 1-5: prepare for the switch
Patch 6: switch over to the new code and remove the old
Patches 7-8: add a gup test matrix to run_vmtests.sh

Please review, thanks.

Peter Xu (8):
  mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
  mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  mm/gup: Cleanup next_page handling
  mm/gup: Accelerate thp gup even for "pages != NULL"
  mm/gup: Retire follow_hugetlb_page()
  selftests/mm: Add -a to run_vmtests.sh
  selftests/mm: Add gup test matrix in run_vmtests.sh

 fs/userfaultfd.c                          |   2 +-
 include/linux/hugetlb.h                   |  20 +-
 mm/gup.c                                  |  83 ++++---
 mm/hugetlb.c                              | 263 +++-------------------
 tools/testing/selftests/mm/run_vmtests.sh |  48 +++-
 5 files changed, 124 insertions(+), 292 deletions(-)

--
2.40.1



2023-06-23 14:53:55

by Peter Xu

Subject: [PATCH v3 4/8] mm/gup: Cleanup next_page handling

The only path that doesn't use the generic "**pages" handling is the gate
vma. Make it use the same path, and meanwhile move the next_page label up
to cover the "**pages" handling. This prepares for THP handling of
"**pages".

Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 9fc9271cba8d..4a00d609033e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1124,7 +1124,7 @@ static long __get_user_pages(struct mm_struct *mm,
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
-						pages ? &pages[i] : NULL);
+						pages ? &page : NULL);
 				if (ret)
 					goto out;
 				ctx.page_mask = 0;
@@ -1194,19 +1194,18 @@ static long __get_user_pages(struct mm_struct *mm,
 				ret = PTR_ERR(page);
 				goto out;
 			}
-
-			goto next_page;
 		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			goto out;
 		}
+next_page:
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
 			flush_dcache_page(page);
 			ctx.page_mask = 0;
 		}
-next_page:
+
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
 		if (page_increm > nr_pages)
 			page_increm = nr_pages;
--
2.40.1


2023-06-23 14:54:43

by Peter Xu

Subject: [PATCH v3 8/8] selftests/mm: Add gup test matrix in run_vmtests.sh

Add a matrix for testing gup based on the current gup_test. Only run the
matrix when -a is specified because it's a bit slow.

It covers:

- Different types of huge pages: thp, hugetlb, or no huge page
- Permissions: write / read-only
- Fast-gup: with / without
- Types of GUP: pin / gup / longterm pins
- Shared / private memory
- GUP size: 1 / 512 / an arbitrary number of pages (123)
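
With the -a switch added in the previous patch, the matrix can also be
run on its own, for example as below (assuming the script's existing -t
category selection; treat the exact invocation as illustrative):

  # ./run_vmtests.sh -a -t gup_test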

Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/mm/run_vmtests.sh | 37 ++++++++++++++++++++---
1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 824e651f62f4..9666c0c171ab 100644
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -90,6 +90,30 @@ test_selected() {
 	fi
 }
 
+run_gup_matrix() {
+    # -t: thp=on, -T: thp=off, -H: hugetlb=on
+    local hugetlb_mb=$(( needmem_KB / 1024 ))
+
+    for huge in -t -T "-H -m $hugetlb_mb"; do
+        # -u: gup-fast, -U: gup-basic, -a: pin-fast, -b: pin-basic, -L: pin-longterm
+        for test_cmd in -u -U -a -b -L; do
+            # -w: write=1, -W: write=0
+            for write in -w -W; do
+                # -S: shared
+                for share in -S " "; do
+                    # -n: how many pages to fetch together? 512 is special
+                    # because it's the default thp size (2M on x86); 123 to
+                    # test partial gup when hitting a huge page of any form
+                    for num in "-n 1" "-n 512" "-n 123"; do
+                        CATEGORY="gup_test" run_test ./gup_test \
+                            $huge $test_cmd $write $share $num
+                    done
+                done
+            done
+        done
+    done
+}
+
 # get huge pagesize and freepages from /proc/meminfo
 while read -r name size unit; do
 	if [ "$name" = "HugePages_Free:" ]; then
@@ -194,13 +218,16 @@ fi
 
 CATEGORY="mmap" run_test ./map_fixed_noreplace
 
-# get_user_pages_fast() benchmark
-CATEGORY="gup_test" run_test ./gup_test -u
-# pin_user_pages_fast() benchmark
-CATEGORY="gup_test" run_test ./gup_test -a
+if $RUN_ALL; then
+    run_gup_matrix
+else
+    # get_user_pages_fast() benchmark
+    CATEGORY="gup_test" run_test ./gup_test -u
+    # pin_user_pages_fast() benchmark
+    CATEGORY="gup_test" run_test ./gup_test -a
+fi
 # Dump pages 0, 19, and 4096, using pin_user_pages:
 CATEGORY="gup_test" run_test ./gup_test -ct -F 0x1 0 19 0x1000
-
 CATEGORY="gup_test" run_test ./gup_longterm
 
 CATEGORY="userfaultfd" run_test ./uffd-unit-tests
--
2.40.1


2023-06-23 14:55:28

by Peter Xu

Subject: [PATCH v3 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"

The acceleration of THP was done with ctx.page_mask; however, it is
ignored if **pages is non-NULL.

The old optimization was introduced in 2013 in commit 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages"). It didn't explain why
the **pages non-NULL case can't be optimized too. It's possible that at
the time the major goal was mm_populate(), for which this was enough.

Optimize thp for all cases by properly looping over each subpage, doing
cache flushes, and boosting refcounts/pincounts where needed in one go.
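
To illustrate the stepping math, here is a small standalone userspace
model (an editorial sketch, not kernel code; the PAGE_SHIFT of 12 and the
page_mask of 511 for a 2M thp are illustrative assumptions):

#include <stdio.h>

#define PAGE_SHIFT 12

/*
 * Model of the page_increm computation in __get_user_pages():
 * page_mask is (nr_subpages - 1) for a huge mapping and 0 for a normal
 * page, so this yields how many base pages remain from 'start' to the
 * end of the current huge page.
 */
static unsigned long page_increm(unsigned long start, unsigned long page_mask)
{
        return 1 + (~(start >> PAGE_SHIFT) & page_mask);
}

int main(void)
{
        /* start aligned to a 2M thp: all 512 subpages remain */
        printf("%lu\n", page_increm(0x200000, 511));    /* 512 */
        /* start 4K into the thp: 511 subpages remain */
        printf("%lu\n", page_increm(0x201000, 511));    /* 511 */
        /* normal page (page_mask == 0): always step by one */
        printf("%lu\n", page_increm(0x201000, 0));      /* 1 */
        return 0;
}

With page_increm known, the new code grabs the extra page_increm - 1
references with one try_grab_folio() call and fills
pages[i..i+page_increm-1] via nth_page(), instead of taking the outer
loop (and follow_page_mask()) once per subpage.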

This can be verified using gup_test below:

# chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10

Before: 13992.50us (+- 8.75%)
After:    378.50us (+-69.62%)

Reviewed-by: Lorenzo Stoakes <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 4a00d609033e..22e32cff9ac7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
 			goto out;
 		}
 next_page:
-		if (pages) {
-			pages[i] = page;
-			flush_anon_page(vma, page, start);
-			flush_dcache_page(page);
-			ctx.page_mask = 0;
-		}
-
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
 		if (page_increm > nr_pages)
 			page_increm = nr_pages;
+
+		if (pages) {
+			struct page *subpage;
+			unsigned int j;
+
+			/*
+			 * This must be a large folio (and doesn't need to
+			 * be the whole folio; it can be part of it), do
+			 * the refcount work for all the subpages too.
+			 *
+			 * NOTE: here the page may not be the head page
+			 * e.g. when start addr is not thp-size aligned.
+			 * try_grab_folio() should have taken care of tail
+			 * pages.
+			 */
+			if (page_increm > 1) {
+				struct folio *folio;
+
+				/*
+				 * Since we already hold refcount on the
+				 * large folio, this should never fail.
+				 */
+				folio = try_grab_folio(page, page_increm - 1,
+						       foll_flags);
+				if (WARN_ON_ONCE(!folio)) {
+					/*
+					 * Release the 1st page ref if the
+					 * folio is problematic, fail hard.
+					 */
+					gup_put_folio(page_folio(page), 1,
+						      foll_flags);
+					ret = -EFAULT;
+					goto out;
+				}
+			}
+
+			for (j = 0; j < page_increm; j++) {
+				subpage = nth_page(page, j);
+				pages[i + j] = subpage;
+				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
+				flush_dcache_page(subpage);
+			}
+		}
+
 		i += page_increm;
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
--
2.40.1


2023-06-26 08:32:03

by David Hildenbrand

Subject: Re: [PATCH v3 8/8] selftests/mm: Add gup test matrix in run_vmtests.sh

On 23.06.23 16:29, Peter Xu wrote:
> Add a matrix for testing gup based on the current gup_test. Only run the
> matrix when -a is specified because it's a bit slow.
>
> It covers:
>
> - Different types of huge pages: thp, hugetlb, or no huge page
> - Permissions: write / read-only
> - Fast-gup: with / without
> - Types of GUP: pin / gup / longterm pins
> - Shared / private memory
> - GUP size: 1 / 512 / an arbitrary number of pages (123)
>
> Signed-off-by: Peter Xu <[email protected]>
> ---

Acked-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb