2023-11-16 01:29:42

by Peter Xu

Subject: [PATCH RFC 00/12] mm/gup: Unify hugetlb, part 2

This patchset is in RFC stage, mostly because it has so far only been
tested on x86_64 in a VM. It is not even compile tested on PPC or any
other arch, which means at least the hugepd patch (patch 11) is untested,
possibly not even compiling. Before going further, I'd like to collect
feedback on the high level approach.

If anyone would like to provide any testing, either over hugepd, or
CONT_PMD / CONT_PTE on ARM (before I reach there..), or 64K Svnapot on
RISCV, that'll be very much appreciated. I'm copying the PPC, ARM and
RISCV lists for that. It can be as simple as "./run_vmtests.sh -t gup_test
-a" for now, making sure hugetlb pages can be allocated along the way; the
non-hugetlb gup tests should be covered more or less anyway, I suppose.

In summary, this is a continuous work for previous series:

https://lore.kernel.org/all/[email protected]

And this is more or less my current take on moving one more small step
towards merging hugetlb code into generic mm code, as much as we can.

That part-1 series dropped follow_hugetlb_page(). The plan of this one is
to further drop hugetlb_follow_page_mask(). Hugetlb GUP will use the same
code path as generic mm after the whole set is applied.

It means the generic code will need to at least understand hugepd, and
that's already the case in fast-gup. Fortunately that seems to be the only
major thing I need to teach GUP in order to share the common path for now,
besides the normal huge PxD entries. Non-gup can be more challenging, but
that's a question for later.

Patch layout:
=============

Patch 1-4: Preparation works, mm generic part
Patch 5-6: Bugfixes; I think patch 5, if verified, can be merged earlier
Patch 7-11: Preparation works, gup part
Patch 12: Drop hugetlb_follow_page_mask()

More information can be found in the commit message of each patch. Any
comments are welcome. Thanks.

Peter Xu (12):
mm/hugetlb: Export hugetlbfs_pagecache_present()
mm: Provide generic pmd_thp_or_huge()
mm: Export HPAGE_PXD_* macros even if !THP
mm: Introduce vma_pgtable_walk_{begin|end}()
mm/gup: Fix follow_devmap_p[mu]d() to return even if NULL
mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
mm/gup: Refactor record_subpages() to find 1st small page
mm/gup: Handle hugetlb for no_page_table()
mm/gup: Handle huge pud for follow_pud_mask()
mm/gup: Handle huge pmd for follow_pmd_mask()
mm/gup: Handle hugepd for follow_page()
mm/gup: Merge hugetlb into generic mm code

include/linux/huge_mm.h | 34 +++----
include/linux/hugetlb.h | 10 +-
include/linux/mm.h | 3 +
include/linux/pgtable.h | 4 +
mm/gup.c | 197 +++++++++++++++++++++++++---------------
mm/huge_memory.c | 117 +++++++++++++-----------
mm/hugetlb.c | 75 +--------------
mm/internal.h | 6 +-
mm/memory.c | 12 +++
9 files changed, 233 insertions(+), 225 deletions(-)

--
2.41.0


2023-11-16 01:29:46

by Peter Xu

Subject: [PATCH RFC 05/12] mm/gup: Fix follow_devmap_p[mu]d() to return even if NULL

This seems to be a bug found by code observation rather than by any report.

When GUP sees a devpmd or devpud, it should return whatever value is
returned from follow_devmap_p[mu]d(). If page==NULL is returned, it means
a fault is probably required. Skipping the return of NULL lets the code
fall through, which can cause unexpected behavior.

It has been there since at least before the follow page rework (080dbb618b)
in 2017, so for 6 years. I have not yet dug for a Fixes tag, assuming it
can hardly trigger even if the logical bug does exist.

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a8b73a8289ad..0e00204761d2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -708,8 +708,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
spin_unlock(ptl);
- if (page)
- return page;
+ return page;
}
if (likely(!pmd_trans_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
@@ -756,8 +755,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
spin_unlock(ptl);
- if (page)
- return page;
+ return page;
}
if (unlikely(pud_bad(*pud)))
return no_page_table(vma, flags);
--
2.41.0

2023-11-16 01:29:51

by Peter Xu

Subject: [PATCH RFC 07/12] mm/gup: Refactor record_subpages() to find 1st small page

All the fast-gup functions take a tail page to operate on, and always need
to do page mask calculations before feeding that into record_subpages().

Merge that logic into record_subpages(), so that we always take a head
page, and leave the rest of the calculation to record_subpages().

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 424d45e1afb3..69dae51f3eb1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2763,11 +2763,14 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
}
#endif

-static int record_subpages(struct page *page, unsigned long addr,
- unsigned long end, struct page **pages)
+static int record_subpages(struct page *head, unsigned long sz,
+ unsigned long addr, unsigned long end,
+ struct page **pages)
{
+ struct page *page;
int nr;

+ page = nth_page(head, (addr & (sz - 1)) >> PAGE_SHIFT);
for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
pages[nr] = nth_page(page, nr);

@@ -2804,8 +2807,8 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));

- page = nth_page(pte_page(pte), (addr & (sz - 1)) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pte_page(pte);
+ refs = record_subpages(page, sz, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2870,8 +2873,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
pages, nr);
}

- page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pmd_page(orig);
+ refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2914,8 +2917,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
pages, nr);
}

- page = nth_page(pud_page(orig), (addr & ~PUD_MASK) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pud_page(orig);
+ refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2954,8 +2957,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,

BUILD_BUG_ON(pgd_devmap(orig));

- page = nth_page(pgd_page(orig), (addr & ~PGDIR_MASK) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pgd_page(orig);
+ refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
--
2.41.0

2023-11-16 01:29:54

by Peter Xu

Subject: [PATCH RFC 09/12] mm/gup: Handle huge pud for follow_pud_mask()

Teach follow_pud_mask() to handle normal PUD pages like hugetlb.

Rename follow_devmap_pud() to follow_huge_pud() and move it out of
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD; instead let that config option
only cover the devmap special operations, like pgmap.

The new follow_huge_pud() takes care of possible CoR for hugetlb if
necessary.

While at it, optimize the non-present check by adding an early
pud_present() check before taking the pgtable lock, failing follow_page()
early if the PUD is not present: that is required by both devmap and
hugetlb. Use pud_huge() to also cover the pud_devmap() case.

We need to export "struct follow_page_context" along the way, so that
huge_memory.c can understand it.

One more trivial thing to mention: introduce a local "pud_t pud" in the
code paths along the way, so the code doesn't dereference *pudp multiple
times. Not only does that read more straightforwardly, it also avoids the
question of whether a racing modification of *pudp could be observed as
different values across the dereferences.

Set ctx->page_mask properly for a PUD entry. As a side effect, this should
also allow devmap GUP on PUD to jump over the whole PUD range, but that is
not yet verified.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/huge_mm.h | 17 +++----
mm/gup.c | 22 ++++-----
mm/huge_memory.c | 98 +++++++++++++++++++++++------------------
3 files changed, 73 insertions(+), 64 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ec463410aecc..84815012d3cf 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -7,6 +7,11 @@

#include <linux/fs.h> /* only for vma_is_dax() */

+struct follow_page_context {
+ struct dev_pagemap *pgmap;
+ unsigned int page_mask;
+};
+
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -222,8 +227,6 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)

struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, int flags, struct dev_pagemap **pgmap);

vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);

@@ -372,18 +375,16 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
return NULL;
}

-static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
- unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
- return NULL;
-}
-
static inline bool thp_migration_supported(void)
{
return false;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t *pud, int flags,
+ struct follow_page_context *ctx);
+
static inline int split_folio_to_list(struct folio *folio,
struct list_head *list)
{
diff --git a/mm/gup.c b/mm/gup.c
index 89c1584d68f0..55a2ae55f00f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -25,11 +25,6 @@

#include "internal.h"

-struct follow_page_context {
- struct dev_pagemap *pgmap;
- unsigned int page_mask;
-};
-
static inline void sanity_check_pinned_pages(struct page **pages,
unsigned long npages)
{
@@ -751,24 +746,25 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
unsigned int flags,
struct follow_page_context *ctx)
{
- pud_t *pud;
+ pud_t *pudp, pud;
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;

- pud = pud_offset(p4dp, address);
- if (pud_none(*pud))
+ pudp = pud_offset(p4dp, address);
+ pud = *pudp;
+ if (pud_none(pud) || !pud_present(pud))
return no_page_table(vma, flags, address);
- if (pud_devmap(*pud)) {
- ptl = pud_lock(mm, pud);
- page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
+ if (pud_huge(pud)) {
+ ptl = pud_lock(mm, pudp);
+ page = follow_huge_pud(vma, address, pudp, flags, ctx);
spin_unlock(ptl);
return page;
}
- if (unlikely(pud_bad(*pud)))
+ if (unlikely(pud_bad(pud)))
return no_page_table(vma, flags, address);

- return follow_pmd_mask(vma, address, pud, flags, ctx);
+ return follow_pmd_mask(vma, address, pudp, flags, ctx);
}

static struct page *follow_p4d_mask(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6eb55f97a3d2..6748ef5f3fd9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1207,49 +1207,6 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pud(vma, addr, pud);
}

-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
- unsigned long pfn = pud_pfn(*pud);
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int ret;
-
- assert_spin_locked(pud_lockptr(mm, pud));
-
- if (flags & FOLL_WRITE && !pud_write(*pud))
- return NULL;
-
- if (pud_present(*pud) && pud_devmap(*pud))
- /* pass */;
- else
- return NULL;
-
- if (flags & FOLL_TOUCH)
- touch_pud(vma, addr, pud, flags & FOLL_WRITE);
-
- /*
- * device mapped pages can only be returned if the
- * caller will manage the page reference count.
- *
- * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
-
- ret = try_grab_page(page, flags);
- if (ret)
- page = ERR_PTR(ret);
-
- return page;
-}
-
int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
struct vm_area_struct *vma)
@@ -1305,6 +1262,61 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
}
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */

+struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t *pudp, int flags,
+ struct follow_page_context *ctx)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ pud_t pud = *pudp;
+ unsigned long pfn = pud_pfn(pud);
+ int ret;
+
+ assert_spin_locked(pud_lockptr(mm, pudp));
+
+ if ((flags & FOLL_WRITE) && !pud_write(pud))
+ return NULL;
+
+ if (!pud_present(pud))
+ return NULL;
+
+ pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ if (pud_devmap(pud)) {
+ /*
+ * device mapped pages can only be returned if the caller
+ * will manage the page reference count.
+ *
+ * At least one of FOLL_GET | FOLL_PIN must be set, so
+ * assert that here:
+ */
+ if (!(flags & (FOLL_GET | FOLL_PIN)))
+ return ERR_PTR(-EEXIST);
+
+ if (flags & FOLL_TOUCH)
+ touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
+
+ ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
+ if (!ctx->pgmap)
+ return ERR_PTR(-EFAULT);
+ }
+#endif
+ page = pfn_to_page(pfn);
+
+ if (!pud_devmap(pud) && !pud_write(pud) &&
+ gup_must_unshare(vma, flags, page))
+ return ERR_PTR(-EMLINK);
+
+ ret = try_grab_page(page, flags);
+ if (ret)
+ page = ERR_PTR(ret);
+ else
+ ctx->page_mask = HPAGE_PUD_NR - 1;
+
+ return page;
+}
+
void huge_pmd_set_accessed(struct vm_fault *vmf)
{
bool write = vmf->flags & FAULT_FLAG_WRITE;
--
2.41.0

2023-11-16 01:30:03

by Peter Xu

Subject: [PATCH RFC 08/12] mm/gup: Handle hugetlb for no_page_table()

no_page_table() is not yet used for hugetlb code paths. Prepare it for
that.

The major difference here is that hugetlb will return -EFAULT as long as
the page cache does not exist, even for VM_SHARED. See
hugetlb_follow_page_mask().

Pass "address" into no_page_table() too, as hugetlb will need it.

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 40 ++++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 69dae51f3eb1..89c1584d68f0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -501,19 +501,27 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)

#ifdef CONFIG_MMU
static struct page *no_page_table(struct vm_area_struct *vma,
- unsigned int flags)
+ unsigned int flags, unsigned long address)
{
+ if (!(flags & FOLL_DUMP))
+ return NULL;
+
/*
- * When core dumping an enormous anonymous area that nobody
- * has touched so far, we don't want to allocate unnecessary pages or
+ * When core dumping, we don't want to allocate unnecessary pages or
* page tables. Return error instead of NULL to skip handle_mm_fault,
* then get_dump_page() will return NULL to leave a hole in the dump.
* But we can only make this optimization where a hole would surely
* be zero-filled if handle_mm_fault() actually did handle it.
*/
- if ((flags & FOLL_DUMP) &&
- (vma_is_anonymous(vma) || !vma->vm_ops->fault))
+ if (is_vm_hugetlb_page(vma)) {
+ struct hstate *h = hstate_vma(vma);
+
+ if (!hugetlbfs_pagecache_present(h, vma, address))
+ return ERR_PTR(-EFAULT);
+ } else if ((vma_is_anonymous(vma) || !vma->vm_ops->fault)) {
return ERR_PTR(-EFAULT);
+ }
+
return NULL;
}

@@ -593,7 +601,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!ptep)
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
pte = ptep_get(ptep);
if (!pte_present(pte))
goto no_page;
@@ -685,7 +693,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
if (!pte_none(pte))
return NULL;
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
}

static struct page *follow_pmd_mask(struct vm_area_struct *vma,
@@ -701,9 +709,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
pmd = pmd_offset(pudp, address);
pmdval = pmdp_get_lockless(pmd);
if (pmd_none(pmdval))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
if (!pmd_present(pmdval))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
@@ -714,12 +722,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_present(*pmd))) {
spin_unlock(ptl);
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
}
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
@@ -750,7 +758,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,

pud = pud_offset(p4dp, address);
if (pud_none(*pud))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
@@ -758,7 +766,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
return page;
}
if (unlikely(pud_bad(*pud)))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

return follow_pmd_mask(vma, address, pud, flags, ctx);
}
@@ -772,10 +780,10 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,

p4d = p4d_offset(pgdp, address);
if (p4d_none(*p4d))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
BUILD_BUG_ON(p4d_huge(*p4d));
if (unlikely(p4d_bad(*p4d)))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

return follow_pud_mask(vma, address, p4d, flags, ctx);
}
@@ -825,7 +833,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
pgd = pgd_offset(mm, address);

if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

return follow_p4d_mask(vma, address, pgd, flags, ctx);
}
--
2.41.0

2023-11-16 01:30:07

by Peter Xu

Subject: [PATCH RFC 10/12] mm/gup: Handle huge pmd for follow_pmd_mask()

Replace pmd_trans_huge() with pmd_thp_or_huge() so that pmd_huge() is also
covered wherever it is enabled.

FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not hugetlb.

Since follow_trans_huge_pmd() can now handle hugetlb pages, rename it to
follow_huge_pmd() to match what it does.

While at it, move the ctx->page_mask setup into follow_huge_pmd(), and only
set it when the page is valid. It was not a bug to set it before even if
GUP failed (page==NULL), because follow_page_mask() callers always ignore
page_mask in that case. But doing so makes the code cleaner.

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 12 ++++++------
mm/huge_memory.c | 19 ++++++++++---------
mm/internal.h | 6 +++---
3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 55a2ae55f00f..7c210206470f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -713,31 +713,31 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
spin_unlock(ptl);
return page;
}
- if (likely(!pmd_trans_huge(pmdval)))
+ if (likely(!pmd_thp_or_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
return no_page_table(vma, flags, address);

ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_present(*pmd))) {
+ pmdval = *pmd;
+ if (unlikely(!pmd_present(pmdval))) {
spin_unlock(ptl);
return no_page_table(vma, flags, address);
}
- if (unlikely(!pmd_trans_huge(*pmd))) {
+ if (unlikely(!pmd_thp_or_huge(pmdval))) {
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
- if (flags & FOLL_SPLIT_PMD) {
+ if (pmd_trans_huge(pmdval) && (flags & FOLL_SPLIT_PMD)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
/* If pmd was left empty, stuff a page table in there quickly */
return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
- page = follow_trans_huge_pmd(vma, address, pmd, flags);
+ page = follow_huge_pmd(vma, address, pmd, flags, ctx);
spin_unlock(ptl);
- ctx->page_mask = HPAGE_PMD_NR - 1;
return page;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6748ef5f3fd9..43fb81218c5e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1486,32 +1486,32 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
return !userfaultfd_huge_pmd_wp(vma, pmd);
}

-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
- unsigned long addr,
- pmd_t *pmd,
- unsigned int flags)
+struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned int flags,
+ struct follow_page_context *ctx)
{
struct mm_struct *mm = vma->vm_mm;
+ pmd_t pmdval = *pmd;
struct page *page;
int ret;

assert_spin_locked(pmd_lockptr(mm, pmd));

- page = pmd_page(*pmd);
+ page = pmd_page(pmdval);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

if ((flags & FOLL_WRITE) &&
- !can_follow_write_pmd(*pmd, page, vma, flags))
+ !can_follow_write_pmd(pmdval, page, vma, flags))
return NULL;

/* Avoid dumping huge zero page */
- if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
+ if ((flags & FOLL_DUMP) && is_huge_zero_pmd(pmdval))
return ERR_PTR(-EFAULT);

if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
return NULL;

- if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
+ if (!pmd_write(pmdval) && gup_must_unshare(vma, flags, page))
return ERR_PTR(-EMLINK);

VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
@@ -1521,10 +1521,11 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
if (ret)
return ERR_PTR(ret);

- if (flags & FOLL_TOUCH)
+ if (pmd_trans_huge(pmdval) && (flags & FOLL_TOUCH))
touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);

page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+ ctx->page_mask = HPAGE_PMD_NR - 1;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);

return page;
diff --git a/mm/internal.h b/mm/internal.h
index 8450562744cf..bf0dc896c274 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1007,9 +1007,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags);
/*
* mm/huge_memory.c
*/
-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmd,
- unsigned int flags);
+struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned int flags,
+ struct follow_page_context *ctx);

/*
* mm/mmap.c
--
2.41.0

2023-11-16 01:30:11

by Peter Xu

Subject: [PATCH RFC 11/12] mm/gup: Handle hugepd for follow_page()

Hugepd is only used in PowerPC's hugetlbfs. follow_page_mask() used to
leverage the hugetlb APIs to access hugepd entries. Teach
follow_page_mask() itself how to handle hugepd.

With the previous refactoring of the fast-gup gup_huge_pd(), most of the
code can be reused. Since follow_page() only ever fetches one page, setting
the end to "address + PAGE_SIZE" should suffice.

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 69 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7c210206470f..e635278f65f9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -25,6 +25,11 @@

#include "internal.h"

+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx);
+
static inline void sanity_check_pinned_pages(struct page **pages,
unsigned long npages)
{
@@ -713,6 +718,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
spin_unlock(ptl);
return page;
}
+ if (unlikely(is_hugepd(__hugepd(pmd_val(pmdval)))))
+ return follow_hugepd(vma, __hugepd(pmd_val(pmdval)),
+ address, PMD_SHIFT, flags, ctx);
if (likely(!pmd_thp_or_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

@@ -764,6 +772,10 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
if (unlikely(pud_bad(pud)))
return no_page_table(vma, flags, address);

+ if (unlikely(is_hugepd(__hugepd(pud_val(pud)))))
+ return follow_hugepd(vma, __hugepd(pud_val(pud)),
+ address, PUD_SHIFT, flags, ctx);
+
return follow_pmd_mask(vma, address, pudp, flags, ctx);
}

@@ -772,15 +784,19 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
unsigned int flags,
struct follow_page_context *ctx)
{
- p4d_t *p4d;
+ p4d_t *p4d, p4dval;

p4d = p4d_offset(pgdp, address);
- if (p4d_none(*p4d))
- return no_page_table(vma, flags, address);
- BUILD_BUG_ON(p4d_huge(*p4d));
- if (unlikely(p4d_bad(*p4d)))
+ p4dval = *p4d;
+ BUILD_BUG_ON(p4d_huge(p4dval));
+
+ if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
return no_page_table(vma, flags, address);

+ if (unlikely(is_hugepd(__hugepd(p4d_val(p4dval)))))
+ return follow_hugepd(vma, __hugepd(p4d_val(p4dval)),
+ address, P4D_SHIFT, flags, ctx);
+
return follow_pud_mask(vma, address, p4d, flags, ctx);
}

@@ -812,7 +828,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
struct follow_page_context *ctx)
{
- pgd_t *pgd;
+ pgd_t *pgd, pgdval;
struct mm_struct *mm = vma->vm_mm;

ctx->page_mask = 0;
@@ -827,11 +843,17 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
&ctx->page_mask);

pgd = pgd_offset(mm, address);
+ pgdval = *pgd;

if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- return no_page_table(vma, flags, address);
+ page = no_page_table(vma, flags, address);
+ else if (unlikely(is_hugepd(__hugepd(pgd_val(pgdval)))))
+ page = follow_hugepd(vma, __hugepd(pgd_val(pgdval)),
+ address, PGDIR_SHIFT, flags, ctx);
+ else
+ page = follow_p4d_mask(vma, address, pgd, flags, ctx);

- return follow_p4d_mask(vma, address, pgd, flags, ctx);
+ return page;
}

struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
@@ -2850,6 +2872,37 @@ static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,

return 1;
}
+
+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx)
+{
+ struct page *page;
+ struct hstate *h;
+ spinlock_t *ptl;
+ int nr = 0, ret;
+ pte_t *ptep;
+
+ /* Only hugetlb supports hugepd */
+ if (WARN_ON_ONCE(!is_vm_hugetlb_page(vma)))
+ return ERR_PTR(-EFAULT);
+
+ h = hstate_vma(vma);
+ ptep = hugepte_offset(hugepd, addr, pdshift);
+ ptl = huge_pte_lock(h, vma->vm_mm, ptep);
+ ret = gup_huge_pd(hugepd, addr, pdshift, addr + PAGE_SIZE,
+ flags, &page, &nr);
+ spin_unlock(ptl);
+
+ if (ret) {
+ WARN_ON_ONCE(nr != 1);
+ ctx->page_mask = (1U << huge_page_order(h)) - 1;
+ return page;
+ }
+
+ return NULL;
+}
#else
static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
unsigned int pdshift, unsigned long end, unsigned int flags,
@@ -2857,6 +2910,14 @@ static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
{
return 0;
}
+
+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx)
+{
+ return NULL;
+}
#endif /* CONFIG_ARCH_HAS_HUGEPD */

static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
--
2.41.0

2023-11-16 01:30:48

by Peter Xu

Subject: [PATCH RFC 03/12] mm: Export HPAGE_PXD_* macros even if !THP

These macros can be helpful when we plan to merge hugetlb code into generic
code. Move them out and define them even if !THP.

We actually already defined HPAGE_PMD_NR for other reasons even if !THP.
Reorganize these macros.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/huge_mm.h | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..ec463410aecc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,17 +64,19 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;

-#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
-#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

#define HPAGE_PUD_SHIFT PUD_SHIFT
#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE

extern unsigned long transparent_hugepage_flags;

@@ -254,13 +256,6 @@ static inline bool thp_migration_supported(void)
}

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
-
-#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

static inline bool folio_test_pmd_mappable(struct folio *folio)
{
--
2.41.0

2023-11-16 01:30:51

by Peter Xu

Subject: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

Hugepd format is only used in PowerPC with hugetlbfs. In commit
a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings"), we added a check to fail gup-fast if there's
potential risk of violating GUP over writeback file systems. That should
never apply to hugepd.

Drop that check, not only because it'll never be true for hugepd, but also
because it paves the way for reusing the function outside fast-gup.

Cc: Lorenzo Stoakes <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 0e00204761d2..424d45e1afb3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2816,11 +2816,6 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}

- if (!folio_fast_pin_allowed(folio, flags)) {
- gup_put_folio(folio, refs, flags);
- return 0;
- }
-
if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
return 0;
--
2.41.0

2023-11-16 01:30:56

by Peter Xu

Subject: [PATCH RFC 12/12] mm/gup: Merge hugetlb into generic mm code

Now follow_page() is ready to handle hugetlb pages in whatever form, on
all architectures. Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference in how the loops run when processing GUP
over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV (mostly
its Svnapot extension on 64K huge pages): each loop of __get_user_pages()
will resolve one pgtable entry with the patch applied, rather than relying
on the size of the hugetlb hstate, which may cover multiple entries in one
loop.
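
For reference, the generic loop advances based on ctx.page_mask, roughly
like below (simplified from __get_user_pages(), with error handling and the
filling of pages[] omitted); with this series, page_mask describes a single
pgtable entry rather than the whole hstate:

	page = follow_page_mask(vma, start, foll_flags, &ctx);
	/* number of base pages this entry covers, starting from 'start' */
	page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
	if (page_increm > nr_pages)
		page_increm = nr_pages;
	i += page_increm;
	start += page_increm * PAGE_SIZE;
	nr_pages -= page_increm;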

However, the performance difference should hopefully not be a major
concern, considering that GUP only recently got 57edfcfd3419 ("mm/gup:
accelerate thp gup even for "pages != NULL""), and that was not part of a
performance analysis but a side dish. If performance does become a
concern, we can consider handling CONT_PTE in follow_page(), for example.

Before that is justified as necessary, keep everything clean and simple.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 7 ----
mm/gup.c | 15 +++------
mm/hugetlb.c | 71 -----------------------------------------
3 files changed, 5 insertions(+), 88 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bb07279b8991..87630a185acf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -332,13 +332,6 @@ static inline void hugetlb_zap_end(
{
}

-static inline struct page *hugetlb_follow_page_mask(
- struct vm_area_struct *vma, unsigned long address, unsigned int flags,
- unsigned int *page_mask)
-{
- BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
-}
-
static inline int copy_hugetlb_page_range(struct mm_struct *dst,
struct mm_struct *src,
struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index e635278f65f9..23fcac5aa3db 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -830,18 +830,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
{
pgd_t *pgd, pgdval;
struct mm_struct *mm = vma->vm_mm;
+ struct page *page;

- ctx->page_mask = 0;
-
- /*
- * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
- * special hugetlb page table walking code. This eliminates the
- * need to check for hugetlb entries in the general walking code.
- */
- if (is_vm_hugetlb_page(vma))
- return hugetlb_follow_page_mask(vma, address, flags,
- &ctx->page_mask);
+ vma_pgtable_walk_begin(vma);

+ ctx->page_mask = 0;
pgd = pgd_offset(mm, address);
pgdval = *pgd;

@@ -853,6 +846,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
else
page = follow_p4d_mask(vma, address, pgd, flags, ctx);

+ vma_pgtable_walk_end(vma);
+
return page;
}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 29705e5c6f40..3013122a739f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6783,77 +6783,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
}
#endif /* CONFIG_USERFAULTFD */

-struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- unsigned int *page_mask)
-{
- struct hstate *h = hstate_vma(vma);
- struct mm_struct *mm = vma->vm_mm;
- unsigned long haddr = address & huge_page_mask(h);
- struct page *page = NULL;
- spinlock_t *ptl;
- pte_t *pte, entry;
- int ret;
-
- hugetlb_vma_lock_read(vma);
- pte = hugetlb_walk(vma, haddr, huge_page_size(h));
- if (!pte)
- goto out_unlock;
-
- ptl = huge_pte_lock(h, mm, pte);
- entry = huge_ptep_get(pte);
- if (pte_present(entry)) {
- page = pte_page(entry);
-
- if (!huge_pte_write(entry)) {
- if (flags & FOLL_WRITE) {
- page = NULL;
- goto out;
- }
-
- if (gup_must_unshare(vma, flags, page)) {
- /* Tell the caller to do unsharing */
- page = ERR_PTR(-EMLINK);
- goto out;
- }
- }
-
- page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
-
- /*
- * Note that page may be a sub-page, and with vmemmap
- * optimizations the page struct may be read only.
- * try_grab_page() will increase the ref count on the
- * head page, so this will be OK.
- *
- * try_grab_page() should always be able to get the page here,
- * because we hold the ptl lock and have verified pte_present().
- */
- ret = try_grab_page(page, flags);
-
- if (WARN_ON_ONCE(ret)) {
- page = ERR_PTR(ret);
- goto out;
- }
-
- *page_mask = (1U << huge_page_order(h)) - 1;
- }
-out:
- spin_unlock(ptl);
-out_unlock:
- hugetlb_vma_unlock_read(vma);
-
- /*
- * Fixup retval for dump requests: if pagecache doesn't exist,
- * don't try to allocate a new page but just skip it.
- */
- if (!page && (flags & FOLL_DUMP) &&
- !hugetlbfs_pagecache_present(h, vma, address))
- page = ERR_PTR(-EFAULT);
-
- return page;
-}
-
long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
--
2.41.0

2023-11-16 14:52:18

by Matthew Wilcox

Subject: Re: [PATCH RFC 07/12] mm/gup: Refactor record_subpages() to find 1st small page

On Wed, Nov 15, 2023 at 08:29:03PM -0500, Peter Xu wrote:
> All the fast-gup functions take a tail page to operate, always need to do
> page mask calculations before feeding that into record_subpages().
>
> Merge that logic into record_subpages(), so that we always take a head
> page, and leave the rest calculation to record_subpages().

This is a bit fragile. You're assuming that pmd_page() always returns
a head page, and that's only true today because I looked at the work
required vs the reward and decided to cap the large folio size at PMD
size. If we allowed 2*PMD_SIZE (eg 4MB on x86), pmd_page() would not
return a head page. There is a small amount of demand for > PMD size
large folio support, so I suspect we will want to do this eventually.
I'm not particularly trying to do these conversions, but it would be
good to not add more assumptions that pmd_page() returns a head page.

> +static int record_subpages(struct page *head, unsigned long sz,
> + unsigned long addr, unsigned long end,
> + struct page **pages)

> @@ -2870,8 +2873,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> pages, nr);
> }
>
> - page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
> - refs = record_subpages(page, addr, end, pages + *nr);
> + page = pmd_page(orig);
> + refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
>
> folio = try_grab_folio(page, refs, flags);
> if (!folio)

2023-11-16 19:41:16

by Peter Xu

Subject: Re: [PATCH RFC 07/12] mm/gup: Refactor record_subpages() to find 1st small page

On Thu, Nov 16, 2023 at 02:51:52PM +0000, Matthew Wilcox wrote:
> On Wed, Nov 15, 2023 at 08:29:03PM -0500, Peter Xu wrote:
> > All the fast-gup functions take a tail page to operate, always need to do
> > page mask calculations before feeding that into record_subpages().
> >
> > Merge that logic into record_subpages(), so that we always take a head
> > page, and leave the rest calculation to record_subpages().
>
> This is a bit fragile. You're assuming that pmd_page() always returns
> a head page, and that's only true today because I looked at the work
> required vs the reward and decided to cap the large folio size at PMD
> size. If we allowed 2*PMD_SIZE (eg 4MB on x86), pmd_page() would not
> return a head page. There is a small amount of demand for > PMD size
> large folio support, so I suspect we will want to do this eventually.
> I'm not particularly trying to do these conversions, but it would be
> good to not add more assumptions that pmd_page() returns a head page.

Makes sense. Actually, IIUC arm64's CONT_PMD pages can already make that
not a head page.

The code should still be correct, though. AFAIU what I need to do then is
renaming the first field of record_subpages() (s/head/base/) in the next
version, or just keep it the old one ("page"), then update the commit
message.

Thanks,

--
Peter Xu

2023-11-16 19:42:06

by Matthew Wilcox

Subject: Re: [PATCH RFC 07/12] mm/gup: Refactor record_subpages() to find 1st small page

On Thu, Nov 16, 2023 at 02:40:21PM -0500, Peter Xu wrote:
> On Thu, Nov 16, 2023 at 02:51:52PM +0000, Matthew Wilcox wrote:
> > On Wed, Nov 15, 2023 at 08:29:03PM -0500, Peter Xu wrote:
> > > All the fast-gup functions take a tail page to operate, always need to do
> > > page mask calculations before feeding that into record_subpages().
> > >
> > > Merge that logic into record_subpages(), so that we always take a head
> > > page, and leave the rest calculation to record_subpages().
> >
> > This is a bit fragile. You're assuming that pmd_page() always returns
> > a head page, and that's only true today because I looked at the work
> > required vs the reward and decided to cap the large folio size at PMD
> > size. If we allowed 2*PMD_SIZE (eg 4MB on x86), pmd_page() would not
> > return a head page. There is a small amount of demand for > PMD size
> > large folio support, so I suspect we will want to do this eventually.
> > I'm not particularly trying to do these conversions, but it would be
> > good to not add more assumptions that pmd_page() returns a head page.
>
> Makes sense. Actually, IIUC arm64's CONT_PMD pages can already make that
> not a head page.
>
> The code should still be correct, though. AFAIU what I need to do then is
> renaming the first field of record_subpages() (s/head/base/) in the next
> version, or just keep it the old one ("page"), then update the commit
> message.

Yeah, I think just leave it as 'page' would be best. Thanks.

2023-11-20 08:26:55

by Christoph Hellwig

Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Wed, Nov 15, 2023 at 08:29:02PM -0500, Peter Xu wrote:
> Hugepd format is only used in PowerPC with hugetlbfs. In commit
> a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings"), we added a check to fail gup-fast if there's
> potential risk of violating GUP over writeback file systems. That should
> never apply to hugepd.
>
> Drop that check, not only because it'll never be true for hugepd, but also
> it paves way for reusing the function outside fast-gup.

What prevents us from ever using hugepd with file mappings? I think
it would naturally fit in with how large folios for the pagecache work.

So keeping this check and generalizing it seems like the better idea to
me.

2023-11-21 16:00:01

by Peter Xu

Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Mon, Nov 20, 2023 at 12:26:24AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 15, 2023 at 08:29:02PM -0500, Peter Xu wrote:
> > Hugepd format is only used in PowerPC with hugetlbfs. In commit
> > a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> > file-backed mappings"), we added a check to fail gup-fast if there's
> > potential risk of violating GUP over writeback file systems. That should
> > never apply to hugepd.
> >
> > Drop that check, not only because it'll never be true for hugepd, but also
> > it paves way for reusing the function outside fast-gup.
>
> What prevents us from ever using hugepd with file mappings? I think
> it would naturally fit in with how large folios for the pagecache work.
>
> So keeping this check and generalizing it seems like the better idea to
> me.

But then it means we're still keeping that dead code for fast-gup even if
we know that fact.. Or do we have a plan to add that support very soon, so
this code will be destined to add back?

The other option is I can always add a comment above gup_huge_pd()
explaining this special bit, so that when someone is adding hugepd support
to file large folios we'll hopefully not forget it? But then that
generalization work will only happen when the code will be needed.

Thanks,

--
Peter Xu

2023-11-22 08:00:47

by Christoph Hellwig

Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
> > What prevents us from ever using hugepd with file mappings? I think
> > it would naturally fit in with how large folios for the pagecache work.
> >
> > So keeping this check and generalizing it seems like the better idea to
> > me.
>
> But then it means we're still keeping that dead code for fast-gup even if
> we know that fact.. Or do we have a plan to add that support very soon, so
> this code will be destined to add back?

The question wasn't meant to be rhetorical - we support arbitrary power of
two sized folios for the pagecache, so what prevents us from using hugepd
with them right now?

> The other option is I can always add a comment above gup_huge_pd()
> explaining this special bit, so that when someone is adding hugepd support
> to file large folios we'll hopefully not forget it? But then that
> generalization work will only happen when the code will be needed.

If dropping the check is the right thing for now (and I think the ppc
maintainers and willy as the large folio guy might have a more useful
opinions than I do), leaving a comment in would be very useful.

2023-11-22 14:52:27

by Jason Gunthorpe

Subject: Re: [PATCH RFC 00/12] mm/gup: Unify hugetlb, part 2

On Wed, Nov 15, 2023 at 08:28:56PM -0500, Peter Xu wrote:
> This patchset is in RFC stage. It's mostly because it is only yet tested on
> x86_64 in a VM. Not even compile tested on PPC or any other archs, it
> means at least the hugepd patch (patch 11) is mostly untested, or even not
> compile tested. Before doing that, I'd like to collect any information
> from high level.
>
> If anyone would like to provide any testing either over hugepd or CONT_PMD
> / CONT_PTE on ARM (before I reach there..), or RISCV over 64K Svnapot,
> that'll be very much appreciated. I'm copying PPC, ARM, RISCV list for
> that. It can be as simple as "./run_vmtests.sh -t gup_test -a" for now,
> making sure hugetlb pages can be allocated along the way; the non-hugetlb
> gup tests will guaranteed to be covered more or less, I suppose.
>
> In summary, this is a continuous work for previous series:
>
> https://lore.kernel.org/all/[email protected]
>
> And this more or less is my current take to move one more small step
> towards merging hugetlb code into generic mm code, as much as we can.
>
> That part-1 series dropped follow_hugetlb_page(). The plan of this one is
> to further drops hugetlb_follow_page_mask(). The hugetlb GUP will use the
> same code path for generic mm after whole set applied.
>
> It means the generic code will need to at least understand hugepd, and
> that's already done like so in fast-gup. Fortunately it seems that's the
> only major thing I need to teach GUP to share the common path for now
> besides normal huge PxD entries. Non-gup can be more challenging, but
> that's a question for later.

This is great, I looked through quickly and didn't have any profound remarks

Thanks,
Jason

2023-11-22 15:22:48

by Peter Xu

Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
> > > What prevents us from ever using hugepd with file mappings? I think
> > > it would naturally fit in with how large folios for the pagecache work.
> > >
> > > So keeping this check and generalizing it seems like the better idea to
> > > me.
> >
> > But then it means we're still keeping that dead code for fast-gup even if
> > we know that fact.. Or do we have a plan to add that support very soon, so
> > this code will be destined to add back?
>
> The question wasn't meant to be rhetorical - we support arbitrary power of
> two sized folios for the pagecache, so what prevents us from using hugepd
> with them right now?

Ah, didn't catch that point previously. Hugepd is just not used outside
hugetlb right now, afaiu.

For example, __hugepte_alloc() (and that's the only caller of
hugepd_populate()) should be the function that allocates a hugepd (ppc
only), and it's only called from huge_pte_alloc(), which is part of the
current arch-specific hugetlb API.

And generic mm paths don't normally have hugepd handling, afaics. For
example, page_vma_mapped_walk() doesn't handle hugepd at all unless in
hugetlb specific path.

There're actually (only) two generic mm paths that can handle hugepd,
namely:

- fast-gup
- walk_page_*() apis (aka, __walk_page_range())

For fast-gup I think the hugepd code is in use, however for the
walk_page_* APIs the hugepd code shouldn't be reached, iiuc, as we have the
hugetlb specific handling (walk_hugetlb_range()); so anything within
walk_pgd_range() that hits a hugepd looks like dead code to me (but note
that this "dead code" is good stuff to me, if one would like to merge
hugetlb into generic mm).
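
(For reference, this is roughly the existing fast-gup hook I mean, trimmed
down from gup_pmd_range() and quoted from memory, so details may differ
slightly; the pud/pgd levels follow the same pattern:)

	if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) || pmd_devmap(pmd))) {
		if (!gup_huge_pmd(pmd, pmdp, addr, next, flags, pages, nr))
			return 0;
	} else if (unlikely(is_hugepd(__hugepd(pmd_val(pmd))))) {
		/* hugetlbfs pmd format differs from the THP pmd format */
		if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
				 PMD_SHIFT, next, flags, pages, nr))
			return 0;
	} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
		return 0;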

This series tries to add slow gup to that list too, making it the 3rd one
to support it. I plan to look more into this area (e.g.,
__walk_page_range() can be another good candidate soon). I'm not sure
whether we should teach the whole of mm to understand hugepd yet, but slow
gup and __walk_page_range() do look like good candidates for removing the
hugetlb specific code paths already - slow-gup adds about as many LOCs as
it deletes (which is what this series shows), and __walk_page_range() can
logically remove some code, with no harm that I yet see.

Indeed above are based on only my code observations, so I'll be more than
happy to be corrected otherwise, as early as possible.

>
> > The other option is I can always add a comment above gup_huge_pd()
> > explaining this special bit, so that when someone is adding hugepd support
> > to file large folios we'll hopefully not forget it? But then that
> > generalization work will only happen when the code will be needed.
>
> If dropping the check is the right thing for now (and I think the ppc
> maintainers and willy as the large folio guy might have a more useful
> opinions than I do), leaving a comment in would be very useful.

Willy is in the loop, and I just noticed I didn't actually copy the ppc
list, even though I planned to.. I am adding the list
([email protected]) to this reply. I'll remember to do so when there's a
new version.

The other reason I feel like hugepd may or may not be further developed for
new features like large folio is that I saw Power9 started to shift to
radix pgtables, and afaics hugepd is only supported in hash tables
(hugepd_ok()). But again, I confess I know nothing about Power at all.

Thanks,

--
Peter Xu

2023-11-23 07:24:14

by Christoph Hellwig

Subject: Re: [PATCH RFC 03/12] mm: Export HPAGE_PXD_* macros even if !THP

No way to export macros :)

I'd say define, but other might have better ideas.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2023-11-23 07:24:48

by Christoph Hellwig

Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Wed, Nov 22, 2023 at 10:22:11AM -0500, Peter Xu wrote:
> The other reason I feel like hugepd may or may not be further developed for
> new features like large folio is that I saw Power9 started to shift to
> radix pgtables, and afaics hugepd is only supported in hash tables
> (hugepd_ok()). But again, I confess I know nothing about Power at all.

That alone sounds like a good reason to not bother. So unless more
qualified people have a different opinion I'm fine with this patch
as long as you leave a comment in place, and amend the commit message
with some of the very useful information from your mail.

2023-11-23 07:25:45

by Christoph Hellwig

Subject: Re: [PATCH RFC 05/12] mm/gup: Fix follow_devmap_p[mu]d() to return even if NULL

On Wed, Nov 15, 2023 at 08:29:01PM -0500, Peter Xu wrote:
> This seems to be a bug not by any report but by code observations.
>
> When GUP sees a devpmd or devpud, it should return whatever value returned
> from follow_devmap_p[mu]d(). If page==NULL returned, it means a fault is
> probably required. Skipping return the NULL should allow the code to fall
> through, which can cause unexpected behavior.
>
> It was there at least before the follow page rework (080dbb618b) in 2017,
> so 6 years. Not yet digging for a Fixes, assuming it can hardly trigger
> even if the logical bug does exist.
>
> Signed-off-by: Peter Xu <[email protected]>

I'd still add a Fixes tag and send it off to Linus for 6.7-rc instead
of letting it linger.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2023-11-23 07:26:26

by Christoph Hellwig

Subject: Re: [PATCH RFC 08/12] mm/gup: Handle hugetlb for no_page_table()

> static struct page *no_page_table(struct vm_area_struct *vma,
> - unsigned int flags)
> + unsigned int flags, unsigned long address)

More reformatting that makes the code less readable :(

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2023-11-23 07:28:48

by Christoph Hellwig

Subject: Re: [PATCH RFC 09/12] mm/gup: Handle huge pud for follow_pud_mask()

> We need to export "struct follow_page_context" along the way, so that
> huge_memory.c can understand it.

Again, thankfully not actually exported, just made global.

> @@ -751,24 +746,25 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
> unsigned int flags,
> struct follow_page_context *ctx)
> {
> - pud_t *pud;
> + pud_t *pudp, pud;

This adding of pud, while useful, seems mostly unrelated and clutters
the patch up quite a bit. If you have to respin this anyway it might
be worth splitting it out into a little prep patch.

2023-11-23 07:30:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH RFC 12/12] mm/gup: Merge hugetlb into generic mm code

Nit, but the subject makes it sound like hugetlbfs is gone to me.
What about something like:

mm/gup: handle hugetlb in the generic follow_page_mask code

?

2023-11-23 09:53:52

by Mike Rapoport

Subject: Re: [PATCH RFC 03/12] mm: Export HPAGE_PXD_* macros even if !THP

On Wed, Nov 22, 2023 at 11:23:57PM -0800, Christoph Hellwig wrote:
> No way to export macros :)
>
> I'd say define, but other might have better ideas.

Make HPAGE_PXD_* macros visible even if !THP

> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>
>

--
Sincerely yours,
Mike.

2023-11-23 15:29:21

by Peter Xu

Subject: Re: [PATCH RFC 03/12] mm: Export HPAGE_PXD_* macros even if !THP

On Thu, Nov 23, 2023 at 11:53:04AM +0200, Mike Rapoport wrote:
> On Wed, Nov 22, 2023 at 11:23:57PM -0800, Christoph Hellwig wrote:
> > No way to export macros :)
> >
> > I'd say define, but other might have better ideas.
>
> Make HPAGE_PXD_* macros visible even if !THP

Sounds good, thanks both!

Besides, I do plan to introduce a new macro in the next version to mean
"THP || HUGETLB", so as to put PxD code segments into it and not compile
when unnecessary (!THP && !HUGETLB).

Currently what I had is:

config PGTABLE_HAS_HUGE_LEAVES
def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE

I didn't use something like CONFIG_HUGE_PAGE because it's too close to
HUGETLB_PAGE, even if it is generic and short enough. Please speak up if
there are any early comments on that, either on the name or the format.
For example, I can also define it in e.g. mm/internal.h, instead of as a
config entry.
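
Just to make the intention concrete, here is a hypothetical sketch of how
such a guard could be used, reusing follow_huge_pud() from this series
purely as an example; nothing here is final:

/* hypothetical sketch only, not part of this series */
#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long addr,
			     pud_t *pudp, int flags,
			     struct follow_page_context *ctx);
#else /* !THP && !HUGETLB_PAGE */
static inline struct page *follow_huge_pud(struct vm_area_struct *vma,
					   unsigned long addr, pud_t *pudp,
					   int flags,
					   struct follow_page_context *ctx)
{
	/* no huge leaves possible, so nothing to follow */
	return NULL;
}
#endif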

--
Peter Xu

2023-11-23 15:48:14

by Matthew Wilcox

Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
> > The other option is I can always add a comment above gup_huge_pd()
> > explaining this special bit, so that when someone is adding hugepd support
> > to file large folios we'll hopefully not forget it? But then that
> > generalization work will only happen when the code will be needed.
>
> If dropping the check is the right thing for now (and I think the ppc
> maintainers and willy as the large folio guy might have a more useful
> opinions than I do), leaving a comment in would be very useful.

It looks like ARM (in the person of Ryan) are going to add support for
something equivalent to hugepd. Insofar as I understand hugepd, anyway.
I've done my best to set up the generic code so that the arch code can
use whatever size TLB entries it supports. I haven't been looking to the
hugetlb code as a reference for it, since it can assume natural alignment
and generic THP/large folio must be able to handle arbitrary alignment.

If powerpc want to join in on the fun, they're more than welcome, but I
get the feeling that investment in Linux-on-PPC is somewhat smaller than
Linux-on-ARM these days. Even if we restrict that to the server space.

2023-11-23 16:10:26

by Peter Xu

Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Wed, Nov 22, 2023 at 11:21:50PM -0800, Christoph Hellwig wrote:
> That alone sounds like a good reason to not bother. So unless more
> qualified people have a different opinion I'm fine with this patch
> as long as you leave a comment in place, and amend the commit message
> with some of the very useful information from your mail.

Will do, thanks.

This is what I will squash into the same patch in the new version, as the
current plan:

+/*
+ * NOTE: currently hugepd is only used in hugetlbfs file systems on Power,
+ * which does not have issues with folio writeback against GUP updates.
+ * When hugepd is extended to support non-hugetlbfs or even anonymous
+ * memory, we'll need to do extra checks here, as we do for most other
+ * folios. See writable_file_mapping_allowed() and folio_fast_pin_allowed()
+ * for more information.
+ */
static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
                       unsigned int pdshift, unsigned long end, unsigned int flags,
                       struct page **pages, int *nr)

--
Peter Xu

2023-11-23 16:19:51

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 09/12] mm/gup: Handle huge pud for follow_pud_mask()

On Wed, Nov 22, 2023 at 11:28:31PM -0800, Christoph Hellwig wrote:
> > We need to export "struct follow_page_context" along the way, so that
> > huge_memory.c can understand it.
>
> Again, thankfully not actually exported, just made global.

In the new version that shouldn't be needed, because I just noticed
huge_memory.c is only compiled with THP=on. Logically it may start to make
sense at some point to have thp.c for THP=on, and huge_memory.c for THP ||
HUGETLB, but I'd rather leave that to be done separately even so.

In short, for this one I'll drop that sentence from the commit message, as
I'll leave "struct follow_page_context" alone.

>
> > @@ -751,24 +746,25 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
> > unsigned int flags,
> > struct follow_page_context *ctx)
> > {
> > - pud_t *pud;
> > + pud_t *pudp, pud;
>
> This addition of pud, while useful, seems mostly unrelated and clutters
> the patch up quite a bit. If you have to respin this anyway it might
> be worth splitting it out into a little prep patch.

I can do this. I'll also try to do the same for the rest patches, if
applicable.

Thanks,

--
Peter Xu

2023-11-23 16:22:13

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 12/12] mm/gup: Merge hugetlb into generic mm code

On Wed, Nov 22, 2023 at 11:29:58PM -0800, Christoph Hellwig wrote:
> Nit, but the subject makes it sound like hugetlbfs is gone to me.
> What about something like:
>
> mm/gup: handle hugetlb in the generic follow_page_mask code
>
> ?

Sure thing.

--
Peter Xu

2023-11-23 17:23:08

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Thu, Nov 23, 2023 at 03:47:49PM +0000, Matthew Wilcox wrote:
> It looks like ARM (in the person of Ryan) are going to add support for
> something equivalent to hugepd.

If it's about arm's cont_pte, then it looks ideal because this series
didn't yet touch cont_pte, assuming it'll just work. From that aspect, his
work may help mine, and no immediately collapsing either.

There can be a slight performance difference which I need to measure for
arm's cont_pte already for hugetlb, but I didn't worry much about that;
quoting my commit message in the last patch:

There may be a slight difference of how the loops run when processing
GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV
(mostly its Svnapot extension on 64K huge pages): each loop of
__get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of hugetlb hstate, the latter
may cover multiple entries in one loop.

However, the performance difference should hopefully not be a major
concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup:
accelerate thp gup even for "pages != NULL""), and that's not part of a
performance analysis but a side dish. If the performance will be a
concern, we can consider handle CONT_PTE in follow_page(), for example.

So IMHO it can be slightly different compared to e.g. page faults, because
each fault is still pretty slow as a whole if there's one fault for each
small pte (of a large folio / cont_pte), while the loop in GUP is still
relatively tight and short compared to a fault. I'd boldly guess there are
more low hanging fruits for large folios outside the GUP areas.

In all cases, it'll be interesting to know if Ryan has worked on cont_pte
support for gup on large folios, and whether there's any performance number
to share. It's definitely good news to me because it means Ryan's work can
also then benefit hugetlb if this series will be merged, I just don't know
how much difference there will be.

Thanks,

--
Peter Xu

2023-11-23 17:59:46

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 05/12] mm/gup: Fix follow_devmap_p[mu]d() to return even if NULL

On Wed, Nov 22, 2023 at 11:25:28PM -0800, Christoph Hellwig wrote:
> On Wed, Nov 15, 2023 at 08:29:01PM -0500, Peter Xu wrote:
> > This seems to be a bug not by any report but by code observations.
> >
> > When GUP sees a devpmd or devpud, it should return whatever value returned
> > from follow_devmap_p[mu]d(). If page==NULL returned, it means a fault is
> > probably required. Skipping return the NULL should allow the code to fall
> > through, which can cause unexpected behavior.
> >
> > It was there at least before the follow page rework (080dbb618b) in 2017,
> > so 6 years. Not yet digging for a Fixes, assuming it can hardly trigger
> > even if the logical bug does exist.
> >
> > Signed-off-by: Peter Xu <[email protected]>
>
> I'd still add a fixes tag and send if off to Linux for 6.7-rc instead
> of letting it linger.

Will do.

>
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>

On second thought, "return page" is correct for devmap, but it would be
better to always use no_page_table() for the page==NULL case when
follow_page() wants to return NULL. Afaict no_page_table() should
consistently return NULL for devmap anyway, so the end result should be
the same.

So I'll make it slightly different when reposting separately, please feel
free to have another look.
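
To be concrete, the pud path would end up shaped roughly like below (a
sketch only; the exact no_page_table() arguments may still change with the
rest of the series):

        if (pud_devmap(pud)) {
                ptl = pud_lock(mm, pudp);
                page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
                spin_unlock(ptl);
                if (page)
                        return page;
                /* NULL means a fault is probably needed; don't fall through */
                return no_page_table(vma, flags);
        }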

Thanks,

--
Peter Xu

2023-11-23 18:25:19

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing



Le 22/11/2023 à 16:22, Peter Xu a écrit :
> On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
>> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
>>>> What prevents us from ever using hugepd with file mappings? I think
>>>> it would naturally fit in with how large folios for the pagecache work.
>>>>
>>>> So keeping this check and generalizing it seems like the better idea to
>>>> me.
>>>
>>> But then it means we're still keeping that dead code for fast-gup even if
>>> we know that fact.. Or do we have a plan to add that support very soon, so
>>> this code will be destined to add back?
>>
>> The question wasn't meant rhetorically - we support arbitrary power of two
>> sized folios for the pagecache, what prevents us from using hugepd with
>> them right now?
>
> Ah, didn't catch that point previously. Hugepd is just not used outside
> hugetlb right now, afaiu.
>
> For example, __hugepte_alloc() (and that's the only one calls
> hugepd_populate()) should be the function to allocate a hugepd (ppc only),
> and it's only called in huge_pte_alloc(), which is part of the current
> arch-specific hugetlb api.
>
> And generic mm paths don't normally have hugepd handling, afaics. For
> example, page_vma_mapped_walk() doesn't handle hugepd at all unless in
> hugetlb specific path.
>
> There're actually (only) two generic mm paths that can handle hugepd,
> namely:
>
> - fast-gup
> - walk_page_*() apis (aka, __walk_page_range())
>
> For fast-gup I think the hugepd code is in use, however for walk_page_*
> apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
> handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
> a hugepd can be dead code to me (but note that this "dead code" is good
> stuff to me, if one would like to merge hugetlb instead into generic mm).

Not sure what you mean here. What do you mean by "dead code" ?
A hugepage directory can be plugged at any page level, from PGD to PMD.
So the following bit in walk_pgd_range() is valid and not dead:

if (is_hugepd(__hugepd(pgd_val(*pgd))))
err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);


>
> This series tries to add slow gup into that list too, so the 3rd one to
> support it. I plan to look more into this area (e.g., __walk_page_range()
> can be another good candidate soon). I'm not sure whether we should teach
> the whole mm to understand hugepd yet, but slow gup and __walk_page_range()
> does look like good candidates to already remove the hugetlb specific code
> paths - slow-gup has average ~add/~del LOCs (which this series does), and
> __walk_page_range() can remove some code logically, no harm I yet see.
>
> Indeed above are based on only my code observations, so I'll be more than
> happy to be corrected otherwise, as early as possible.
>
>>
>>> The other option is I can always add a comment above gup_huge_pd()
>>> explaining this special bit, so that when someone is adding hugepd support
>>> to file large folios we'll hopefully not forget it? But then that
>>> generalization work will only happen when the code will be needed.
>>
>> If dropping the check is the right thing for now (and I think the ppc
>> maintainers and willy as the large folio guy might have a more useful
>> opinions than I do), leaving a comment in would be very useful.
>
> Willy is in the loop, and I just notice I didn't really copy ppc list, even
> I planned to.. I am adding the list ([email protected]) into
> this reply. I'll remember to do so as long as there's a new version.
>
> The other reason I feel like hugepd may or may not be further developed for
> new features like large folio is that I saw Power9 started to shift to
> radix pgtables, and afaics hugepd is only supported in hash tables
> (hugepd_ok()). But again, I confess I know nothing about Power at all.
>
> Thanks,
>

2023-11-23 19:11:50

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On 23/11/2023 17:22, Peter Xu wrote:
> On Thu, Nov 23, 2023 at 03:47:49PM +0000, Matthew Wilcox wrote:
>> It looks like ARM (in the person of Ryan) are going to add support for
>> something equivalent to hugepd.
>
> If it's about arm's cont_pte, then it looks ideal because this series
> didn't yet touch cont_pte, assuming it'll just work. From that aspect, his
> work may help mine, and no immediately collapsing either.

Hi,

I'm not sure I've 100% understood the crossover between this series and my work
to support arm64's contpte mappings generally for anonymous and file-backed memory.

My approach is to transparently use contpte mappings when core-mm request pte
mappings that meet the requirements; and its all based around intercepting the
normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
large folios and 2) my "small-sized THP" series which starts using arbitrary
sized large folios for anonymous memory [2].
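
Roughly, the shape it takes is (illustrative only; see [1] below for the
real patches):

static inline pte_t ptep_get(pte_t *ptep)
{
        pte_t pte = __ptep_get(ptep);   /* the raw HW pte */

        if (!pte_valid_cont(pte))
                return pte;

        /* gather access/dirty from the whole contpte block */
        return contpte_ptep_get(ptep, pte);
}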

If I've understood this conversation correctly there is an object called hugepd,
which today is only supported by powerpc, but which could allow the core-mm to
control the mapping granularity? I can see some value in exposing that control
to core-mm in the (very) long term.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/

Thanks,
Ryan


>
> There can be a slight performance difference which I need to measure for
> arm's cont_pte already for hugetlb, but I didn't worry much about that;
> quoting my commit message in the last patch:
>
> There may be a slight difference of how the loops run when processing
> GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV
> (mostly its Svnapot extension on 64K huge pages): each loop of
> __get_user_pages() will resolve one pgtable entry with the patch
> applied, rather than relying on the size of hugetlb hstate, the latter
> may cover multiple entries in one loop.
>
> However, the performance difference should hopefully not be a major
> concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup:
> accelerate thp gup even for "pages != NULL""), and that's not part of a
> performance analysis but a side dish. If the performance will be a
> concern, we can consider handle CONT_PTE in follow_page(), for example.
>
> So IMHO it can be slightly different compared to e.g. page faults, because
> each fault is still pretty slow as a whole if there's one fault for each
> small pte (of a large folio / cont_pte), while the loop in GUP is still
> relatively tight and short compared to a fault. I'd boldly guess there are
> more low hanging fruits for large folios outside the GUP areas.
>
> In all cases, it'll be interesting to know if Ryan has worked on cont_pte
> support for gup on large folios, and whether there's any performance number
> to share. It's definitely good news to me because it means Ryan's work can
> also then benefit hugetlb if this series will be merged, I just don't know
> how much difference there will be.
>
> Thanks,
>

2023-11-23 19:41:43

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Thu, Nov 23, 2023 at 06:22:33PM +0000, Christophe Leroy wrote:
> > For fast-gup I think the hugepd code is in use, however for walk_page_*
> > apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
> > handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
> > a hugepd can be dead code to me (but note that this "dead code" is good
> > stuff to me, if one would like to merge hugetlb instead into generic mm).
>
> Not sure what you mean here. What do you mean by "dead code" ?
> A hugepage directory can be plugged at any page level, from PGD to PMD.
> So the following bit in walk_pgd_range() is valid and not dead:
>
> if (is_hugepd(__hugepd(pgd_val(*pgd))))
> err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);

IMHO it boils down to the question on whether hugepd is only used in
hugetlbfs. I think I already mentioned that above, but I can be more
explicit; what I see is that from higher stack in __walk_page_range():

if (is_vm_hugetlb_page(vma)) {
        if (ops->hugetlb_entry)
                err = walk_hugetlb_range(start, end, walk);
} else
        err = walk_pgd_range(start, end, walk);

It means to me as long as the vma is hugetlb, it'll not trigger any code in
walk_pgd_range(), but only walk_hugetlb_range(). Do you perhaps mean
hugepd is used outside hugetlbfs?

Thanks,

--
Peter Xu

2023-11-23 19:47:36

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Thu, Nov 23, 2023 at 07:11:19PM +0000, Ryan Roberts wrote:
> Hi,
>
> I'm not sure I've 100% understood the crossover between this series and my work
> to support arm64's contpte mappings generally for anonymous and file-backed memory.

No worry, there's no conflict. If you worked on that it'd only be
something nice on top. Also, I'm curious if you have performance numbers,
because I'm going to do some tests for hugetlb cont_ptes (which is only the
current plan), and if you got those it'll be a great baseline for me,
because it should be similar in your case even though the goal is slightly
different.

>
> My approach is to transparently use contpte mappings when core-mm request pte
> mappings that meet the requirements; and its all based around intercepting the
> normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
> no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
> large folios and 2) my "small-sized THP" series which starts using arbitrary
> sized large folios for anonymous memory [2].
>
> If I've understood this conversation correctly there is an object called hugepd,
> which today is only supported by powerpc, but which could allow the core-mm to
> control the mapping granularity? I can see some value in exposing that control
> to core-mm in the (very) long term.

For me it's needed immediately, because hugetlb_follow_page_mask() will be
gone after the last patch.

>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/

AFAICT you haven't yet worked on gup then, after I glimpsed the above
series.

It's a matter of whether one follow_page_mask() call can fetch more than
one page* for a cont_pte entry on aarch64 for a large non-hugetlb folio
(and if this series lands, it'll be the same to hugetlb or non-hugetlb).
Now the current code can only fetch one page I think.

Thanks,

--
Peter Xu

2023-11-24 01:08:50

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

Peter Xu <[email protected]> writes:
> On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
>> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
...
>>
>> If dropping the check is the right thing for now (and I think the ppc
>> maintainers and willy as the large folio guy might have a more useful
>> opinions than I do), leaving a comment in would be very useful.
>
> Willy is in the loop, and I just notice I didn't really copy ppc list, even
> I planned to.. I am adding the list ([email protected]) into
> this reply. I'll remember to do so as long as there's a new version.

Thanks.

> The other reason I feel like hugepd may or may not be further developed for
> new features like large folio is that I saw Power9 started to shift to
> radix pgtables, and afaics hugepd is only supported in hash tables
> (hugepd_ok()).

Because it's powerpc it's not quite that simple :}

Power9 uses the Radix MMU by default, but the hash page table MMU is
still supported.

However although hugepd is used with the hash page table MMU, that's
only when PAGE_SIZE=4K. These days none of the major distros build with
4K pages.

But some of the non-server CPU platforms also use hugepd. 32-bit 8xx
does, which is actively maintained by Christophe.

And I believe Freescale e6500 can use it, but that is basically
orphaned, and although I boot test it I don't run any hugetlb tests.
(I guess I should do that).

cheers

2023-11-24 07:03:44

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing



Le 23/11/2023 à 20:37, Peter Xu a écrit :
> On Thu, Nov 23, 2023 at 06:22:33PM +0000, Christophe Leroy wrote:
>>> For fast-gup I think the hugepd code is in use, however for walk_page_*
>>> apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
>>> handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
>>> a hugepd can be dead code to me (but note that this "dead code" is good
>>> stuff to me, if one would like to merge hugetlb instead into generic mm).
>>
>> Not sure what you mean here. What do you mean by "dead code" ?
>> A hugepage directory can be plugged at any page level, from PGD to PMD.
>> So the following bit in walk_pgd_range() is valid and not dead:
>>
>> if (is_hugepd(__hugepd(pgd_val(*pgd))))
>> err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);
>
> IMHO it boils down to the question on whether hugepd is only used in
> hugetlbfs. I think I already mentioned that above, but I can be more
> explicit; what I see is that from higher stack in __walk_page_range():
>
> if (is_vm_hugetlb_page(vma)) {
>         if (ops->hugetlb_entry)
>                 err = walk_hugetlb_range(start, end, walk);
> } else
>         err = walk_pgd_range(start, end, walk);
>
> It means to me as long as the vma is hugetlb, it'll not trigger any code in
> walk_pgd_range(), but only walk_hugetlb_range(). Do you perhaps mean
> hugepd is used outside hugetlbfs?

I added that code with commit e17eae2b8399 ("mm: pagewalk: fix walk for
hugepage tables") because I was getting crazy displays when dumping
/sys/kernel/debug/pagetables

Huge pages can be used for many things.

On powerpc 8xx, there are 4 possible page sizes: 4k, 16k, 512k and 8M.
Each PGD entry addresses a 4M area, so hugepd is used for anything using
8M pages. We could have used regular page tables instead, but it is not
worth allocating a 4k table when the HW will only read the first entry.

At the time being, linear memory mapping is performed with 8M pages, so
ptdump_walk_pgd() will walk into huge page directories.

Also, huge pages can be used in vmalloc() and in vmap(). At the time
being we support 512k pages there on the 8xx. 8M pages will be supported
once vmalloc() and vmap() support hugepd, as explained in commit
a6a8f7c4aa7e ("powerpc/8xx: add support for huge pages on VMAP and VMALLOC")

So yes as a conclusion hugepd is used outside hugetlbfs, hope it
clarifies things.

Christophe

2023-11-24 09:06:15

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On 23/11/2023 19:46, Peter Xu wrote:
> On Thu, Nov 23, 2023 at 07:11:19PM +0000, Ryan Roberts wrote:
>> Hi,
>>
>> I'm not sure I've 100% understood the crossover between this series and my work
>> to support arm64's contpte mappings generally for anonymous and file-backed memory.
>
> No worry, there's no conflict. If you worked on that it'd only be
> something nice on top. Also, I'm curious if you have performance numbers,

I have perf numbers for high level use cases (kernel compilation and Speedometer
Java Script benchmarks) at
https://lore.kernel.org/linux-arm-kernel/[email protected]/

I don't have any micro-benchmarks for GUP though, if that's your question. Is
there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.

> because I'm going to do some tests for hugetlb cont_ptes (which is only the
> current plan), and if you got those it'll be a great baseline for me,
> because it should be similar in your case even though the goal is slightly
> different.
>
>>
>> My approach is to transparently use contpte mappings when core-mm request pte
>> mappings that meet the requirements; and its all based around intercepting the
>> normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
>> no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
>> large folios and 2) my "small-sized THP" series which starts using arbitrary
>> sized large folios for anonymous memory [2].
>>
>> If I've understood this conversation correctly there is an object called hugepd,
>> which today is only supported by powerpc, but which could allow the core-mm to
>> control the mapping granularity? I can see some value in exposing that control
>> to core-mm in the (very) long term.
>
> For me it's needed immediately, because hugetlb_follow_page_mask() will be
> gone after the last patch.
>
>>
>> [1] https://lore.kernel.org/all/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/[email protected]/
>
> AFAICT you haven't yet worked on gup then, after I glimpsed the above
> series.

No, I haven't touched GUP at all. The approach is fully inside the arm64 arch
code (except 1 patch to core-mm which enables an optimization). So as far as GUP
and the rest of the core-mm is concerned, there are still only page-sized ptes
and they can all be iterated over and accessed as normal.

>
> It's a matter of whether one follow_page_mask() call can fetch more than
> one page* for a cont_pte entry on aarch64 for a large non-hugetlb folio
> (and if this series lands, it'll be the same to hugetlb or non-hugetlb).
> Now the current code can only fetch one page I think.
>
> Thanks,
>

2023-11-24 16:08:11

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
> I don't have any micro-benchmarks for GUP though, if that's your question. Is
> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.

Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
from your side, afaict. I'll see whether I can provide some rough numbers
instead in the next post (I'll probably only be able to test it in a VM,
though, but hopefully that should still reflect mostly the truth).

--
Peter Xu

2023-11-24 18:17:05

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

Hi, Christophe, Michael, Aneesh,

[I'll reply altogether here]

On Fri, Nov 24, 2023 at 07:03:11AM +0000, Christophe Leroy wrote:
> I added that code with commit e17eae2b8399 ("mm: pagewalk: fix walk for
> hugepage tables") because I was getting crazy displays when dumping
> /sys/kernel/debug/pagetables
>
> Huge pages can be used for many things.
>
> On powerpc 8xx, there are 4 possible page sizes: 4k, 16k, 512k and 8M.
> Each PGD entry addresses a 4M area, so hugepd is used for anything using
> 8M pages. We could have used regular page tables instead, but it is not
> worth allocating a 4k table when the HW will only read the first entry.
>
> At the time being, linear memory mapping is performed with 8M pages, so
> ptdump_walk_pgd() will walk into huge page directories.
>
> Also, huge pages can be used in vmalloc() and in vmap(). At the time
> being we support 512k pages there on the 8xx. 8M pages will be supported
> once vmalloc() and vmap() support hugepd, as explained in commit
> a6a8f7c4aa7e ("powerpc/8xx: add support for huge pages on VMAP and VMALLOC")
>
> So yes as a conclusion hugepd is used outside hugetlbfs, hope it
> clarifies things.

Yes it does, thanks a lot for all of your replies.

So I think this is what I missed: on Freescale ppc 8xx there's a special
hugepd_populate_kernel() defined to install kernel pgtables for hugepd.
Obviously I didn't check further than hugepd_populate() when I first
looked, and stopped at the first instance of hugepd_populate() definition
on the 64 bits ppc.

For this specific patch: I suppose it is still fine to reuse the fast-gup
function, because the assumption still holds when there's a VMA present
(GUP applies only to user mappings, so nothing like KASAN should ever pop
up). So AFAIU it's still true that hugepd is only used for hugetlb pages
in this case, even for Freescale 8xx, and nothing should explode yet. So
maybe I can still keep the code changes.

However the comment at least definitely needs fixing (I'm going to add
one, which hch requested and I agree with); it's not yet in the patch I
posted here but I'll refine it locally.

For the whole work: the purpose is to start merging hugetlb pgtable
processing with generic mm. That is my take on previous lsfmm discussions
in the community on how we should move forward with hugetlb in the future,
to avoid code duplication against generic mm. Hugetlb is kind of blocked
on adding new (especially large) features in general because of such
complexity. This is all about that, even if only a small step towards it.

I see that there seems to be a trend of making hugepd more general.
Christophe's fix for dumping pgtables is exactly what I would also look
for if we keep going. I hope that's the right way to go.

I'll also need to think more about how this will affect my plan; currently
it seems all fine: I won't ever try to change any kernel mapping specific
code. I suppose any hugetlbfs based test should still cover all the code I
will touch on hugepd. Then it should just work for kernel mappings on
Freescale; it'll be great if e.g. Christophe can help me double check that
once the series stabilizes in a few versions. Any hints on testing will be
more than welcome, either specific test cases or general pointers;
currently I'm still at the phase of looking for a usable ppc system - QEMU
tcg ppc64 emulation on x86 is slow enough that I've already given up on it.

Considering that hugepd is special to ppc and the possibility that I'll
break it, there's yet another option, which is to only apply the new logic
to archs with !ARCH_HAS_HUGEPD. It'll make my life easier, but it also
means that even if my attempt works out, anything new will rule ppc out by
default. And we'd have a bunch of "#ifdef ARCH_HAS_HUGEPD" in generic
code, which is not preferred either. For gup it might be relatively easy
compared to the rest. I'm still hesitating on the long term plan.

Please let me know if you have any thoughts on any of above.

Thanks!

--
Peter Xu

2023-11-30 21:41:28

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
> > I don't have any micro-benchmarks for GUP though, if that's your question. Is
> > there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>
> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
> from your side, afaict. I'll see whether I can provide some rough numbers
> instead in the next post (I'll probably only be able to test it in a VM,
> though, but hopefully that should still reflect mostly the truth).

An update: I finished a round of 64K cont_pte test, in the slow gup micro
benchmark I see ~15% perf degrade with this patchset applied on a VM on top
of Apple M1.

Frankly that's even less than I expected, considering not only how slow gup
THP used to be, but also on the fact that that's a tight loop over slow
gup, which in normal cases shouldn't happen: "present" ptes normally goes
to fast-gup, while !present goes into a fault following it. I assume
that's why nobody cared slow gup for THP before. I think adding cont_pte
support shouldn't be very hard, but that will include making cont_pte idea
global just for arm64 and riscv Svnapot.

The current plan is to add that performance number into my commit message
only, as I don't expect any real workload will regress with it. Maybe a
global cont_pte api will be needed at some point, but I don't yet feel
strongly about it for this use case.

Please feel free to raise any concerns otherwise.

Thanks,

--
Peter Xu

2023-12-03 13:33:44

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing



Le 30/11/2023 à 22:30, Peter Xu a écrit :
> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>
>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>> from your side, afaict. I'll see whether I can provide some rough numbers
>> instead in the next post (I'll probably only be able to test it in a VM,
>> though, but hopefully that should still reflect mostly the truth).
>
> An update: I finished a round of 64K cont_pte test, in the slow gup micro
> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
> of Apple M1.
>
> Frankly that's even less than I expected, considering not only how slow gup
> THP used to be, but also on the fact that that's a tight loop over slow
> gup, which in normal cases shouldn't happen: "present" ptes normally goes
> to fast-gup, while !present goes into a fault following it. I assume
> that's why nobody cared slow gup for THP before. I think adding cont_pte
> support shouldn't be very hard, but that will include making cont_pte idea
> global just for arm64 and riscv Svnapot.

Is there any documentation on what cont_pte is ? I have always wondered
if it could also fit powerpc 8xx need ?

On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
misses the HW needs one entrie for each 4k fragment.

There is also a similar approach for 512k pages, we have 128 contiguous
identical PTEs for them.

And whatever PAGE_SIZE is (either 4k or 16k), the HW needs one 'unsigned
long' pte for each 4k fragment. So at the time being when we define
PAGE_SIZE as 16k, we need a special pte_t which is a table of 4x
unsigned long.
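
Concretely, with 16k base pages that special pte_t ends up being something
like:

/* roughly what the 8xx uses today when PAGE_SIZE is 16k */
typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;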

Wondering if the cont_pte concept is similar and whether it could help.

Thanks
Christophe

2023-12-04 11:11:48

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On 03/12/2023 13:33, Christophe Leroy wrote:
>
>
> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>
>>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>>> from your side, afaict. I'll see whether I can provide some rough numbers
>>> instead in the next post (I'll probably only be able to test it in a VM,
>>> though, but hopefully that should still reflect mostly the truth).
>>
>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>> of Apple M1.
>>
>> Frankly that's even less than I expected, considering not only how slow gup
>> THP used to be, but also on the fact that that's a tight loop over slow
>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>> to fast-gup, while !present goes into a fault following it. I assume
>> that's why nobody cared slow gup for THP before. I think adding cont_pte
>> support shouldn't be very hard, but that will include making cont_pte idea
>> global just for arm64 and riscv Svnapot.
>
> Is there any documentation on what cont_pte is ? I have always wondered
> if it could also fit powerpc 8xx need ?

pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
"contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
(AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
are mapping a physically contiguous and naturally aligned piece of memory. The
HW can use this to coalesce entries in the TLB. When using 4K base pages, the
contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
base pages, its 2M (32 PTEs).
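
Purely to illustrate the semantics (this is not code from my series), using
the contiguous bit amounts to something like:

/*
 * Hypothetical sketch: mark CONT_PTES worth of already physically
 * contiguous, naturally aligned ptes with the contiguous bit so the
 * TLB can coalesce them into a single entry.
 */
static void set_cont_ptes(struct mm_struct *mm, unsigned long addr,
                          pte_t *ptep, pte_t pte)
{
        unsigned long pfn = pte_pfn(pte);
        pgprot_t prot = pte_pgprot(pte);
        unsigned int i;

        for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE, pfn++)
                set_pte_at(mm, addr, ptep, pte_mkcont(pfn_pte(pfn, prot)));
}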

>
> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
> misses the HW needs one entrie for each 4k fragment.

From that description, it sounds like the SPS bit might be similar to arm64
contiguous bit? Although sounds like you are currently using it in a slightly
different way - telling kernel that the base page is 16K but mapping each 16K
page with 4x 4K entries (plus the SPS bit set)?

>
> There is also a similar approach for 512k pages, we have 128 contiguous
> identical PTEs for them.
>
> And whatever PAGE_SIZE is (either 4k or 16k), the HW needs one 'unsigned
> long' pte for each 4k fragment. So at the time being when we define
> PAGE_SIZE as 16k, we need a special pte_t which is a table of 4x
> unsigned long.
>
> Wondering if the cont_pte concept is similar and whether it could help.

To be honest, while I understand pte_cont() and friends, I don't understand
their relevance (or at least potential future relevance) to GUP?

>
> Thanks
> Christophe

2023-12-04 11:47:04

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On 04/12/2023 11:25, Christophe Leroy wrote:
>
>
> Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
>> On 03/12/2023 13:33, Christophe Leroy wrote:
>>>
>>>
>>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>>
>>>>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>>>>> from your side, afaict. I'll see whether I can provide some rough numbers
>>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>>> though, but hopefully that should still reflect mostly the truth).
>>>>
>>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>>> of Apple M1.
>>>>
>>>> Frankly that's even less than I expected, considering not only how slow gup
>>>> THP used to be, but also on the fact that that's a tight loop over slow
>>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>>> to fast-gup, while !present goes into a fault following it. I assume
>>>> that's why nobody cared slow gup for THP before. I think adding cont_pte
>>>> support shouldn't be very hard, but that will include making cont_pte idea
>>>> global just for arm64 and riscv Svnapot.
>>>
>>> Is there any documentation on what cont_pte is ? I have always wondered
>>> if it could also fit powerpc 8xx need ?
>>
>> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
>> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
>> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
>> are mapping a physically contiguous and naturally aligned piece of memory. The
>> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
>> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
>> base pages, its 2M (32 PTEs).
>>
>>>
>>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>>> misses the HW needs one entrie for each 4k fragment.
>>
>> From that description, it sounds like the SPS bit might be similar to arm64
>> contiguous bit? Although sounds like you are currently using it in a slightly
>> different way - telling kernel that the base page is 16K but mapping each 16K
>> page with 4x 4K entries (plus the SPS bit set)?
>
> Yes it's both.
>
> When the base page is 16k, there are 4x 4k entries (with SPS bit set) in
> the page table, and pte_t is a table of 4 'unsigned long'
>
> When the base page is 4k, there is a 16k hugepage size, which is the
> same 4x 4k entries with SPS bit set.
>
> So it looks similar to the contiguous bit.
>
>
> And by extension, the same principle is used for 512k hugepages, the bit
> _PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS,
> PS being as follows:
> - 00 Small (4 Kbyte or 16 Kbyte)
> - 01 512 Kbyte
> - 10 Reserved
> - 11 8 Mbyte
>
> So as PMD size is 4M, 512k pages are 128 identical consecutive entries
> in the page table.
>
> I which I could have THP with 16k or 512k pages.

Then you have come to the right place! :)

https://lore.kernel.org/linux-mm/[email protected]/


>
> Christophe

2023-12-04 11:55:07

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing



Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
> On 03/12/2023 13:33, Christophe Leroy wrote:
>>
>>
>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>
>>>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>>>> from your side, afaict. I'll see whether I can provide some rough numbers
>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>> though, but hopefully that should still reflect mostly the truth).
>>>
>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>> of Apple M1.
>>>
>>> Frankly that's even less than I expected, considering not only how slow gup
>>> THP used to be, but also on the fact that that's a tight loop over slow
>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>> to fast-gup, while !present goes into a fault following it. I assume
>>> that's why nobody cared slow gup for THP before. I think adding cont_pte
>>> support shouldn't be very hard, but that will include making cont_pte idea
>>> global just for arm64 and riscv Svnapot.
>>
>> Is there any documentation on what cont_pte is ? I have always wondered
>> if it could also fit powerpc 8xx need ?
>
> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
> are mapping a physically contiguous and naturally aligned piece of memory. The
> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
> base pages, its 2M (32 PTEs).
>
>>
>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>> misses the HW needs one entrie for each 4k fragment.
>
> From that description, it sounds like the SPS bit might be similar to arm64
> contiguous bit? Although sounds like you are currently using it in a slightly
> different way - telling kernel that the base page is 16K but mapping each 16K
> page with 4x 4K entries (plus the SPS bit set)?

Yes it's both.

When the base page is 16k, there are 4x 4k entries (with SPS bit set) in
the page table, and pte_t is a table of 4 'unsigned long'

When the base page is 4k, there is a 16k hugepage size, which is the
same 4x 4k entries with SPS bit set.

So it looks similar to the contiguous bit.


And by extension, the same principle is used for 512k hugepages, the bit
_PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS,
PS being as follows:
- 00 Small (4 Kbyte or 16 Kbyte)
- 01 512 Kbyte
- 10 Reserved
- 11 8 Mbyte

So as PMD size is 4M, 512k pages are 128 identical consecutive entries
in the page table.

I which I could have THP with 16k or 512k pages.

Christophe

2023-12-04 12:00:05

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing



Le 04/12/2023 à 12:46, Ryan Roberts a écrit :
> On 04/12/2023 11:25, Christophe Leroy wrote:
>>
>>
>> Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
>>> On 03/12/2023 13:33, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>>>
>>>>>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>>>>>> from your side, afaict. I'll see whether I can provide some rough numbers
>>>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>>>> though, but hopefully that should still reflect mostly the truth).
>>>>>
>>>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>>>> of Apple M1.
>>>>>
>>>>> Frankly that's even less than I expected, considering not only how slow gup
>>>>> THP used to be, but also on the fact that that's a tight loop over slow
>>>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>>>> to fast-gup, while !present goes into a fault following it. I assume
>>>>> that's why nobody cared slow gup for THP before. I think adding cont_pte
>>>>> support shouldn't be very hard, but that will include making cont_pte idea
>>>>> global just for arm64 and riscv Svnapot.
>>>>
>>>> Is there any documentation on what cont_pte is ? I have always wondered
>>>> if it could also fit powerpc 8xx need ?
>>>
>>> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
>>> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
>>> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
>>> are mapping a physically contiguous and naturally aligned piece of memory. The
>>> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
>>> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
>>> base pages, its 2M (32 PTEs).
>>>
>>>>
>>>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>>>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>>>> misses the HW needs one entrie for each 4k fragment.
>>>
>>> From that description, it sounds like the SPS bit might be similar to arm64
>>> contiguous bit? Although sounds like you are currently using it in a slightly
>>> different way - telling kernel that the base page is 16K but mapping each 16K
>>> page with 4x 4K entries (plus the SPS bit set)?
>>
>> Yes it's both.
>>
>> When the base page is 16k, there are 4x 4k entries (with SPS bit set) in
>> the page table, and pte_t is a table of 4 'unsigned long'
>>
>> When the base page is 4k, there is a 16k hugepage size, which is the
>> same 4x 4k entries with SPS bit set.
>>
>> So it looks similar to the contiguous bit.
>>
>>
>> And by extension, the same principle is used for 512k hugepages, the bit
>> _PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS,
>> PS being as follows:
>> - 00 Small (4 Kbyte or 16 Kbyte)
>> - 01 512 Kbyte
>> - 10 Reserved
>> - 11 8 Mbyte
>>
>> So as PMD size is 4M, 512k pages are 128 identical consecutive entries
>> in the page table.
>>
>> I which I could have THP with 16k or 512k pages.
>
> Then you have come to the right place! :)
>
> https://lore.kernel.org/linux-mm/[email protected]/
>

That looks great. That series only modifies core mm/.
No changes needed in arch? Will it work on powerpc without any
changes/additions to arch code?

Well, I'll try it soon.

Christophe

2023-12-04 12:04:02

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On 04/12/2023 11:57, Christophe Leroy wrote:
>
>
> Le 04/12/2023 à 12:46, Ryan Roberts a écrit :
>> On 04/12/2023 11:25, Christophe Leroy wrote:
>>>
>>>
>>> Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
>>>> On 03/12/2023 13:33, Christophe Leroy wrote:
>>>>>
>>>>>
>>>>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>>>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>>>>
>>>>>>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>>>>>>> from your side, afaict. I'll see whether I can provide some rough numbers
>>>>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>>>>> though, but hopefully that should still reflect mostly the truth).
>>>>>>
>>>>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>>>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>>>>> of Apple M1.
>>>>>>
>>>>>> Frankly that's even less than I expected, considering not only how slow gup
>>>>>> THP used to be, but also on the fact that that's a tight loop over slow
>>>>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>>>>> to fast-gup, while !present goes into a fault following it. I assume
>>>>>> that's why nobody cared slow gup for THP before. I think adding cont_pte
>>>>>> support shouldn't be very hard, but that will include making cont_pte idea
>>>>>> global just for arm64 and riscv Svnapot.
>>>>>
>>>>> Is there any documentation on what cont_pte is ? I have always wondered
>>>>> if it could also fit powerpc 8xx need ?
>>>>
>>>> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
>>>> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
>>>> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
>>>> are mapping a physically contiguous and naturally aligned piece of memory. The
>>>> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
>>>> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
>>>> base pages, its 2M (32 PTEs).
>>>>
>>>>>
>>>>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>>>>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>>>>> misses the HW needs one entrie for each 4k fragment.
>>>>
>>>> From that description, it sounds like the SPS bit might be similar to arm64
>>>> contiguous bit? Although sounds like you are currently using it in a slightly
>>>> different way - telling kernel that the base page is 16K but mapping each 16K
>>>> page with 4x 4K entries (plus the SPS bit set)?
>>>
>>> Yes it's both.
>>>
>>> When the base page is 16k, there are 4x 4k entries (with SPS bit set) in
>>> the page table, and pte_t is a table of 4 'unsigned long'
>>>
>>> When the base page is 4k, there is a 16k hugepage size, which is the
>>> same 4x 4k entries with SPS bit set.
>>>
>>> So it looks similar to the contiguous bit.
>>>
>>>
>>> And by extension, the same principle is used for 512k hugepages, the bit
>>> _PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS,
>>> PS being as follows:
>>> - 00 Small (4 Kbyte or 16 Kbyte)
>>> - 01 512 Kbyte
>>> - 10 Reserved
>>> - 11 8 Mbyte
>>>
>>> So as PMD size is 4M, 512k pages are 128 identical consecutive entries
>>> in the page table.
>>>
>>> I which I could have THP with 16k or 512k pages.
>>
>> Then you have come to the right place! :)
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>
> That looks great. That series only modifies core mm/.
> No changes needed in arch? Will it work on powerpc without any
> changes/additions to arch code?

Yes there are also changes needed in arch; I have a separate series for arm64,
which transparently manages the contiguous bit when it sees appropriate PTEs:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

>
> Well, I'll try it soon.
>
> Christophe

2023-12-04 16:49:10

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Mon, Dec 04, 2023 at 11:11:26AM +0000, Ryan Roberts wrote:
> To be honest, while I understand pte_cont() and friends, I don't understand
> their relevance (or at least potential future relevance) to GUP?

GUP in general can be smarter about recognizing whether a pte/pmd is a
cont_pte and fetch the whole pte/pmd range if the caller asked for it.
Right now it loops over each pte/pmd.

Fast-gup is better as it at least doesn't take the pgtable lock; for
cont_pte it loops inside gup_pte_range(), which is good enough, but it'll
still do folio checks for each sub-pte, even though the 2nd+ folio checks
should be mostly the same (ignoring races where the folio changes while
the cont_pte chunk is being processed).

Slow-gup (which is what this series is about so far) doesn't do that
either; for each whole cont_pte entry it'll loop N times, frequently
taking and releasing the pgtable lock. A smarter slow-gup could
fundamentally set up follow_page_context.page_mask when it sees a
cont_pte. There might be a challenge on whether holding the head page's
refcount would stabilize the whole folio, but that may be another question
to ask.
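
As a rough illustration only (arm64 naming, and ignoring that
follow_page_pte() doesn't currently take the ctx pointer), the pte path
could report the whole block with something like:

        /* after the usual vm_normal_page() + pin checks ... */
        if (pte_cont(pte))
                ctx->page_mask = CONT_PTES - 1;

and then __get_user_pages() would advance over the rest of the naturally
aligned block in one iteration, like it already does for PMD/PUD leaves.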

I also overlooked that PPC_8XX has cont_pte support too, so we actually
have three users already, not counting potential future archs adding
support to get the same TLB benefit.

Thanks,

--
Peter Xu