2024-03-21 22:08:19

by Peter Xu

Subject: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

From: Peter Xu <[email protected]>

v3:
- Rebased to latest mm-unstable (a824831a082f, of March 21st)
- Dropped the patch introducing pmd_thp_or_huge(); replaced such uses (and
also pXd_huge() users) with pXd_leaf() [Jason]
- Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
- Use IS_ENABLED() in follow_huge_pud() [Jason]
- Remove redundant none pud check in follow_pud_mask() [Jason]

rfc: https://lore.kernel.org/r/[email protected]
v1: https://lore.kernel.org/r/[email protected]
v2: https://lore.kernel.org/r/[email protected]

This series removes the hugetlb slow gup path after previous refactoring
work [1], so that slow gup now uses the exact same path to process all
kinds of memory, including hugetlb.

In the long term, we may want to remove most, if not all, call sites of
huge_pte_offset(). It would be ideal if that API could be dropped from the
arch hugetlb API entirely. This series is one small step towards merging
hugetlb-specific code into the generic mm paths; from that point of view,
it removes one of the many references to huge_pte_offset().

One goal of this route is that we can reconsider merging hugetlb features
like High Granularity Mapping (HGM). It was not accepted in the past
because it would add a lot of hugetlb-specific code and make the mm code
even harder to maintain. With a merged codebase, features like HGM can
hopefully share some code with THP, whether legacy (PMD+) or modern
(contiguous PTEs).

To make this work, the generic slow gup code needs to at least understand
hugepd, which fast-gup already does. Since hugepd is a software-only
solution (no hardware recognizes the hugepd format, so the structures are
purely artificial), there is a chance we can merge some or all hugepd
formats with cont_pte in the future. That question is still unsettled and
awaits an acknowledgement from the Power side. For now this series keeps
the hugepd handling, because we may still need it until the future of
hugepd becomes clearer. The other reason is simply that fast-gup already
handles it and most of that code is still around to be reused. It makes
more sense to keep slow and fast gup behaving the same until a decision is
made to remove hugepd.

There is one major difference for slow gup in cont_pte / cont_pmd
handling, currently supported on three architectures (aarch64, riscv,
ppc). Before this series, slow gup could recognize e.g. cont_pte entries
with the help of huge_pte_offset() when an hstate was around. That is now
gone, but things still work by looking up pgtable entries one by one.
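
As an illustrative example (numbers only for illustration): with 64KB
cont_pte hugetlb pages on a 4K-page aarch64 kernel, each huge page is
mapped by 64KB / 4KB = 16 contiguous ptes. Previously a single
huge_pte_offset() lookup covered the whole 64KB in one slow-gup iteration;
with this series the generic walker resolves those 16 pte entries one by
one.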

It's not ideal, but hopefully this change should not yet affect major
workloads. There is some more information in the commit message of the
last patch. If this becomes a concern, we can consider teaching slow gup
to recognize cont_pte/cont_pmd entries, which should recover the lost
performance. But I doubt that is necessary for now, so I kept it as simple
as it can be.

Test Done
=========

For x86_64, tested full gup_test matrix over 2MB huge pages. For aarch64,
tested the same over 64KB cont_pte huge pages.

One note is that this v3 did not go through any ppc testing, as finding
such a system always takes time. This relies on the fact that previous
versions were tested, and this version should have zero changes in the
hugepd sections.

If anyone (Christophe?) wants to give it a shot on PowerPC, please do and
I would appreciate it: "./run_vmtests.sh -a -t gup_test" should do well
enough (please consider applying [2] if hugepd is <1MB), as long as we
make sure the hugepd pages are touched as expected.

Patch layout
=============

Patches 1-8: Preparation work and cleanups in relevant code paths
Patches 9-11: Teach slow gup about all kinds of huge entries (pXd, hugepd)
Patch 12: Drop hugetlb_follow_page_mask()

More information can be found in the commit messages of each patch. Any
comments are welcome. Thanks.

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/r/[email protected]

Peter Xu (12):
mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static
mm: Make HPAGE_PXD_* macros even if !THP
mm: Introduce vma_pgtable_walk_{begin|end}()
mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
mm/gup: Refactor record_subpages() to find 1st small page
mm/gup: Handle hugetlb for no_page_table()
mm/gup: Cache *pudp in follow_pud_mask()
mm/gup: Handle huge pud for follow_pud_mask()
mm/gup: Handle huge pmd for follow_pmd_mask()
mm/gup: Handle hugepd for follow_page()
mm/gup: Handle hugetlb in the generic follow_page_mask code

include/linux/huge_mm.h | 25 +--
include/linux/hugetlb.h | 16 +-
include/linux/mm.h | 3 +
mm/Kconfig | 6 +
mm/gup.c | 354 +++++++++++++++++++++++++++++++++-------
mm/huge_memory.c | 133 +--------------
mm/hugetlb.c | 75 +--------
mm/internal.h | 7 +-
mm/memory.c | 12 ++
9 files changed, 337 insertions(+), 294 deletions(-)

--
2.44.0



2024-03-21 22:08:26

by Peter Xu

Subject: [PATCH v3 01/12] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES

From: Peter Xu <[email protected]>

Introduce a config option that will be selected whenever huge leaves can
be involved in pgtables (THP or hugetlbfs). It is useful for marking any
code that can process either hugetlb or THP pages at any level higher
than the pte level.
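
For illustration, below is a sketch of the intended usage pattern; it
mirrors how the later gup patches in this series use the option
(follow_huge_pud() and struct follow_page_context are introduced in
patch 9):

#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
/*
 * Real handling of pgtable leaves larger than PAGE_SIZE (THP or
 * hugetlb), e.g. follow_huge_pud() from patch 9, lives here.
 */
#else /* !CONFIG_PGTABLE_HAS_HUGE_LEAVES */
/* No huge leaves can exist in this config: handlers become trivial stubs. */
static struct page *follow_huge_pud(struct vm_area_struct *vma,
				    unsigned long addr, pud_t *pudp,
				    int flags, struct follow_page_context *ctx)
{
	return NULL;
}
#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */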

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/Kconfig | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index b924f4a5a3ef..497cdf4d8ebf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -850,6 +850,12 @@ config READ_ONLY_THP_FOR_FS

endif # TRANSPARENT_HUGEPAGE

+#
+# The architecture supports pgtable leaves that are larger than PAGE_SIZE
+#
+config PGTABLE_HAS_HUGE_LEAVES
+ def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+
#
# UP and nommu archs use km based percpu allocator
#
--
2.44.0


2024-03-21 22:08:46

by Peter Xu

Subject: [PATCH v3 02/12] mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static

From: Peter Xu <[email protected]>

It will be used outside hugetlb.c soon.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 9 +++++++++
mm/hugetlb.c | 4 ++--
2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 300de33c6fde..52d9efcf1edf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -174,6 +174,9 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);

pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pud_t *pud);
+bool hugetlbfs_pagecache_present(struct hstate *h,
+ struct vm_area_struct *vma,
+ unsigned long address);

struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);

@@ -1197,6 +1200,12 @@ static inline void hugetlb_register_node(struct node *node)
static inline void hugetlb_unregister_node(struct node *node)
{
}
+
+static inline bool hugetlbfs_pagecache_present(
+ struct hstate *h, struct vm_area_struct *vma, unsigned long address)
+{
+ return false;
+}
#endif /* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 23ef240ba48a..abec04575c89 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6129,8 +6129,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Return whether there is a pagecache page to back given address within VMA.
*/
-static bool hugetlbfs_pagecache_present(struct hstate *h,
- struct vm_area_struct *vma, unsigned long address)
+bool hugetlbfs_pagecache_present(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long address)
{
struct address_space *mapping = vma->vm_file->f_mapping;
pgoff_t idx = linear_page_index(vma, address);
--
2.44.0


2024-03-21 22:08:54

by Peter Xu

Subject: [PATCH v3 03/12] mm: Make HPAGE_PXD_* macros even if !THP

From: Peter Xu <[email protected]>

These macros will be helpful when we merge hugetlb code into the generic
code. Move them out and define them even if !THP.

We already define HPAGE_PMD_NR for other reasons even when !THP.
Reorganize these macros.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/huge_mm.h | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de0c89105076..3bcdfc7e5d57 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,9 +64,6 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;

-#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
-#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
-
/*
* Mask of all large folio orders supported for anonymous THP; all orders up to
* and including PMD_ORDER, except order-0 (which is not "huge") and order-1
@@ -87,14 +84,19 @@ extern struct kobj_attribute shmem_enabled_attr;
#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
(!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order)))

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

#define HPAGE_PUD_SHIFT PUD_SHIFT
#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE

extern unsigned long transparent_hugepage_flags;
extern unsigned long huge_anon_orders_always;
@@ -378,13 +380,6 @@ static inline bool thp_migration_supported(void)
}

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
-
-#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

static inline bool folio_test_pmd_mappable(struct folio *folio)
{
--
2.44.0


2024-03-21 22:09:07

by Peter Xu

Subject: [PATCH v3 04/12] mm: Introduce vma_pgtable_walk_{begin|end}()

From: Peter Xu <[email protected]>

Introduce per-vma begin()/end() helpers for pgtable walks. This is a
preparation work to merge hugetlb pgtable walkers with generic mm.

The helpers need to be called before and after a pgtable walk; they will
start to be needed once the pgtable walker code supports hugetlb pages.
They are a hook point for any type of VMA, but for now only hugetlb uses
them, to stabilize the pgtable pages against going away (due to possible
pmd unsharing).
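
For illustration, a caller is expected to bracket its walk like below (a
sketch only; patch 12 of this series wires the pair into
follow_page_mask()):

	vma_pgtable_walk_begin(vma);	/* hugetlb: hugetlb_vma_lock_read() */
	/* ... walk pgd/p4d/pud/pmd/pte of this vma ... */
	vma_pgtable_walk_end(vma);	/* hugetlb: hugetlb_vma_unlock_read() */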

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 3 +++
mm/memory.c | 12 ++++++++++++
2 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8147b1302413..d10eb89f4096 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4198,4 +4198,7 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
}

+void vma_pgtable_walk_begin(struct vm_area_struct *vma);
+void vma_pgtable_walk_end(struct vm_area_struct *vma);
+
#endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index 9bce1fa76dd7..4f2caf1c3c4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6438,3 +6438,15 @@ void ptlock_free(struct ptdesc *ptdesc)
kmem_cache_free(page_ptl_cachep, ptdesc->ptl);
}
#endif
+
+void vma_pgtable_walk_begin(struct vm_area_struct *vma)
+{
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_lock_read(vma);
+}
+
+void vma_pgtable_walk_end(struct vm_area_struct *vma)
+{
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_unlock_read(vma);
+}
--
2.44.0


2024-03-21 22:09:30

by Peter Xu

Subject: [PATCH v3 05/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

From: Peter Xu <[email protected]>

The hugepd format for GUP is only used on PowerPC with hugetlbfs. There
is some kernel usage of hugepd (see hugepd_populate_kernel() for PPC_8XX),
however those pages are not candidates for GUP.

Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings") added a check to fail gup-fast if there is a
potential risk of violating GUP over writeback file systems. That should
never apply to hugepd. Considering that hugepd is an old format (and
software-only at that), there is no plan to extend hugepd to other
file-backed memory types that are prone to the same issue.

Drop that check, not only because it will never be true for hugepd per
any known plan, but also because it paves the way for reusing the function
outside fast-gup.

To make sure we still remember this issue in case hugepd is ever extended
to support non-hugetlbfs memory, add a rich comment above gup_huge_pd()
explaining the issue with proper references.

Cc: Christoph Hellwig <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 484a7c70d121..9127ec5515ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2831,11 +2831,6 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}

- if (!folio_fast_pin_allowed(folio, flags)) {
- gup_put_folio(folio, refs, flags);
- return 0;
- }
-
if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
return 0;
@@ -2846,6 +2841,14 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 1;
}

+/*
+ * NOTE: currently GUP for a hugepd is only possible on hugetlbfs file
+ * systems on Power, which does not have issues with folio writeback against
+ * GUP updates. If hugepd is ever extended to support non-hugetlbfs or
+ * even anonymous memory, we will need to do the extra checks that we do
+ * with most of the other folios. See writable_file_mapping_allowed() and
+ * folio_fast_pin_allowed() for more information.
+ */
static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
unsigned int pdshift, unsigned long end, unsigned int flags,
struct page **pages, int *nr)
--
2.44.0


2024-03-21 22:09:40

by Peter Xu

Subject: [PATCH v3 06/12] mm/gup: Refactor record_subpages() to find 1st small page

From: Peter Xu <[email protected]>

All the fast-gup functions take a tail page to operate on, and always
need to do page-mask calculations before feeding it into
record_subpages().

Merge that logic into record_subpages(), so that it does the nth_page()
calculation itself.
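
As an illustration (the 0x3000 offset is an example only), the PMD path
then becomes:

	/* addr is 0x3000 into a PMD-mapped 2MB page: subpage index is 3 */
	page = pmd_page(orig);
	refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
	/*
	 * record_subpages() now internally starts from
	 * nth_page(page, (addr & (PMD_SIZE - 1)) >> PAGE_SHIFT).
	 */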

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 9127ec5515ac..f3ae8f6ce8a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2778,13 +2778,16 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
}
#endif

-static int record_subpages(struct page *page, unsigned long addr,
- unsigned long end, struct page **pages)
+static int record_subpages(struct page *page, unsigned long sz,
+ unsigned long addr, unsigned long end,
+ struct page **pages)
{
+ struct page *start_page;
int nr;

+ start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT);
for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
- pages[nr] = nth_page(page, nr);
+ pages[nr] = nth_page(start_page, nr);

return nr;
}
@@ -2819,8 +2822,8 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));

- page = nth_page(pte_page(pte), (addr & (sz - 1)) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pte_page(pte);
+ refs = record_subpages(page, sz, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2893,8 +2896,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
pages, nr);
}

- page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pmd_page(orig);
+ refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2937,8 +2940,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
pages, nr);
}

- page = nth_page(pud_page(orig), (addr & ~PUD_MASK) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pud_page(orig);
+ refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2977,8 +2980,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,

BUILD_BUG_ON(pgd_devmap(orig));

- page = nth_page(pgd_page(orig), (addr & ~PGDIR_MASK) >> PAGE_SHIFT);
- refs = record_subpages(page, addr, end, pages + *nr);
+ page = pgd_page(orig);
+ refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);

folio = try_grab_folio(page, refs, flags);
if (!folio)
--
2.44.0


2024-03-21 22:09:50

by Peter Xu

Subject: [PATCH v3 07/12] mm/gup: Handle hugetlb for no_page_table()

From: Peter Xu <[email protected]>

no_page_table() is not yet used for hugetlb code paths. Prepare it for
that.

The major difference here is that hugetlb returns -EFAULT whenever the
page cache does not exist, even for VM_SHARED. See
hugetlb_follow_page_mask().

Pass "address" into no_page_table() too, as hugetlb will need it.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f3ae8f6ce8a4..943671736080 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -501,19 +501,27 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)

#ifdef CONFIG_MMU
static struct page *no_page_table(struct vm_area_struct *vma,
- unsigned int flags)
+ unsigned int flags, unsigned long address)
{
+ if (!(flags & FOLL_DUMP))
+ return NULL;
+
/*
- * When core dumping an enormous anonymous area that nobody
- * has touched so far, we don't want to allocate unnecessary pages or
+ * When core dumping, we don't want to allocate unnecessary pages or
* page tables. Return error instead of NULL to skip handle_mm_fault,
* then get_dump_page() will return NULL to leave a hole in the dump.
* But we can only make this optimization where a hole would surely
* be zero-filled if handle_mm_fault() actually did handle it.
*/
- if ((flags & FOLL_DUMP) &&
- (vma_is_anonymous(vma) || !vma->vm_ops->fault))
+ if (is_vm_hugetlb_page(vma)) {
+ struct hstate *h = hstate_vma(vma);
+
+ if (!hugetlbfs_pagecache_present(h, vma, address))
+ return ERR_PTR(-EFAULT);
+ } else if ((vma_is_anonymous(vma) || !vma->vm_ops->fault)) {
return ERR_PTR(-EFAULT);
+ }
+
return NULL;
}

@@ -593,7 +601,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!ptep)
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
pte = ptep_get(ptep);
if (!pte_present(pte))
goto no_page;
@@ -685,7 +693,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
if (!pte_none(pte))
return NULL;
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
}

static struct page *follow_pmd_mask(struct vm_area_struct *vma,
@@ -701,27 +709,27 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
pmd = pmd_offset(pudp, address);
pmdval = pmdp_get_lockless(pmd);
if (pmd_none(pmdval))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
if (!pmd_present(pmdval))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
}
if (likely(!pmd_trans_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_present(*pmd))) {
spin_unlock(ptl);
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
}
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
@@ -752,17 +760,17 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,

pud = pud_offset(p4dp, address);
if (pud_none(*pud))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
}
if (unlikely(pud_bad(*pud)))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

return follow_pmd_mask(vma, address, pud, flags, ctx);
}
@@ -777,10 +785,10 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
p4dp = p4d_offset(pgdp, address);
p4d = READ_ONCE(*p4dp);
if (!p4d_present(p4d))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);
BUILD_BUG_ON(p4d_leaf(p4d));
if (unlikely(p4d_bad(p4d)))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

return follow_pud_mask(vma, address, p4dp, flags, ctx);
}
@@ -830,7 +838,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
pgd = pgd_offset(mm, address);

if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- return no_page_table(vma, flags);
+ return no_page_table(vma, flags, address);

return follow_p4d_mask(vma, address, pgd, flags, ctx);
}
--
2.44.0


2024-03-21 22:10:10

by Peter Xu

Subject: [PATCH v3 08/12] mm/gup: Cache *pudp in follow_pud_mask()

From: Peter Xu <[email protected]>

Introduce "pud_t pud" in the function, so the code won't dereference *pudp
multiple time. Not only because that looks less straightforward, but also
because if the dereference really happened, it's not clear whether there
can be race to see different *pudp values if it's being modified at the
same time.

Acked-by: James Houghton <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 943671736080..a338944e4425 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -753,26 +753,27 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
unsigned int flags,
struct follow_page_context *ctx)
{
- pud_t *pud;
+ pud_t *pudp, pud;
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;

- pud = pud_offset(p4dp, address);
- if (pud_none(*pud))
+ pudp = pud_offset(p4dp, address);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
return no_page_table(vma, flags, address);
- if (pud_devmap(*pud)) {
- ptl = pud_lock(mm, pud);
- page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
+ if (pud_devmap(pud)) {
+ ptl = pud_lock(mm, pudp);
+ page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
return no_page_table(vma, flags, address);
}
- if (unlikely(pud_bad(*pud)))
+ if (unlikely(pud_bad(pud)))
return no_page_table(vma, flags, address);

- return follow_pmd_mask(vma, address, pud, flags, ctx);
+ return follow_pmd_mask(vma, address, pudp, flags, ctx);
}

static struct page *follow_p4d_mask(struct vm_area_struct *vma,
--
2.44.0


2024-03-21 22:10:31

by Peter Xu

Subject: [PATCH v3 09/12] mm/gup: Handle huge pud for follow_pud_mask()

From: Peter Xu <[email protected]>

Teach follow_pud_mask() to be able to handle normal PUD pages like hugetlb.

Rename follow_devmap_pud() to follow_huge_pud() so that it can process
either huge devmap or hugetlb. Move it out of TRANSPARENT_HUGEPAGE_PUD
and huge_memory.c (which relies on CONFIG_THP). Switch to pud_leaf() to
detect both cases in the slow gup.

In the new follow_huge_pud(), take care of possible CoR for hugetlb if
necessary. touch_pud() needs to be moved out of huge_memory.c to be
accessible from gup.c even if !THP.

While at it, optimize the non-present check by adding an early
pud_present() check before taking the pgtable lock, failing follow_page()
early if the PUD is not present: that is required by both devmap and
hugetlb. Use pud_leaf() to also cover the pud_devmap() case.

One more trivial thing to mention: introduce "pud_t pud" in the code
paths along the way, so the code doesn't dereference *pudp multiple times.
Not only does that look less straightforward, but also, if the
dereferences really happened, it is not clear whether there could be a
race that sees different *pudp values while the entry is being modified
concurrently.

Set ctx->page_mask properly for a PUD entry. As a side effect, this patch
should also optimize devmap GUP on PUD so it can jump over the whole PUD
range, though that has not been verified yet. Hugetlb could already do so
prior to this patch.

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/huge_mm.h | 8 -----
mm/gup.c | 70 +++++++++++++++++++++++++++++++++++++++--
mm/huge_memory.c | 47 ++-------------------------
mm/internal.h | 2 ++
4 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3bcdfc7e5d57..39195f7c5269 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -346,8 +346,6 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)

struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, int flags, struct dev_pagemap **pgmap);

vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);

@@ -504,12 +502,6 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
return NULL;
}

-static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
- unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
- return NULL;
-}
-
static inline bool thp_migration_supported(void)
{
return false;
diff --git a/mm/gup.c b/mm/gup.c
index a338944e4425..ae21afb9434e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -525,6 +525,70 @@ static struct page *no_page_table(struct vm_area_struct *vma,
return NULL;
}

+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+ unsigned long addr, pud_t *pudp,
+ int flags, struct follow_page_context *ctx)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ pud_t pud = *pudp;
+ unsigned long pfn = pud_pfn(pud);
+ int ret;
+
+ assert_spin_locked(pud_lockptr(mm, pudp));
+
+ if ((flags & FOLL_WRITE) && !pud_write(pud))
+ return NULL;
+
+ if (!pud_present(pud))
+ return NULL;
+
+ pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
+
+ if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
+ pud_devmap(pud)) {
+ /*
+ * device mapped pages can only be returned if the caller
+ * will manage the page reference count.
+ *
+ * At least one of FOLL_GET | FOLL_PIN must be set, so
+ * assert that here:
+ */
+ if (!(flags & (FOLL_GET | FOLL_PIN)))
+ return ERR_PTR(-EEXIST);
+
+ if (flags & FOLL_TOUCH)
+ touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
+
+ ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
+ if (!ctx->pgmap)
+ return ERR_PTR(-EFAULT);
+ }
+
+ page = pfn_to_page(pfn);
+
+ if (!pud_devmap(pud) && !pud_write(pud) &&
+ gup_must_unshare(vma, flags, page))
+ return ERR_PTR(-EMLINK);
+
+ ret = try_grab_page(page, flags);
+ if (ret)
+ page = ERR_PTR(ret);
+ else
+ ctx->page_mask = HPAGE_PUD_NR - 1;
+
+ return page;
+}
+#else /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+ unsigned long addr, pud_t *pudp,
+ int flags, struct follow_page_context *ctx)
+{
+ return NULL;
+}
+#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+
static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
pte_t *pte, unsigned int flags)
{
@@ -760,11 +824,11 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,

pudp = pud_offset(p4dp, address);
pud = READ_ONCE(*pudp);
- if (pud_none(pud))
+ if (!pud_present(pud))
return no_page_table(vma, flags, address);
- if (pud_devmap(pud)) {
+ if (pud_leaf(pud)) {
ptl = pud_lock(mm, pudp);
- page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
+ page = follow_huge_pud(vma, address, pudp, flags, ctx);
spin_unlock(ptl);
if (page)
return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c77cedf45f3a..f8bd2012bc27 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1363,8 +1363,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, bool write)
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t *pud, bool write)
{
pud_t _pud;

@@ -1376,49 +1376,6 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pud(vma, addr, pud);
}

-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
- unsigned long pfn = pud_pfn(*pud);
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int ret;
-
- assert_spin_locked(pud_lockptr(mm, pud));
-
- if (flags & FOLL_WRITE && !pud_write(*pud))
- return NULL;
-
- if (pud_present(*pud) && pud_devmap(*pud))
- /* pass */;
- else
- return NULL;
-
- if (flags & FOLL_TOUCH)
- touch_pud(vma, addr, pud, flags & FOLL_WRITE);
-
- /*
- * device mapped pages can only be returned if the
- * caller will manage the page reference count.
- *
- * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
-
- ret = try_grab_page(page, flags);
- if (ret)
- page = ERR_PTR(ret);
-
- return page;
-}
-
int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
struct vm_area_struct *vma)
diff --git a/mm/internal.h b/mm/internal.h
index f8b31234c130..63e4f6e001be 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1102,6 +1102,8 @@ int __must_check try_grab_page(struct page *page, unsigned int flags);
/*
* mm/huge_memory.c
*/
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t *pud, bool write);
struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd,
unsigned int flags);
--
2.44.0


2024-03-21 22:10:36

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 10/12] mm/gup: Handle huge pmd for follow_pmd_mask()

From: Peter Xu <[email protected]>

Replace pmd_trans_huge() with pmd_leaf() to also cover pmd_huge() where
enabled.

FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not to hugetlb.

Since follow_trans_huge_pmd() can now process hugetlb pages, rename it to
follow_huge_pmd() to match what it does. Move it into gup.c so it does not
depend on CONFIG_THP.

While at it, move the ctx->page_mask setup into follow_huge_pmd() and
only set it when the page is valid. It was not a bug to set it before
even when GUP failed (page==NULL), because follow_page_mask() callers
always ignore page_mask in that case; but doing it this way makes the
code cleaner.

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 107 ++++++++++++++++++++++++++++++++++++++++++++---
mm/huge_memory.c | 86 +------------------------------------
mm/internal.h | 5 +--
3 files changed, 105 insertions(+), 93 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ae21afb9434e..00cdf4cb0cd4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,93 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,

return page;
}
+
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned int flags)
+{
+ /* If the pmd is writable, we can write to the page. */
+ if (pmd_write(pmd))
+ return true;
+
+ /* Maybe FOLL_FORCE is set to override it? */
+ if (!(flags & FOLL_FORCE))
+ return false;
+
+ /* But FOLL_FORCE has no effect on shared mappings */
+ if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+ return false;
+
+ /* ... or read-only private ones */
+ if (!(vma->vm_flags & VM_MAYWRITE))
+ return false;
+
+ /* ... or already writable ones that just need to take a write fault */
+ if (vma->vm_flags & VM_WRITE)
+ return false;
+
+ /*
+ * See can_change_pte_writable(): we broke COW and could map the page
+ * writable if we have an exclusive anonymous page ...
+ */
+ if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+ return false;
+
+ /* ... and a write-fault isn't required for other reasons. */
+ if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+ return false;
+ return !userfaultfd_huge_pmd_wp(vma, pmd);
+}
+
+static struct page *follow_huge_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd,
+ unsigned int flags,
+ struct follow_page_context *ctx)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t pmdval = *pmd;
+ struct page *page;
+ int ret;
+
+ assert_spin_locked(pmd_lockptr(mm, pmd));
+
+ page = pmd_page(pmdval);
+ VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+ if ((flags & FOLL_WRITE) &&
+ !can_follow_write_pmd(pmdval, page, vma, flags))
+ return NULL;
+
+ /* Avoid dumping huge zero page */
+ if ((flags & FOLL_DUMP) && is_huge_zero_pmd(pmdval))
+ return ERR_PTR(-EFAULT);
+
+ if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
+ return NULL;
+
+ if (!pmd_write(pmdval) && gup_must_unshare(vma, flags, page))
+ return ERR_PTR(-EMLINK);
+
+ VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
+ !PageAnonExclusive(page), page);
+
+ ret = try_grab_page(page, flags);
+ if (ret)
+ return ERR_PTR(ret);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (pmd_trans_huge(pmdval) && (flags & FOLL_TOUCH))
+ touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+ page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+ ctx->page_mask = HPAGE_PMD_NR - 1;
+ VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+
+ return page;
+}
+
#else /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
static struct page *follow_huge_pud(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp,
@@ -587,6 +674,14 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
{
return NULL;
}
+
+static struct page *follow_huge_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd,
+ unsigned int flags,
+ struct follow_page_context *ctx)
+{
+ return NULL;
+}
#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */

static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
@@ -784,31 +879,31 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags, address);
}
- if (likely(!pmd_trans_huge(pmdval)))
+ if (likely(!pmd_leaf(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
return no_page_table(vma, flags, address);

ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_present(*pmd))) {
+ pmdval = *pmd;
+ if (unlikely(!pmd_present(pmdval))) {
spin_unlock(ptl);
return no_page_table(vma, flags, address);
}
- if (unlikely(!pmd_trans_huge(*pmd))) {
+ if (unlikely(!pmd_leaf(pmdval))) {
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
- if (flags & FOLL_SPLIT_PMD) {
+ if (pmd_trans_huge(pmdval) && (flags & FOLL_SPLIT_PMD)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
/* If pmd was left empty, stuff a page table in there quickly */
return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
- page = follow_trans_huge_pmd(vma, address, pmd, flags);
+ page = follow_huge_pmd(vma, address, pmd, flags, ctx);
spin_unlock(ptl);
- ctx->page_mask = HPAGE_PMD_NR - 1;
return page;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f8bd2012bc27..e747dacb5051 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1206,8 +1206,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */

-static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, bool write)
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, bool write)
{
pmd_t _pmd;

@@ -1562,88 +1562,6 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
return pmd_dirty(pmd);
}

-/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
-static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
- struct vm_area_struct *vma,
- unsigned int flags)
-{
- /* If the pmd is writable, we can write to the page. */
- if (pmd_write(pmd))
- return true;
-
- /* Maybe FOLL_FORCE is set to override it? */
- if (!(flags & FOLL_FORCE))
- return false;
-
- /* But FOLL_FORCE has no effect on shared mappings */
- if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
- return false;
-
- /* ... or read-only private ones */
- if (!(vma->vm_flags & VM_MAYWRITE))
- return false;
-
- /* ... or already writable ones that just need to take a write fault */
- if (vma->vm_flags & VM_WRITE)
- return false;
-
- /*
- * See can_change_pte_writable(): we broke COW and could map the page
- * writable if we have an exclusive anonymous page ...
- */
- if (!page || !PageAnon(page) || !PageAnonExclusive(page))
- return false;
-
- /* ... and a write-fault isn't required for other reasons. */
- if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
- return false;
- return !userfaultfd_huge_pmd_wp(vma, pmd);
-}
-
-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
- unsigned long addr,
- pmd_t *pmd,
- unsigned int flags)
-{
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int ret;
-
- assert_spin_locked(pmd_lockptr(mm, pmd));
-
- page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
-
- if ((flags & FOLL_WRITE) &&
- !can_follow_write_pmd(*pmd, page, vma, flags))
- return NULL;
-
- /* Avoid dumping huge zero page */
- if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
- return ERR_PTR(-EFAULT);
-
- if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
- return NULL;
-
- if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
- return ERR_PTR(-EMLINK);
-
- VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
- !PageAnonExclusive(page), page);
-
- ret = try_grab_page(page, flags);
- if (ret)
- return ERR_PTR(ret);
-
- if (flags & FOLL_TOUCH)
- touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
- page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
- VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-
- return page;
-}
-
/* NUMA hinting page fault entry point for trans huge pmds */
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
{
diff --git a/mm/internal.h b/mm/internal.h
index 63e4f6e001be..d47862e6d968 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1104,9 +1104,8 @@ int __must_check try_grab_page(struct page *page, unsigned int flags);
*/
void touch_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, bool write);
-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmd,
- unsigned int flags);
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, bool write);

#ifdef CONFIG_MEMCG
static inline
--
2.44.0


2024-03-21 22:10:46

by Peter Xu

Subject: [PATCH v3 11/12] mm/gup: Handle hugepd for follow_page()

From: Peter Xu <[email protected]>

Hugepd is so far only used on PowerPC, on 4K page size kernels where the
hash MMU is used. follow_page_mask() used to leverage hugetlb APIs to
access hugepd entries. Teach follow_page_mask() itself to handle hugepd.

With the previous refactoring of the fast-gup gup_huge_pd(), most of the
code can easily be leveraged. Some of it is not needed for follow_page();
for example, gup_hugepte() tries to detect pgtable entry changes, which
can never happen with slow gup (which holds the pgtable lock), but it does
no harm to check.

Since follow_page() only ever fetches one page, setting the end to
"address + PAGE_SIZE" should suffice. We will still do the pgtable walk
only once for each hugetlb page, by setting ctx->page_mask properly.
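
As an illustrative example (the sizes are examples only, assuming 16MB
hugetlb pages on a 4K-page Power8 hash kernel): huge_page_order(h) is 12
there, so ctx->page_mask becomes (1U << 12) - 1 = 4095, and
__get_user_pages() can fill in all 4096 subpages of the huge page from
this single walk even though "end" was only address + PAGE_SIZE.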

One thing worth mentioning is that some levels of the pgtable _bad()
helpers will report is_hugepd() entries as TRUE on Power8 hash MMUs. I
think it at least applies to the PUD on Power8 with 4K page size, meaning
that feeding a hugepd entry to pud_bad() will report a false positive.
Let's leave that for now, because it can be arch-specific and I am a bit
disinclined to touch it. In this patch it is not a problem, as long as
hugepd is detected before any bad pgtable entries.

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 66 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 00cdf4cb0cd4..43a2e0a203cd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -30,6 +30,11 @@ struct follow_page_context {
unsigned int page_mask;
};

+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx);
+
static inline void sanity_check_pinned_pages(struct page **pages,
unsigned long npages)
{
@@ -871,6 +876,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return no_page_table(vma, flags, address);
if (!pmd_present(pmdval))
return no_page_table(vma, flags, address);
+ if (unlikely(is_hugepd(__hugepd(pmd_val(pmdval)))))
+ return follow_hugepd(vma, __hugepd(pmd_val(pmdval)),
+ address, PMD_SHIFT, flags, ctx);
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
@@ -921,6 +929,9 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
pud = READ_ONCE(*pudp);
if (!pud_present(pud))
return no_page_table(vma, flags, address);
+ if (unlikely(is_hugepd(__hugepd(pud_val(pud)))))
+ return follow_hugepd(vma, __hugepd(pud_val(pud)),
+ address, PUD_SHIFT, flags, ctx);
if (pud_leaf(pud)) {
ptl = pud_lock(mm, pudp);
page = follow_huge_pud(vma, address, pudp, flags, ctx);
@@ -944,10 +955,13 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,

p4dp = p4d_offset(pgdp, address);
p4d = READ_ONCE(*p4dp);
- if (!p4d_present(p4d))
- return no_page_table(vma, flags, address);
BUILD_BUG_ON(p4d_leaf(p4d));
- if (unlikely(p4d_bad(p4d)))
+
+ if (unlikely(is_hugepd(__hugepd(p4d_val(p4d)))))
+ return follow_hugepd(vma, __hugepd(p4d_val(p4d)),
+ address, P4D_SHIFT, flags, ctx);
+
+ if (!p4d_present(p4d) || p4d_bad(p4d))
return no_page_table(vma, flags, address);

return follow_pud_mask(vma, address, p4dp, flags, ctx);
@@ -981,7 +995,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
struct follow_page_context *ctx)
{
- pgd_t *pgd;
+ pgd_t *pgd, pgdval;
struct mm_struct *mm = vma->vm_mm;

ctx->page_mask = 0;
@@ -996,11 +1010,17 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
&ctx->page_mask);

pgd = pgd_offset(mm, address);
+ pgdval = *pgd;

- if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- return no_page_table(vma, flags, address);
+ if (unlikely(is_hugepd(__hugepd(pgd_val(pgdval)))))
+ page = follow_hugepd(vma, __hugepd(pgd_val(pgdval)),
+ address, PGDIR_SHIFT, flags, ctx);
+ else if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+ page = no_page_table(vma, flags, address);
+ else
+ page = follow_p4d_mask(vma, address, pgd, flags, ctx);

- return follow_p4d_mask(vma, address, pgd, flags, ctx);
+ return page;
}

struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
@@ -3037,6 +3057,37 @@ static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,

return 1;
}
+
+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx)
+{
+ struct page *page;
+ struct hstate *h;
+ spinlock_t *ptl;
+ int nr = 0, ret;
+ pte_t *ptep;
+
+ /* Only hugetlb supports hugepd */
+ if (WARN_ON_ONCE(!is_vm_hugetlb_page(vma)))
+ return ERR_PTR(-EFAULT);
+
+ h = hstate_vma(vma);
+ ptep = hugepte_offset(hugepd, addr, pdshift);
+ ptl = huge_pte_lock(h, vma->vm_mm, ptep);
+ ret = gup_huge_pd(hugepd, addr, pdshift, addr + PAGE_SIZE,
+ flags, &page, &nr);
+ spin_unlock(ptl);
+
+ if (ret) {
+ WARN_ON_ONCE(nr != 1);
+ ctx->page_mask = (1U << huge_page_order(h)) - 1;
+ return page;
+ }
+
+ return NULL;
+}
#else
static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
unsigned int pdshift, unsigned long end, unsigned int flags,
@@ -3044,6 +3095,14 @@ static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
{
return 0;
}
+
+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx)
+{
+ return NULL;
+}
#endif /* CONFIG_ARCH_HAS_HUGEPD */

static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
--
2.44.0


2024-03-21 22:11:14

by Peter Xu

Subject: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

From: Peter Xu <[email protected]>

Now follow_page() is ready to handle hugetlb pages in whatever form,
across all architectures. Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference in how the loops run when processing
slow GUP over a large hugetlb range on archs that support
cont_pte/cont_pmd: with the patch applied, each loop of
__get_user_pages() resolves one pgtable entry, rather than relying on the
size of the hugetlb hstate, which may cover multiple entries in one loop.

A quick performance test on an aarch64 VM on an M1 chip shows a 15%
degradation over a tight loop of slow gup after the path switch. That
shouldn't be a problem, because slow gup should not be a hot path for GUP
in general: when the page is commonly present, fast-gup will already
succeed, while when the page is indeed missing and requires a follow-up
page fault, the slow-gup degradation will probably be buried in the fault
paths anyway. It also explains why slow gup for THP used to be very slow
before 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages !=
NULL"") landed; the latter is not part of a performance analysis but a
side benefit. If performance becomes a concern, we can consider handling
CONT_PTE in follow_page().

Until that is justified as necessary, keep everything clean and simple.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 7 ----
mm/gup.c | 15 +++------
mm/hugetlb.c | 71 -----------------------------------------
3 files changed, 5 insertions(+), 88 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 52d9efcf1edf..85e1c9931ae5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -328,13 +328,6 @@ static inline void hugetlb_zap_end(
{
}

-static inline struct page *hugetlb_follow_page_mask(
- struct vm_area_struct *vma, unsigned long address, unsigned int flags,
- unsigned int *page_mask)
-{
- BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
-}
-
static inline int copy_hugetlb_page_range(struct mm_struct *dst,
struct mm_struct *src,
struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index 43a2e0a203cd..2eb5911ba849 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -997,18 +997,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
{
pgd_t *pgd, pgdval;
struct mm_struct *mm = vma->vm_mm;
+ struct page *page;

- ctx->page_mask = 0;
-
- /*
- * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
- * special hugetlb page table walking code. This eliminates the
- * need to check for hugetlb entries in the general walking code.
- */
- if (is_vm_hugetlb_page(vma))
- return hugetlb_follow_page_mask(vma, address, flags,
- &ctx->page_mask);
+ vma_pgtable_walk_begin(vma);

+ ctx->page_mask = 0;
pgd = pgd_offset(mm, address);
pgdval = *pgd;

@@ -1020,6 +1013,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
else
page = follow_p4d_mask(vma, address, pgd, flags, ctx);

+ vma_pgtable_walk_end(vma);
+
return page;
}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index abec04575c89..2e320757501b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6883,77 +6883,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
}
#endif /* CONFIG_USERFAULTFD */

-struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- unsigned int *page_mask)
-{
- struct hstate *h = hstate_vma(vma);
- struct mm_struct *mm = vma->vm_mm;
- unsigned long haddr = address & huge_page_mask(h);
- struct page *page = NULL;
- spinlock_t *ptl;
- pte_t *pte, entry;
- int ret;
-
- hugetlb_vma_lock_read(vma);
- pte = hugetlb_walk(vma, haddr, huge_page_size(h));
- if (!pte)
- goto out_unlock;
-
- ptl = huge_pte_lock(h, mm, pte);
- entry = huge_ptep_get(pte);
- if (pte_present(entry)) {
- page = pte_page(entry);
-
- if (!huge_pte_write(entry)) {
- if (flags & FOLL_WRITE) {
- page = NULL;
- goto out;
- }
-
- if (gup_must_unshare(vma, flags, page)) {
- /* Tell the caller to do unsharing */
- page = ERR_PTR(-EMLINK);
- goto out;
- }
- }
-
- page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
-
- /*
- * Note that page may be a sub-page, and with vmemmap
- * optimizations the page struct may be read only.
- * try_grab_page() will increase the ref count on the
- * head page, so this will be OK.
- *
- * try_grab_page() should always be able to get the page here,
- * because we hold the ptl lock and have verified pte_present().
- */
- ret = try_grab_page(page, flags);
-
- if (WARN_ON_ONCE(ret)) {
- page = ERR_PTR(ret);
- goto out;
- }
-
- *page_mask = (1U << huge_page_order(h)) - 1;
- }
-out:
- spin_unlock(ptl);
-out_unlock:
- hugetlb_vma_unlock_read(vma);
-
- /*
- * Fixup retval for dump requests: if pagecache doesn't exist,
- * don't try to allocate a new page but just skip it.
- */
- if (!page && (flags & FOLL_DUMP) &&
- !hugetlbfs_pagecache_present(h, vma, address))
- page = ERR_PTR(-EFAULT);
-
- return page;
-}
-
long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
--
2.44.0


2024-03-22 12:27:54

by Jason Gunthorpe

Subject: Re: [PATCH v3 04/12] mm: Introduce vma_pgtable_walk_{begin|end}()

On Thu, Mar 21, 2024 at 06:07:54PM -0400, [email protected] wrote:
> From: Peter Xu <[email protected]>
>
> Introduce per-vma begin()/end() helpers for pgtable walks. This is a
> preparation work to merge hugetlb pgtable walkers with generic mm.
>
> The helpers need to be called before and after a pgtable walk, will start
> to be needed if the pgtable walker code supports hugetlb pages. It's a
> hook point for any type of VMA, but for now only hugetlb uses it to
> stablize the pgtable pages from getting away (due to possible pmd
> unsharing).
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Muchun Song <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm.h | 3 +++
> mm/memory.c | 12 ++++++++++++
> 2 files changed, 15 insertions(+)

is_vm_hugetlb_page(vma) seems weirdly named. Regardless

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2024-03-22 12:29:11

by Jason Gunthorpe

Subject: Re: [PATCH v3 05/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

On Thu, Mar 21, 2024 at 06:07:55PM -0400, [email protected] wrote:
> From: Peter Xu <[email protected]>
>
> Hugepd format for GUP is only used in PowerPC with hugetlbfs. There are
> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
> PPC_8XX), however those pages are not candidates for GUP.
>
> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings") added a check to fail gup-fast if there's potential
> risk of violating GUP over writeback file systems. That should never apply
> to hugepd. Considering that hugepd is an old format (and even
> software-only), there's no plan to extend hugepd into other file typed
> memories that is prone to the same issue.
>
> Drop that check, not only because it'll never be true for hugepd per any
> known plan, but also it paves way for reusing the function outside
> fast-gup.
>
> To make sure we'll still remember this issue just in case hugepd will be
> extended to support non-hugetlbfs memories, add a rich comment above
> gup_huge_pd(), explaining the issue with proper references.
>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Lorenzo Stoakes <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/gup.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2024-03-22 13:43:40

by Jason Gunthorpe

Subject: Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

On Thu, Mar 21, 2024 at 06:08:02PM -0400, [email protected] wrote:

> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> a tight loop of slow gup after the path switched. That shouldn't be a
> problem because slow-gup should not be a hot path for GUP in general: when
> page is commonly present, fast-gup will already succeed, while when the
> page is indeed missing and require a follow up page fault, the slow gup
> degrade will probably buried in the fault paths anyway. It also explains
> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> a performance analysis but a side benefit. If the performance will be a
> concern, we can consider handle CONT_PTE in follow_page().

I think this is probably fine for the moment, at least for this
series, as CONT_PTE is still very new.

But it will need to be optimized. "slow" GUP is the only GUP that is
used by FOLL_LONGTERM and it still needs to be optimized because you
can't assume a FOLL_LONGTERM user will be hitting the really slow
fault path. There are enough important cases where it is just reading
already populated page tables, and these days, often with large folios.

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2024-03-22 15:55:28

by Peter Xu

Subject: Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

Jason,

On Fri, Mar 22, 2024 at 10:30:12AM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 06:08:02PM -0400, [email protected] wrote:
>
> > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > a tight loop of slow gup after the path switched. That shouldn't be a
> > problem because slow-gup should not be a hot path for GUP in general: when
> > page is commonly present, fast-gup will already succeed, while when the
> > page is indeed missing and require a follow up page fault, the slow gup
> > degrade will probably buried in the fault paths anyway. It also explains
> > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > a performance analysis but a side benefit. If the performance will be a
> > concern, we can consider handle CONT_PTE in follow_page().
>
> I think this is probably fine for the moment, at least for this
> series, as CONT_PTE is still very new.
>
> But it will need to be optimized. "slow" GUP is the only GUP that is
> used by FOLL_LONGTERM and it still needs to be optimized because you
> can't assume a FOLL_LONGTERM user will be hitting the really slow
> fault path. There are enough important cases where it is just reading
> already populted page tables, and these days, often with large folios.

Ah, I thought FOLL_LONGTERM should work in most cases for fast-gup,
especially for hugetlb, but maybe I missed something? I do see that devmap
skips fast-gup for LONGTERM, and we also have that writeback issue, but none
of those that I can find applies to hugetlb. This might indeed be a problem
if we have hugetlb cont_pte pages that constantly fall back to slow gup.

OTOH, I also agree with you that such batching would be nice to have for
slow-gup; devmap or many fs (excluding shmem/hugetlb) file mappings can
likely benefit from it due to the above. But then that'll be a more
generic issue to solve; IOW, we still don't do that for !hugetlb cont_pte
large folios, before or after this series.

>
> Reviewed-by: Jason Gunthorpe <[email protected]>

Thanks!

--
Peter Xu


2024-03-22 16:09:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

On Fri, Mar 22, 2024 at 11:55:11AM -0400, Peter Xu wrote:
> Jason,
>
> On Fri, Mar 22, 2024 at 10:30:12AM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 21, 2024 at 06:08:02PM -0400, [email protected] wrote:
> >
> > > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > > a tight loop of slow gup after the path switched. That shouldn't be a
> > > problem because slow-gup should not be a hot path for GUP in general: when
> > > page is commonly present, fast-gup will already succeed, while when the
> > > page is indeed missing and require a follow up page fault, the slow gup
> > > degrade will probably buried in the fault paths anyway. It also explains
> > > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > > a performance analysis but a side benefit. If the performance will be a
> > > concern, we can consider handle CONT_PTE in follow_page().
> >
> > I think this is probably fine for the moment, at least for this
> > series, as CONT_PTE is still very new.
> >
> > But it will need to be optimized. "slow" GUP is the only GUP that is
> > used by FOLL_LONGTERM and it still needs to be optimized because you
> > can't assume a FOLL_LONGTERM user will be hitting the really slow
> > fault path. There are enough important cases where it is just reading
> > already populted page tables, and these days, often with large folios.
>
> Ah, I thought FOLL_LONGTERM should work in most cases for fast-gup,
> especially for hugetlb, but maybe I missed something?

Ah, no, this is my bad memory; there was a time when that was true,
but it is not the case now. Oh, it is a really bad memory because it
seems I removed parts of it :)

> I do see that devmap skips fast-gup for LONGTERM, we also have that
> writeback issue but none of those that I can find applies to
> hugetlb. This might be a problem indeed if we have hugetlb cont_pte
> pages that will constantly fallback to slow gup.

Right, DAX would be the main use case I can think of. Today the
intersection of DAX and contig PTE is non-existent, so let's not worry.

> OTOH, I also agree with you that such batching would be nice to have for
> slow-gup, likely devmap or many fs (exclude shmem/hugetlb) file mappings
> can at least benefit from it due to above. But then that'll be a more
> generic issue to solve, IOW, we still don't do that for !hugetlb cont_pte
> large folios, before or after this series.

Right, improving contig pte is going to be a process, and eventually it
will make sense to optimize this regardless of hugetlbfs.

Jason

2024-03-22 16:10:27

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Thu, Mar 21, 2024 at 06:07:50PM -0400, [email protected] wrote:
> From: Peter Xu <[email protected]>
>
> v3:
> - Rebased to latest mm-unstalbe (a824831a082f, of March 21th)
> - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
> pXd_huge() users) with pXd_leaf() [Jason]
> - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> - Use IS_ENABLED() in follow_huge_pud() [Jason]
> - Remove redundant none pud check in follow_pud_mask() [Jason]
>
> rfc: https://lore.kernel.org/r/[email protected]
> v1: https://lore.kernel.org/r/[email protected]
> v2: https://lore.kernel.org/r/[email protected]
>
> The series removes the hugetlb slow gup path after a previous refactor work
> [1], so that slow gup now uses the exact same path to process all kinds of
> memory including hugetlb.
>
> For the long term, we may want to remove most, if not all, call sites of
> huge_pte_offset(). It'll be ideal if that API can be completely dropped
> from arch hugetlb API. This series is one small step towards merging
> hugetlb specific codes into generic mm paths. From that POV, this series
> removes one reference to huge_pte_offset() out of many others.

This remark would be a little easier to understand if it referred to
hugetlb_walk() rather than huge_pte_offset() - the latter is only used to
implement hugetlb_walk() and isn't removed by this series, while a
single hugetlb_walk() call was removed.

Regardless, I think the point is spot on, the direction should be to
make the page table reading generic with minimal/no interaction with
hugetlbfs in the generic code.

After this series I would suggest doing the pagewalk.c stuff as it is
very parallel to GUP slow (indeed it would be amazing to figure out a
way to make GUP slow and pagewalk.c use the same code without a
performance cost)

Some of the other core mm callers don't look too bad either; getting
to the point where hugetlb_walk() is internal to hugetlb.c would be a
nice step that looks reasonably sized.

> One goal of such a route is that we can reconsider merging hugetlb features
> like High Granularity Mapping (HGM). It was not accepted in the past
> because it may add lots of hugetlb specific codes and make the mm code even
> harder to maintain. With a merged codeset, features like HGM can hopefully
> share some code with THP, legacy (PMD+) or modern (continuous PTEs).

Yeah, if all the special hugetlb stuff is using generic arch code and
generic page walkers (maybe with that special vma locking hook) it is
much easier to swallow than adding yet another special class of code
to all the page walkers.

> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup. Due to the specialty of
> hugepd to be software-only solution (no hardware recognizes the hugepd
> format, so it's purely artificial structures), there's chance we can merge
> some or all hugepd formats with cont_pte in the future. That question is
> yet unsettled from Power side to have an acknowledgement.

At a minimum, I think we have a concurrence that the reading of the
hugepd entries should be fine through the standard contig pte APIs and
so all the page walkers doing read side operations could stop having
special hugepd code. It is just an implementation of contig pte with
the new property that the size of a PTE can be larger than a PMD
entry.

If, or how much, the hugepd write side remains special is the open
question, I think.

> this series, I kept the hugepd handling because we may still need to do so
> before getting a clearer picture of the future of hugepd. The other reason
> is simply that we did it already for fast-gup and most codes are still
> around to be reused. It'll make more sense to keep slow/fast gup behave
> the same before a decision is made to remove hugepd.

Yeah, I think that is right for this series. Lets get this done and
then try to remove hugepd read side.

Thanks,
Jason

2024-03-22 17:15:08

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v3 03/12] mm: Make HPAGE_PXD_* macros even if !THP

Hi Peter,

On Thu, 21 Mar 2024 18:07:53 -0400 [email protected] wrote:

> From: Peter Xu <[email protected]>
>
> These macros can be helpful when we plan to merge hugetlb code into generic
> code. Move them out and define them even if !THP.
>
> We actually already defined HPAGE_PMD_NR for other reasons even if !THP.
> Reorganize these macros.
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/huge_mm.h | 17 ++++++-----------
> 1 file changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index de0c89105076..3bcdfc7e5d57 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -64,9 +64,6 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
> enum transparent_hugepage_flag flag);
> extern struct kobj_attribute shmem_enabled_attr;
>
> -#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> -#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> -
> /*
> * Mask of all large folio orders supported for anonymous THP; all orders up to
> * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> @@ -87,14 +84,19 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
> (!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order)))
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define HPAGE_PMD_SHIFT PMD_SHIFT
> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
> #define HPAGE_PUD_SHIFT PUD_SHIFT
> #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
> +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)

I just found that the latest mm-unstable fails one of my build
configurations[1] with the below error. 'git bisect' says this is the first
patch in the set that started the failure. I haven't looked in depth, but
just reporting first.

In file included from .../include/linux/mm.h:1115,
from .../mm/vmstat.c:14:
.../mm/vmstat.c: In function 'zoneinfo_show_print':
.../include/linux/huge_mm.h:87:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
87 | #define HPAGE_PMD_SHIFT PMD_SHIFT
| ^~~~~~~~~

[1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh


Thanks,
SJ

[...]

2024-03-22 20:48:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

On Thu, 21 Mar 2024 18:08:02 -0400 [email protected] wrote:

> From: Peter Xu <[email protected]>
>
> Now follow_page() is ready to handle hugetlb pages in whatever form, and
> over all architectures. Switch to the generic code path.
>
> Time to retire hugetlb_follow_page_mask(), following the previous
> retirement of follow_hugetlb_page() in 4849807114b8.
>
> There may be a slight difference of how the loops run when processing slow
> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> loop of __get_user_pages() will resolve one pgtable entry with the patch
> applied, rather than relying on the size of hugetlb hstate, the latter may
> cover multiple entries in one loop.
>
> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> a tight loop of slow gup after the path switched. That shouldn't be a
> problem because slow-gup should not be a hot path for GUP in general: when
> page is commonly present, fast-gup will already succeed, while when the
> page is indeed missing and require a follow up page fault, the slow gup
> degrade will probably buried in the fault paths anyway. It also explains
> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> a performance analysis but a side benefit. If the performance will be a
> concern, we can consider handle CONT_PTE in follow_page().
>
> Before that is justified to be necessary, keep everything clean and simple.
>

mm/gup.c:33:21: warning: 'follow_hugepd' declared 'static' but never defined [-Wunused-function]
33 | static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
| ^~~~~~~~~~~~~

--- a/mm/gup.c~mm-gup-handle-hugepd-for-follow_page-fix
+++ a/mm/gup.c
@@ -30,10 +30,12 @@ struct follow_page_context {
unsigned int page_mask;
};

+#ifdef CONFIG_HAVE_FAST_GUP
static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
unsigned long addr, unsigned int pdshift,
unsigned int flags,
struct follow_page_context *ctx);
+#endif

static inline void sanity_check_pinned_pages(struct page **pages,
unsigned long npages)
_


This looks inelegant.

That's two build issues so far. Please be more expansive in the
Kconfig variations when testing. Especially when mucking with pgtable
macros.



2024-03-23 00:30:40

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 03/12] mm: Make HPAGE_PXD_* macros even if !THP

On Fri, Mar 22, 2024 at 10:14:56AM -0700, SeongJae Park wrote:
> Hi Peter,

Hi, SeongJae,

>
> On Thu, 21 Mar 2024 18:07:53 -0400 [email protected] wrote:
>
> > From: Peter Xu <[email protected]>
> >
> > These macros can be helpful when we plan to merge hugetlb code into generic
> > code. Move them out and define them even if !THP.
> >
> > We actually already defined HPAGE_PMD_NR for other reasons even if !THP.
> > Reorganize these macros.
> >
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > Reviewed-by: Jason Gunthorpe <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/huge_mm.h | 17 ++++++-----------
> > 1 file changed, 6 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index de0c89105076..3bcdfc7e5d57 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -64,9 +64,6 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
> > enum transparent_hugepage_flag flag);
> > extern struct kobj_attribute shmem_enabled_attr;
> >
> > -#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > -#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > -
> > /*
> > * Mask of all large folio orders supported for anonymous THP; all orders up to
> > * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> > @@ -87,14 +84,19 @@ extern struct kobj_attribute shmem_enabled_attr;
> > #define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
> > (!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order)))
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > #define HPAGE_PMD_SHIFT PMD_SHIFT
> > #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> > #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> > +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >
> > #define HPAGE_PUD_SHIFT PUD_SHIFT
> > #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> > #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
> > +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
> > +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
>
> I just found latest mm-unstable fails one of my build configurations[1] with
> below error. 'git bisect' says this is the first patch set started the
> failure. I haven't looked in deep, but just reporting first.
>
> In file included from .../include/linux/mm.h:1115,
> from .../mm/vmstat.c:14:
> .../mm/vmstat.c: In function 'zoneinfo_show_print':
> .../include/linux/huge_mm.h:87:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
> 87 | #define HPAGE_PMD_SHIFT PMD_SHIFT
> | ^~~~~~~~~
>
> [1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh

Apologies for the issue. This is caused by !CONFIG_MMU, I think.

I thought this would be fairly easy to fix by putting these macros under
CONFIG_PGTABLE_HAS_HUGE_LEAVES, however when doing so I found another
issue that violates this rule: mm/vmstat.c references HPAGE_PMD_NR even
though vmstat_item_print_in_thp() only intends to guard it with
CONFIG_THP.

/home/peterx/git/linux/mm/vmstat.c: In function 'zoneinfo_show_print':
/home/peterx/git/linux/mm/vmstat.c:1689:42: error: 'HPAGE_PMD_NR' undeclared (first use in this function)
1689 | pages /= HPAGE_PMD_NR;
| ^~~~~~~~~~~~
/home/peterx/git/linux/mm/vmstat.c:1689:42: note: each undeclared identifier is reported only once for each function it appears in
CC drivers/tty/tty_port.o
/home/peterx/git/linux/mm/vmstat.c: In function 'vmstat_start':
/home/peterx/git/linux/mm/vmstat.c:1822:33: error: 'HPAGE_PMD_NR' undeclared (first use in this function)
1822 | v[i] /= HPAGE_PMD_NR;
| ^~~~~~~~~~~~
make[4]: *** [/home/peterx/git/linux/scripts/Makefile.build:243: mm/vmstat.o] Error 1

static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
{
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return false;
...
}

I think the problem is that vmstat_item_print_in_thp() uses IS_ENABLED();
that won't stop the compiler from parsing the body of the "if", so it'll
still try to resolve the HPAGE_PMD_NR macro.

It means I may need to further change vmstat_item_print_in_thp() to make
this work in a clean way, by properly switching to an #ifdef.
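
To illustrate that compiler behavior with a minimal sketch (assuming the
usual kernel header context; the helper names are made up, only
IS_ENABLED() and HPAGE_PMD_NR are real):

/*
 * Sketch only: IS_ENABLED() folds to a constant 0/1, but the guarded
 * body is still parsed and compiled, so every identifier in it must be
 * defined even with CONFIG_TRANSPARENT_HUGEPAGE=n; the dead branch is
 * only dropped later by the optimizer.
 */
static inline unsigned long thp_nr_is_enabled_style(void)
{
	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return 1;
	return HPAGE_PMD_NR;	/* must still be defined for THP=n */
}

/* An #ifdef removes the reference entirely when THP is disabled. */
static inline unsigned long thp_nr_ifdef_style(void)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	return HPAGE_PMD_NR;
#else
	return 1;
#endif
}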

For that I'll need to post a formal patch and add people to review. I'll
keep you posted.

Side note: thank you for your script, just to mention make.cross has been
moved to kbuild/, and it'll also need kbuild.sh now to work. So you may
want to fix your test script (and it worked for you because you kept the
old make.cross around), like:

wget https://raw.githubusercontent.com/intel/lkp-tests/master/kbuild/make.cross -O ./bin/make.cross
wget https://raw.githubusercontent.com/intel/lkp-tests/master/kbuild/kbuild.sh -O ./bin/kbuild.sh

Thanks,

--
Peter Xu


2024-03-23 00:46:21

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

On Fri, Mar 22, 2024 at 01:48:18PM -0700, Andrew Morton wrote:
> On Thu, 21 Mar 2024 18:08:02 -0400 [email protected] wrote:
>
> > From: Peter Xu <[email protected]>
> >
> > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > over all architectures. Switch to the generic code path.
> >
> > Time to retire hugetlb_follow_page_mask(), following the previous
> > retirement of follow_hugetlb_page() in 4849807114b8.
> >
> > There may be a slight difference of how the loops run when processing slow
> > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > applied, rather than relying on the size of hugetlb hstate, the latter may
> > cover multiple entries in one loop.
> >
> > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > a tight loop of slow gup after the path switched. That shouldn't be a
> > problem because slow-gup should not be a hot path for GUP in general: when
> > page is commonly present, fast-gup will already succeed, while when the
> > page is indeed missing and require a follow up page fault, the slow gup
> > degrade will probably buried in the fault paths anyway. It also explains
> > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > a performance analysis but a side benefit. If the performance will be a
> > concern, we can consider handle CONT_PTE in follow_page().
> >
> > Before that is justified to be necessary, keep everything clean and simple.
> >
>
> mm/gup.c:33:21: warning: 'follow_hugepd' declared 'static' but never defined [-Wunused-function]
> 33 | static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
> | ^~~~~~~~~~~~~
>
> --- a/mm/gup.c~mm-gup-handle-hugepd-for-follow_page-fix
> +++ a/mm/gup.c
> @@ -30,10 +30,12 @@ struct follow_page_context {
> unsigned int page_mask;
> };
>
> +#ifdef CONFIG_HAVE_FAST_GUP
> static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
> unsigned long addr, unsigned int pdshift,
> unsigned int flags,
> struct follow_page_context *ctx);
> +#endif
>
> static inline void sanity_check_pinned_pages(struct page **pages,
> unsigned long npages)
> _
>
>
> This looks inelegant.
>
> That's two build issues so far. Please be more expansive in the
> Kconfig variations when testing. Especially when mucking with pgtable
> macros.

Andrew,

Apologies for that, and also for a slightly late response. Yeah, it's time
I set up my own tooling to do serious build tests, and I'll at least start
covering a few error-prone configs/archs with that.

I was trying to rely on the build bot in many previous cases like this, as
it was quite useful for covering build issues without investing in my own
test setup, but for some reason it seems to have retired and stopped working
for a while. Maybe I shouldn't have relied on it at all.

For this specific issue, I'm not sure CONFIG_HAVE_FAST_GUP is the right
guard, as follow_hugepd() is used in slow gup, not fast. So maybe we can put
that under CONFIG_MMU below that code (and I think we can drop "static" too,
as I don't think it's useful). My version of the fixup is attached at the
end of the email, and I verified it on an m68k build.

I do plan to post a small fixup series to fix these issues (so far it may
contain 1 formal patch to touch up vmstat_item_print_in_thp, and 2 fixups
where I'll mark the subject with "fixup!" properly). Either you can pick
up the below, or you can wait for my small patchset; it should be there
either today or tomorrow.

Thanks,

===8<===
diff --git a/mm/gup.c b/mm/gup.c
index 4cd349390477..a2ed8203495a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -30,11 +30,6 @@ struct follow_page_context {
unsigned int page_mask;
};

-static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
- unsigned long addr, unsigned int pdshift,
- unsigned int flags,
- struct follow_page_context *ctx);
-
static inline void sanity_check_pinned_pages(struct page **pages,
unsigned long npages)
{
@@ -505,6 +500,12 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
}

#ifdef CONFIG_MMU
+
+struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx);
+
static struct page *no_page_table(struct vm_area_struct *vma,
unsigned int flags, unsigned long address)
{
===8<===

--
Peter Xu


2024-03-23 01:05:53

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v3 03/12] mm: Make HPAGE_PXD_* macros even if !THP

Hi Peter,

On Fri, 22 Mar 2024 20:30:24 -0400 Peter Xu <[email protected]> wrote:

> On Fri, Mar 22, 2024 at 10:14:56AM -0700, SeongJae Park wrote:
> > Hi Peter,
>
> Hi, SeongJae,
>
> >
> > On Thu, 21 Mar 2024 18:07:53 -0400 [email protected] wrote:
> >
> > > From: Peter Xu <[email protected]>
> > >
> > > These macros can be helpful when we plan to merge hugetlb code into generic
> > > code. Move them out and define them even if !THP.
> > >
> > > We actually already defined HPAGE_PMD_NR for other reasons even if !THP.
> > > Reorganize these macros.
> > >
> > > Reviewed-by: Christoph Hellwig <[email protected]>
> > > Reviewed-by: Jason Gunthorpe <[email protected]>
> > > Signed-off-by: Peter Xu <[email protected]>
> > > ---
> > > include/linux/huge_mm.h | 17 ++++++-----------
> > > 1 file changed, 6 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index de0c89105076..3bcdfc7e5d57 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -64,9 +64,6 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
> > > enum transparent_hugepage_flag flag);
> > > extern struct kobj_attribute shmem_enabled_attr;
> > >
> > > -#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > > -#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > > -
> > > /*
> > > * Mask of all large folio orders supported for anonymous THP; all orders up to
> > > * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> > > @@ -87,14 +84,19 @@ extern struct kobj_attribute shmem_enabled_attr;
> > > #define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
> > > (!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order)))
> > >
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > #define HPAGE_PMD_SHIFT PMD_SHIFT
> > > #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> > > #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> > > +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > > +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > >
> > > #define HPAGE_PUD_SHIFT PUD_SHIFT
> > > #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> > > #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
> > > +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
> > > +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
> >
> > I just found latest mm-unstable fails one of my build configurations[1] with
> > below error. 'git bisect' says this is the first patch set started the
> > failure. I haven't looked in deep, but just reporting first.
> >
> > In file included from .../include/linux/mm.h:1115,
> > from .../mm/vmstat.c:14:
> > .../mm/vmstat.c: In function 'zoneinfo_show_print':
> > .../include/linux/huge_mm.h:87:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
> > 87 | #define HPAGE_PMD_SHIFT PMD_SHIFT
> > | ^~~~~~~~~
> >
> > [1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh
>
> Apologies for the issue.

No problem at all, this isn't really blocking anything :)

> This is caused by !CONFIG_MMU, I think.
>
> I thought this would be fairly easy to fix by putting these macros under
> CONFIG_PGTABLE_HAS_HUGE_LEAVES, however when doing this I could have found
> some other issue that violates this rule.. mm/vmstat.c has referenced
> HPAGE_PMD_NR even if vmstat_item_print_in_thp() wanted to guard it only
> with CONFIG_THP.
>
> /home/peterx/git/linux/mm/vmstat.c: In function 'zoneinfo_show_print':
> /home/peterx/git/linux/mm/vmstat.c:1689:42: error: 'HPAGE_PMD_NR' undeclared (first use in this function)
> 1689 | pages /= HPAGE_PMD_NR;
> | ^~~~~~~~~~~~
> /home/peterx/git/linux/mm/vmstat.c:1689:42: note: each undeclared identifier is reported only once for each function it appears in
> CC drivers/tty/tty_port.o
> /home/peterx/git/linux/mm/vmstat.c: In function 'vmstat_start':
> /home/peterx/git/linux/mm/vmstat.c:1822:33: error: 'HPAGE_PMD_NR' undeclared (first use in this function)
> 1822 | v[i] /= HPAGE_PMD_NR;
> | ^~~~~~~~~~~~
> make[4]: *** [/home/peterx/git/linux/scripts/Makefile.build:243: mm/vmstat.o] Error 1
>
> static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
> {
> if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> return false;
> ...
> }
>
> I think the problem is vmstat_item_print_in_thp() uses IS_ENABLED() however
> that won't stop compiler from looking into the "if".. so it'll still try to
> find the HPAGE_PMD_NR macro.
>
> It means, I may need to further change vmstat_item_print_in_thp() to make
> this work in the clean way.. by properly switching to a #ifdef.
>
> For that I'll need to post a formal patch and add people to review. I'll
> keep you posted.

Thank you for this kind explanation, all makes sense to me. Looking forward to
the patch.

>
> Side note: thank you for your script, just to mention make.cross has been
> moved to kbuild/, and it'll also need kbuild.sh now to work. So you may
> want to fix your test script (and it worked for you because you kept the
> old make.cross around), like:
>
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/kbuild/make.cross -O ./bin/make.cross
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/kbuild/kbuild.sh -O ./bin/kbuild.sh

And thank you so much for this nice suggestion. I'll revisit the script soon.


Thanks,
SJ

>
> Thanks,
>
> --
> Peter Xu

2024-03-23 02:15:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

On Fri, Mar 22, 2024 at 08:45:59PM -0400, Peter Xu wrote:
> On Fri, Mar 22, 2024 at 01:48:18PM -0700, Andrew Morton wrote:
> > On Thu, 21 Mar 2024 18:08:02 -0400 [email protected] wrote:
> >
> > > From: Peter Xu <[email protected]>
> > >
> > > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > > over all architectures. Switch to the generic code path.
> > >
> > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > retirement of follow_hugetlb_page() in 4849807114b8.
> > >
> > > There may be a slight difference of how the loops run when processing slow
> > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > > applied, rather than relying on the size of hugetlb hstate, the latter may
> > > cover multiple entries in one loop.
> > >
> > > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > > a tight loop of slow gup after the path switched. That shouldn't be a
> > > problem because slow-gup should not be a hot path for GUP in general: when
> > > page is commonly present, fast-gup will already succeed, while when the
> > > page is indeed missing and require a follow up page fault, the slow gup
> > > degrade will probably buried in the fault paths anyway. It also explains
> > > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > > a performance analysis but a side benefit. If the performance will be a
> > > concern, we can consider handle CONT_PTE in follow_page().
> > >
> > > Before that is justified to be necessary, keep everything clean and simple.
> > >
> >
> > mm/gup.c:33:21: warning: 'follow_hugepd' declared 'static' but never defined [-Wunused-function]
> > 33 | static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
> > | ^~~~~~~~~~~~~
> >
> > --- a/mm/gup.c~mm-gup-handle-hugepd-for-follow_page-fix
> > +++ a/mm/gup.c
> > @@ -30,10 +30,12 @@ struct follow_page_context {
> > unsigned int page_mask;
> > };
> >
> > +#ifdef CONFIG_HAVE_FAST_GUP
> > static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
> > unsigned long addr, unsigned int pdshift,
> > unsigned int flags,
> > struct follow_page_context *ctx);
> > +#endif
> >
> > static inline void sanity_check_pinned_pages(struct page **pages,
> > unsigned long npages)
> > _
> >
> >
> > This looks inelegant.
> >
> > That's two build issues so far. Please be more expansive in the
> > Kconfig variations when testing. Especially when mucking with pgtable
> > macros.
>
> Andrew,
>
> Apologies for that, and also for a slightly late response. Yeah it's time
> I'll need my set of things to do serious build tests, and I'll at least
> start to cover a few error prone configs/archs in with that.
>
> I was trying to rely on the build bot in many of previous such cases, as
> that was quite useful to me to cover many build issues without investing my
> own test setups, but I think for some reason it retired and stopped working
> for a while. Maybe I shouldn't have relied on it at all.
>
> For this specific issue, I'm not sure if CONFIG_HAVE_FAST_GUP is proper?
> As follow_hugepd() is used in slow gup not fast. So maybe we can put that
> under CONFIG_MMU below that code (and I think we can drop "static" too as I
> don't think it's anything useful). My version of fixup attached at the end

The static is useful; the patch below did pass on m68k but won't on
x86.. please ignore it.

> of email, and I verified it on m68k build.
>
> I do plan to post a small fixup series to fix these issues (so far it may
> contain 1 formal patch to touch up vmstat_item_print_in_thp, and 2 fixups
> where I'll mark the subject with "fixup!" properly). Either you can pick
> up below or you can wait for my small patchset, should be there either
> today or tomorrow.

I changed plan here too; I found more users of HPAGE_PMD_NR assuming it's
defined even if !CONFIG_MMU. That's weird, as !CONFIG_MMU doesn't even
define PMD_SHIFT... To fix this I decided to use the old trick of using
BUILD_BUG(), like it used to work before; frankly I don't know how that
didn't throw warnings, but I'll make sure it passes all known builds (ps: I
still haven't got my build harness ready, so that will still be limited but
should solve the known issues).
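
For reference, a minimal sketch of the old BUILD_BUG() style being referred
to (the exact macro list in the fixup may differ):

/*
 * !THP fallback trick: the macros stay defined, so code guarded only by
 * IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) still compiles, while any use
 * that survives dead-code elimination in a THP=n build fails at compile
 * time via BUILD_BUG().
 */
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT	({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE	({ BUILD_BUG(); 0; })
#define HPAGE_PMD_NR	({ BUILD_BUG(); 0; })
#endif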

In short: please wait for my fixup series. Thanks.

>
> Thanks,
>
> ===8<===
> diff --git a/mm/gup.c b/mm/gup.c
> index 4cd349390477..a2ed8203495a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -30,11 +30,6 @@ struct follow_page_context {
> unsigned int page_mask;
> };
>
> -static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
> - unsigned long addr, unsigned int pdshift,
> - unsigned int flags,
> - struct follow_page_context *ctx);
> -
> static inline void sanity_check_pinned_pages(struct page **pages,
> unsigned long npages)
> {
> @@ -505,6 +500,12 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
> }
>
> #ifdef CONFIG_MMU
> +
> +struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
> + unsigned long addr, unsigned int pdshift,
> + unsigned int flags,
> + struct follow_page_context *ctx);
> +
> static struct page *no_page_table(struct vm_area_struct *vma,
> unsigned int flags, unsigned long address)
> {
> ===8<===
>
> --
> Peter Xu

--
Peter Xu


2024-03-25 16:44:35

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2



Le 21/03/2024 à 23:07, [email protected] a écrit :
> From: Peter Xu <[email protected]>
>
> v3:
> - Rebased to latest mm-unstalbe (a824831a082f, of March 21th)
> - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
> pXd_huge() users) with pXd_leaf() [Jason]
> - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> - Use IS_ENABLED() in follow_huge_pud() [Jason]
> - Remove redundant none pud check in follow_pud_mask() [Jason]
>
> rfc: https://lore.kernel.org/r/[email protected]
> v1: https://lore.kernel.org/r/[email protected]
> v2: https://lore.kernel.org/r/[email protected]
>
> The series removes the hugetlb slow gup path after a previous refactor work
> [1], so that slow gup now uses the exact same path to process all kinds of
> memory including hugetlb.
>
> For the long term, we may want to remove most, if not all, call sites of
> huge_pte_offset(). It'll be ideal if that API can be completely dropped
> from arch hugetlb API. This series is one small step towards merging
> hugetlb specific codes into generic mm paths. From that POV, this series
> removes one reference to huge_pte_offset() out of many others.
>
> One goal of such a route is that we can reconsider merging hugetlb features
> like High Granularity Mapping (HGM). It was not accepted in the past
> because it may add lots of hugetlb specific codes and make the mm code even
> harder to maintain. With a merged codeset, features like HGM can hopefully
> share some code with THP, legacy (PMD+) or modern (continuous PTEs).
>
> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup. Due to the specialty of
> hugepd to be software-only solution (no hardware recognizes the hugepd
> format, so it's purely artificial structures), there's chance we can merge
> some or all hugepd formats with cont_pte in the future. That question is
> yet unsettled from Power side to have an acknowledgement. As of now for
> this series, I kept the hugepd handling because we may still need to do so
> before getting a clearer picture of the future of hugepd. The other reason
> is simply that we did it already for fast-gup and most codes are still
> around to be reused. It'll make more sense to keep slow/fast gup behave
> the same before a decision is made to remove hugepd.
>

It is not true that hugepd is a software-only solution. Powerpc 8xx HW
matches the hugepd topology for 8M pages.

Christophe



2024-03-25 19:09:12

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Fri, Mar 22, 2024 at 01:10:00PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 06:07:50PM -0400, [email protected] wrote:
> > From: Peter Xu <[email protected]>
> >
> > v3:
> > - Rebased to latest mm-unstalbe (a824831a082f, of March 21th)
> > - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
> > pXd_huge() users) with pXd_leaf() [Jason]
> > - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> > - Use IS_ENABLED() in follow_huge_pud() [Jason]
> > - Remove redundant none pud check in follow_pud_mask() [Jason]
> >
> > rfc: https://lore.kernel.org/r/[email protected]
> > v1: https://lore.kernel.org/r/[email protected]
> > v2: https://lore.kernel.org/r/[email protected]
> >
> > The series removes the hugetlb slow gup path after a previous refactor work
> > [1], so that slow gup now uses the exact same path to process all kinds of
> > memory including hugetlb.
> >
> > For the long term, we may want to remove most, if not all, call sites of
> > huge_pte_offset(). It'll be ideal if that API can be completely dropped
> > from arch hugetlb API. This series is one small step towards merging
> > hugetlb specific codes into generic mm paths. From that POV, this series
> > removes one reference to huge_pte_offset() out of many others.
>
> This remark would be a little easier to understand if you refer to
> hugetlb_walk() not huge_pte_offset() - the latter is only used to
> implement hugetlb_walk() and isn't removed by this series, while a
> single hugetlb_walk() was removed.

Right. Here huge_pte_offset() is the arch API that I hope we can remove;
hugetlb_walk() is simply the wrapper.

>
> Regardless, I think the point is spot on, the direction should be to
> make the page table reading generic with minimal/no interaction with
> hugetlbfs in the generic code.

Yes, and I also like your term of calling them "pgtable readers". It's a
better way to describe the difference in that regard between
huge_pte_offset() vs. huge_pte_alloc(). That's exactly my goal, that we
should start with the "readers".

The writers might change semantics when merged, and can be more
challenging; I'll need to look into the details of each one, like the page
fault handlers. Such work may need to be analyzed case by case, and this GUP
part is definitely the low hanging fruit compared to the rest.

>
> After this series I would suggest doing the pagewalk.c stuff as it is
> very parallel to GUP slow (indeed it would be amazing to figure out a
> way to make GUP slow and pagewalk.c use the same code without a
> performance cost)

Yes. I hope there shouldn't be much perf degradation; I can do some more
measurements too when getting there. And btw, IIUC the major challenge of
pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
be that hard if that's the sole purpose. The better way to go is to remove
mm_walk_ops.hugetlb_entry() altogether, which will cause a change in all
callers; that's "the challenge".. pretty much labor work, not a major
technical challenge it seems. Not sure if that's good news or bad..

One thing I'll soon look into is allowing huge mappings for PFNMAP; there's
a request from the VFIO side for MMIO. Dropping mm_walk_ops.hugetlb_entry()
seems to be good for all such purposes; well, I may need to bite the bullet
here.. for either of those purposes to move on.

>
> Some of the other core mm callers don't look too bad either, getting
> to the point where hugetlb_walk() is internal to hugetlb.c would be a
> nice step that looks reasonable size.

Agree.

>
> > One goal of such a route is that we can reconsider merging hugetlb features
> > like High Granularity Mapping (HGM). It was not accepted in the past
> > because it may add lots of hugetlb specific codes and make the mm code even
> > harder to maintain. With a merged codeset, features like HGM can hopefully
> > share some code with THP, legacy (PMD+) or modern (continuous PTEs).
>
> Yeah, if all the special hugetlb stuff is using generic arch code and
> generic page walkers (maybe with that special vma locking hook) it is
> much easier to swallow than adding yet another special class of code
> to all the page walkers.
>
> > To make it work, the generic slow gup code will need to at least understand
> > hugepd, which is already done like so in fast-gup. Due to the specialty of
> > hugepd to be software-only solution (no hardware recognizes the hugepd
> > format, so it's purely artificial structures), there's chance we can merge
> > some or all hugepd formats with cont_pte in the future. That question is
> > yet unsettled from Power side to have an acknowledgement.
>
> At a minimum, I think we have a concurrence that the reading of the
> hugepd entries should be fine through the standard contig pte APIs and
> so all the page walkers doing read side operations could stop having
> special hugepd code. It is just an implementation of contig pte with
> the new property that the size of a PTE can be larger than a PMD
> entry.
>
> If, or how much, the hugepd write side remains special is the open
> question, I think.

It seems balls are rolling in that aspect, I haven't looked in depth there
in Christophe's series but it's great to have started!

>
> > this series, I kept the hugepd handling because we may still need to do so
> > before getting a clearer picture of the future of hugepd. The other reason
> > is simply that we did it already for fast-gup and most codes are still
> > around to be reused. It'll make more sense to keep slow/fast gup behave
> > the same before a decision is made to remove hugepd.
>
> Yeah, I think that is right for this series. Lets get this done and
> then try to remove hugepd read side.

Thanks a bunch for the reviews.

--
Peter Xu


2024-03-26 14:14:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Mon, Mar 25, 2024 at 02:58:48PM -0400, Peter Xu wrote:

> > This remark would be a little easier to understand if you refer to
> > hugetlb_walk() not huge_pte_offset() - the latter is only used to
> > implement hugetlb_walk() and isn't removed by this series, while a
> > single hugetlb_walk() was removed.
>
> Right. Here huge_pte_offset() is the arch API that I hope we can remove,
> the hugetlb_walk() is simply the wrapper.

But arguably hugetlb_walk is the thing that should go..

In the generic code we should really try to get away from the weird
hugetlb abuse of treating every level as a pte_t. That is not how the
generic page table API works and that weirdness is part of what
motivates the arch API to supply special functions for reading. Not
every arch can just cast every level to a pte_t and still work.

But that weirdness is also saving a lot of code, so something else needs
to be thought up..

> > Regardless, I think the point is spot on, the direction should be to
> > make the page table reading generic with minimal/no interaction with
> > hugetlbfs in the generic code.
>
> Yes, and I also like your terms on calling them "pgtable readers". It's a
> better way to describe the difference in that regard between
> huge_pte_offset() v.s. huge_pte_alloc(). Exactly that's my goal, that we
> should start with the "readers".

Yeah, it makes a lot of sense to tackle the readers first - we are
pretty close now to having enough done to have generic readers. I
would imagine tackling everything outside mm/huge*.c to use the normal
page table API for reading.

Then consider what to do with the reading in mm/huge*.c

> The writters might change semantics when merge, and can be more
> challenging, I'll need to look into details of each one, like page fault
> handlers. Such work may need to be analyzed case by case, and this GUP
> part is definitely the low hanging fruit comparing to the rest.

The write side is tricky; I think if the read side is sorted out then
it will be easier to reason about the write side. Today the write side
is paired with the special read side and it is extra hard to
understand if there is something weird hidden in the arch.

> measurements too when getting there. And btw, IIUC the major challenge of
> pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
> be that hard if that's the solo purpose. The better way to go is to remove
> mm_walk_ops.hugetlb_entry() altogether, which will cause a change in all
> callers; that's "the challenge".. pretty much labor works, not a major
> technical challenge it seems. Not sure if it's a good news or bad..

Ugh, that is a big pain. It is relying on that hugetlbfs trick of
passing in a pte that is not a pte to make the API generic..

The more I look at this the more I think we need to get to Matthew's
idea of having some kind of generic page table API that is not tightly
tied to level. Replacing the hugetlb trick of 'everything is a PTE'
with 5 special cases in every place seems just horrible.

struct mm_walk_ops {
	int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
}

And many cases really want something like:
	struct mm_walk_state state;

	if (!mm_walk_seek_leaf(state, mm, address))
		goto no_present
	if (mm_walk_is_write(state)) ..

And detailed walking:
	for_each_pt_leaf(state, mm, address) {
		if (mm_walk_is_write(state)) ..
	}

Replacing it with a mm_walk_state that retains the level, or otherwise
allows decoding any entry, composes a lot better. Forced loop
unrolling can get back to the current code gen in a lot of places.

It also makes the power stuff a bit nicer as the mm_walk_state could
automatically retain back pointers to the higher levels in the state
struct too...

The puzzle is how to do it and still get reasonably efficient codegen;
many operations are going to end up switching on some state->level to
know how to decode the entry.
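
A rough sketch of what that level switch could look like (mm_walk_state and
the PT_LEVEL_* names are hypothetical, part of the API idea above; only the
pXd_write()/__pXd() helpers are existing kernel APIs):

enum pt_level { PT_LEVEL_PTE, PT_LEVEL_PMD, PT_LEVEL_PUD };

struct mm_walk_state {
	enum pt_level level;
	unsigned long val;	/* raw entry value at that level */
};

/* Decode "is this leaf writable" by switching on the recorded level. */
static bool mm_walk_is_write(const struct mm_walk_state *state)
{
	switch (state->level) {
	case PT_LEVEL_PTE:
		return pte_write(__pte(state->val));
	case PT_LEVEL_PMD:
		return pmd_write(__pmd(state->val));
	case PT_LEVEL_PUD:
		return pud_write(__pud(state->val));
	}
	return false;
}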

> One thing I'll soon look into is to allow huge mappings for PFNMAP; there's
> one request from VFIO side for MMIO. Dropping mm_walk_ops.hugetlb_entry()
> seems to be good for all such purposes; well, I may need to bite the bullet
> here.. for either of the purposes to move on.

That would be a nice feature!

> > If, or how much, the hugepd write side remains special is the open
> > question, I think.
>
> It seems balls are rolling in that aspect, I haven't looked in depth there
> in Christophe's series but it's great to have started!

Yes!

Jason

2024-04-04 21:48:21

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> The more I look at this the more I think we need to get to Matthew's
> idea of having some kind of generic page table API that is not tightly
> tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> with 5 special cases in every place seems just horrible.
>
> struct mm_walk_ops {
> int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
> }
>
> And many cases really want something like:
> struct mm_walk_state state;
>
> if (!mm_walk_seek_leaf(state, mm, address))
> goto no_present
> if (mm_walk_is_write(state)) ..
>
> And detailed walking:
> for_each_pt_leaf(state, mm, address) {
> if (mm_walk_is_write(state)) ..
> }
>
> Replacing it with a mm_walk_state that retains the level or otherwise
> to allow decoding any entry composes a lot better. Forced Loop
> unrolling can get back to the current code gen in alot of places.
>
> It also makes the power stuff a bit nicer as the mm_walk_state could
> automatically retain back pointers to the higher levels in the state
> struct too...
>
> The puzzle is how to do it and still get reasonable efficient codegen,
> many operations are going to end up switching on some state->level to
> know how to decode the entry.

These discussions are definitely constructive, thanks Jason. Very helpful.

I thought about this last week but got interrupted. It does make sense to
me; it looks pretty generic and it is flexible enough as a top design. At
least that's what I thought.

However, now that I rethink it, and have looked more into the code when I
got the chance, it turns out this will be a major rewrite of mostly every
walker.. that doesn't mean it's a bad idea, but then I'll need to compare it
with the other approach, because there can be a huge difference in when
we can get that code ready, I think. :)

Consider that what we (or.. I) want to teach the pXd layers are two things
right now: (1) hugetlb mappings, (2) MMIO (PFN) mappings. Those mostly
share the generic concept when working on the mm walkers no matter which
way we go, just with different treatment for different types of memory.
(2) is on top of current code and is new stuff, while (1) is a refactoring
with the goal of dropping the hugetlb_entry() hook point.

Taking the simplest mm walker (smaps) as an example, I think most of the
code is ready thanks to THP's existence, and also helpers like
vm_normal_page[_pmd]() which should already work even for pfnmaps; the pud
layer is missing but that should be trivial. It means we may have a chance
to drop hugetlb_entry() without a huge overhaul yet.
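
A rough illustration of that pattern (not the actual fs/proc/task_mmu.c
code; account_huge_range() is a made-up placeholder, while pmd_leaf(),
vm_normal_page_pmd() and the pmd_entry hook are the real pieces):

/*
 * A pmd-level mm_walk_ops hook that handles a huge pmd generically, no
 * hugetlb_entry() needed: decode the leaf via vm_normal_page_pmd() and
 * account it; non-leaf pmds are left for the pte-level handler.
 * Locking is elided for brevity.
 */
static int smaps_like_pmd_entry(pmd_t *pmd, unsigned long addr,
				unsigned long next, struct mm_walk *walk)
{
	if (pmd_leaf(*pmd)) {
		struct page *page = vm_normal_page_pmd(walk->vma, addr, *pmd);

		if (page)
			account_huge_range(walk, page, next - addr);
		return 0;	/* huge mapping handled at this level */
	}
	return 0;		/* let the walker descend to the pte level */
}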

Now the important question I'm asking myself is: do we really need huge p4d
or even bigger? It's 512GB on x86, and we have said "No 512 GiB pages yet"
(commit fe1e8c3e963) since 2017 - that is 7 years without changing this
fact, while on non-x86 p4d_leaf() is never defined. Then it's also
interesting to see how much code is "ready" to handle p4d entries (by
looking at p4d_leaf() calls; much easier to see with the removal of the
rest of the huge APIs..) even if none exist.

So, could we over-engineer too much if we go the generic route now,
considering that we already have most of the pmd/pud entries around in the
mm walker ops? So far it sounds better to leave it for later, until it is
further justified to be useful. And that won't block it if it is ever
justified to be needed; I'd say it can also be seen as a step forward if I
manage to remove hugetlb_entry() first.

Comments welcomed (before I start to work on anything..).

Thanks,

--
Peter Xu


2024-04-05 18:17:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> > The more I look at this the more I think we need to get to Matthew's
> > idea of having some kind of generic page table API that is not tightly
> > tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> > with 5 special cases in every place seems just horrible.
> >
> > struct mm_walk_ops {
> > int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
> > }
> >
> > And many cases really want something like:
> > struct mm_walk_state state;
> >
> > if (!mm_walk_seek_leaf(state, mm, address))
> > goto no_present
> > if (mm_walk_is_write(state)) ..
> >
> > And detailed walking:
> > for_each_pt_leaf(state, mm, address) {
> > if (mm_walk_is_write(state)) ..
> > }
> >
> > Replacing it with a mm_walk_state that retains the level or otherwise
> > to allow decoding any entry composes a lot better. Forced Loop
> > unrolling can get back to the current code gen in alot of places.
> >
> > It also makes the power stuff a bit nicer as the mm_walk_state could
> > automatically retain back pointers to the higher levels in the state
> > struct too...
> >
> > The puzzle is how to do it and still get reasonable efficient codegen,
> > many operations are going to end up switching on some state->level to
> > know how to decode the entry.
>
> These discussions are definitely constructive, thanks Jason. Very helpful.
>
> I thought about this last week but got interrupted. It does make sense to
> me; it looks pretty generic and it is flexible enough as a top design. At
> least that's what I thought.

Yeah, exactly..

> However now when I rethink about it, and look more into the code when I got
> the chance, it turns out this will be a major rewrite of mostly every
> walkers..

Indeed, it is why it may not be reasonable.

> Consider that what we (or.. I) want to teach the pXd layers are two things
> right now: (1) hugetlb mappings (2) MMIO (PFN) mappings. That mostly
> shares the generic concept when working on the mm walkers no matter which
> way to go, just different treatment on different type of mem. (2) is on
> top of current code and new stuff, while (1) is a refactoring to drop
> hugetlb_entry() hook point as the goal.

Right, I view this as a two-pronged attack.

On one front you teach the generic pXX_* macros to process huge pages
and push that around to the performance walkers like GUP.

On another front you want to replace use of the hugepte with the new
walkers.

The challenge with the hugepte code is that it is all structured to
assume one API that works at all levels and that may be a hard fit to
replace with pXX_* functions.

The places that are easy to switch from hugetlb to pXX_* may as well
do so.

Other places maybe need a hugetlb replacement that has a similar
abstraction level of pointing to any page table level.

I think if you do the easy places for pXX conversion you will have a
good idea about what is needed for the hard places.

> Now the important question I'm asking myself is: do we really need huge p4d
> or even bigger?

Do we need huge p4d support with folios? Probably not..

Huge p4d support for pfnmap, e.g. in VFIO? Yes, I think that is possibly
interesting - but I wouldn't ask anyone to do the work :)

But then again we come back to power and its big list of page sizes
and variety :( It looks like some configurations there have huge sizes at
the pgd level, at least.

> So, can we over-engineer too much if we go the generic route now?

Yes we can, and it will probably be slow as well. The pXX macros are
the most efficient if code can be adapted to use them.

> Considering that we already have most of pmd/pud entries around in the mm
> walker ops.

Yeah, so you add pgd and maybe p4d, and then we don't need any
generic thing. If it is easy it would be nice.

Jason

2024-04-05 21:43:08

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Fri, Apr 05, 2024 at 03:16:33PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> > On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> > > The more I look at this the more I think we need to get to Matthew's
> > > idea of having some kind of generic page table API that is not tightly
> > > tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> > > with 5 special cases in every place seems just horrible.
> > >
> > > struct mm_walk_ops {
> > > int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
> > > }
> > >
> > > And many cases really want something like:
> > > struct mm_walk_state state;
> > >
> > > if (!mm_walk_seek_leaf(state, mm, address))
> > > goto no_present
> > > if (mm_walk_is_write(state)) ..
> > >
> > > And detailed walking:
> > > for_each_pt_leaf(state, mm, address) {
> > > if (mm_walk_is_write(state)) ..
> > > }
> > >
> > > Replacing it with a mm_walk_state that retains the level or otherwise
> > > to allow decoding any entry composes a lot better. Forced Loop
> > > unrolling can get back to the current code gen in alot of places.
> > >
> > > It also makes the power stuff a bit nicer as the mm_walk_state could
> > > automatically retain back pointers to the higher levels in the state
> > > struct too...
> > >
> > > The puzzle is how to do it and still get reasonable efficient codegen,
> > > many operations are going to end up switching on some state->level to
> > > know how to decode the entry.
> >
> > These discussions are definitely constructive, thanks Jason. Very helpful.
> >
> > I thought about this last week but got interrupted. It does make sense to
> > me; it looks pretty generic and it is flexible enough as a top design. At
> > least that's what I thought.
>
> Yeah, exactly..
>
> > However now when I rethink about it, and look more into the code when I got
> > the chance, it turns out this will be a major rewrite of mostly every
> > walkers..
>
> Indeed, it is why it may not be reasonable.
>
> > Consider that what we (or.. I) want to teach the pXd layers are two things
> > right now: (1) hugetlb mappings (2) MMIO (PFN) mappings. That mostly
> > shares the generic concept when working on the mm walkers no matter which
> > way to go, just different treatment on different type of mem. (2) is on
> > top of current code and new stuff, while (1) is a refactoring to drop
> > hugetlb_entry() hook point as the goal.
>
> Right, I view this as a two pronged attack
>
> One one front you teach the generic pXX_* macros to process huge pages
> and push that around to the performance walkers like GUP
>
> On another front you want to replace use of the hugepte with the new
> walkers.
>
> The challenge with the hugepte code is that it is all structured to
> assume one API that works at all levels and that may be a hard fit to
> replace with pXX_* functions.

That's the core of the problem, otherwise I feel like I might be doing
something else already. I have a feeling that even if it's currently
efficient for hugetlb, we'll drop that sooner or later.

The issue is that at least the hugetlb pgtable format is exactly the same
as the rest; as large folio support grows, we'll reach the point where we
complain more than before about hugetlb doing its smart things on its own.

>
> The places that are easy to switch from hugetlb to pXX_* may as well
> do so.
>
> Other places maybe need a hugetlb replacement that has a similar
> abstraction level of pointing to any page table level.

IMHO this depends.

Per my current knowledge, hugetlb is only special in three forms:

- huge mapping (well, this isn't that special..)
- cont pte/pmd/...
- hugepd

The fanciest one is actually hugepd.. but I plan to put that temporarily
aside - I haven't looked at Christophe's series yet, however I think we're
going to solve orthogonal issues and his work will definitely help me reach
mine, so I want to think it through on my end first to figure out a plan.
If hugepd has a chance to be generalized, the worst case is I'll leverage
CONFIG_ARCH_HAS_HUGEPD and only keep hugetlb_entry() for them until hugepd
becomes some cont-pXx variant. Then with HAS_HUGEPD put aside, I don't
think the rest is very complicated, perhaps?

In short, hugetlb mappings shouldn't be special compared to other huge pXd
and large folio (cont-pXd) mappings for most of the walkers in my mind, if
not all. I need to look at all the walkers and there can be some tricky
ones, but I believe that applies in general. It's actually similar to what
I did with slow gup here.

Like this series, for cont-pXd we'll need multiple walks compared to
before (when we had hugetlb_entry()), but for that part I'll provide some
performance tests too, and we also have a fallback plan, which is to detect
cont-pXd existence, which will also work for large folios.
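
To illustrate that fallback (a rough sketch only; pte_cont() and
CONT_PTE_SIZE/CONT_PTE_MASK below are arm64-specific helpers, other archs
would need their own equivalents):

	unsigned long addr, step;

	for (addr = start; addr < end; addr += step, ptep += step >> PAGE_SHIFT) {
		pte_t pte = ptep_get(ptep);

		step = PAGE_SIZE;
		if (pte_present(pte) && pte_cont(pte)) {
			/* skip over the rest of the contiguous block at once */
			step = CONT_PTE_SIZE - (addr & ~CONT_PTE_MASK);
			if (addr + step > end)
				step = end - addr;
		}

		/* ... treat [addr, addr + step) as one mapping ... */
	}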

>
> I think if you do the easy places for pXX conversion you will have a
> good idea about what is needed for the hard places.

Here IMHO we don't need to understand "what is the size of this hugetlb
vma", or "at which pgtable level are this hugetlb vma's pages mapped",
because we may not need that, e.g., when we only want to collect some smaps
statistics. "Whether it's hugetlb" may matter, though. E.g. when the mm
walker sees a huge pmd, it can be a thp or it can be hugetlb (once
hugetlb_entry() is removed); we may need an extra check later to put things
into the right bucket, but the walker itself doesn't necessarily need
hugetlb_entry().
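
To be concrete, the walker-side check could be as dumb as something like
below (only a sketch, ignoring special/pfn mappings; account_hugetlb() and
account_thp() are made-up placeholders for whatever the walker wants to do):

	if (pmd_present(pmd) && pmd_leaf(pmd)) {
		struct folio *folio = page_folio(pmd_page(pmd));

		/* same pgtable entry; only the accounting bucket differs */
		if (folio_test_hugetlb(folio))
			account_hugetlb(folio, addr);	/* hypothetical */
		else
			account_thp(folio, addr);	/* hypothetical */
	}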

>
> > Now the important question I'm asking myself is: do we really need huge p4d
> > or even bigger?
>
> Do we need huge p4d support with folios? Probably not..
>
> huge p4d support for pfnmap, eg in VFIO. Yes I think that is possibly
> interesting - but I wouldn't ask anyone to do the work :)

Considering it does not yet exist, and we don't have a plan to work on it
(so far), maybe I can take this as a very implicit ack that I can put that
aside at least for now. :)

>
> But then again we come back to power and its big list of page sizes
> and variety :( Looks like some there have huge sizes at the pgd level
> at least.

Yeah this is something I want to be super clear about, because I may be
missing something: we don't have real pgd pages, right? Powerpc doesn't
even define p4d_leaf(), AFAICT.

$ git grep p4d_leaf
arch/powerpc/mm/book3s64/radix_pgtable.c: if (p4d_leaf(*p4d)) {
arch/powerpc/mm/pgtable.c: if (p4d_leaf(p4d)) {
arch/powerpc/mm/pgtable_64.c: if (p4d_leaf(p4d)) {
arch/powerpc/mm/pgtable_64.c: VM_WARN_ON(!p4d_leaf(p4d));
arch/powerpc/xmon/xmon.c: if (p4d_leaf(*p4dp)) {

They all constantly return false (which is the fallback)?
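
IIUC that's the generic fallback in include/linux/pgtable.h kicking in,
which from memory is something like below (exact form may differ):

#ifndef p4d_leaf
#define p4d_leaf(x)	false
#endif
#ifndef pgd_leaf
#define pgd_leaf(x)	false
#endif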

>
> > So, can we over-engineer too much if we go the generic route now?
>
> Yes we can, and it will probably be slow as well. The pXX macros are
> the most efficient if code can be adapted to use them.
>
> > Considering that we already have most of pmd/pud entries around in the mm
> > walker ops.
>
> Yeah, so you add pgd and maybe p4d and then we can don't need any
> generic thing. If it is easy it would be nice.

I want to double check the above on PowerPC first, otherwise I'd consider
leaving p4d+ alone, if that looks ok. It could be that I missed something
important; Christophe may want to chime in if he has any thoughts or just
to clarify my mistakes.

Thanks,

--
Peter Xu


2024-04-09 23:44:53

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> not all. I need to look at all the walkers and there can be some tricky
> ones, but I believe that applies in general. It's actually similar to what
> I did with slow gup here.

I think that is the big question, I also haven't done the research to
know the answer.

At this point focusing on moving what is reasonable to the pXX_* API
makes sense to me. Then reviewing what remains and making some
decision.

> Like this series, for cont-pXd we'll need multiple walks comparing to
> before (when with hugetlb_entry()), but for that part I'll provide some
> performance tests too, and we also have a fallback plan, which is to detect
> cont-pXd existance, which will also work for large folios.

I think we can optimize this pretty easily.

> > I think if you do the easy places for pXX conversion you will have a
> > good idea about what is needed for the hard places.
>
> Here IMHO we don't need to understand "what is the size of this hugetlb
> vma"

Yeh, I never really understood why hugetlb was linked to the VMA.. The
page table is self describing, obviously.

> or "which level of pgtable does this hugetlb vma pages locate",

Ditto

> because we may not need that, e.g., when we only want to collect some smaps
> statistics. "whether it's hugetlb" may matter, though. E.g. in the mm
> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> hugetlb_entry removed), we may need extra check later to put things into
> the right bucket, but for the walker itself it doesn't necessarily need
> hugetlb_entry().

Right, places may still need to know it is part of a huge VMA because we
have special stuff linked to that.

> > But then again we come back to power and its big list of page sizes
> > and variety :( Looks like some there have huge sizes at the pgd level
> > at least.
>
> Yeah this is something I want to be super clear, because I may miss
> something: we don't have real pgd pages, right? Powerpc doesn't even
> define p4d_leaf(), AFAICT.

AFAICT it is because it hides it all in hugepd.

If the goal is to purge hugepd then some of the options might turn out
to convert hugepd into huge p4d/pgd, as I understand it. It would be
nice to have certainty on this at least.

We have effectively three APIs to parse a single page table and
currently none of the APIs can return 100% of the data for power.

Jason

2024-04-10 15:29:01

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> > In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> > and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> > not all. I need to look at all the walkers and there can be some tricky
> > ones, but I believe that applies in general. It's actually similar to what
> > I did with slow gup here.
>
> I think that is the big question, I also haven't done the research to
> know the answer.
>
> At this point focusing on moving what is reasonable to the pXX_* API
> makes sense to me. Then reviewing what remains and making some
> decision.
>
> > Like this series, for cont-pXd we'll need multiple walks comparing to
> > before (when with hugetlb_entry()), but for that part I'll provide some
> > performance tests too, and we also have a fallback plan, which is to detect
> > cont-pXd existance, which will also work for large folios.
>
> I think we can optimize this pretty easy.
>
> > > I think if you do the easy places for pXX conversion you will have a
> > > good idea about what is needed for the hard places.
> >
> > Here IMHO we don't need to understand "what is the size of this hugetlb
> > vma"
>
> Yeh, I never really understood why hugetlb was linked to the VMA.. The
> page table is self describing, obviously.

Attaching to the vma still makes sense to me, since we should definitely
avoid a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb
pages are allocated, managed, ... totally differently.

And since hugetlb is designed as file-based (which also makes sense to me,
at least for now), it's also natural that it's vma-attached.

>
> > or "which level of pgtable does this hugetlb vma pages locate",
>
> Ditto
>
> > because we may not need that, e.g., when we only want to collect some smaps
> > statistics. "whether it's hugetlb" may matter, though. E.g. in the mm
> > walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> > hugetlb_entry removed), we may need extra check later to put things into
> > the right bucket, but for the walker itself it doesn't necessarily need
> > hugetlb_entry().
>
> Right, places may still need to know it is part of a huge VMA because we
> have special stuff linked to that.
>
> > > But then again we come back to power and its big list of page sizes
> > > and variety :( Looks like some there have huge sizes at the pgd level
> > > at least.
> >
> > Yeah this is something I want to be super clear, because I may miss
> > something: we don't have real pgd pages, right? Powerpc doesn't even
> > define p4d_leaf(), AFAICT.
>
> AFAICT it is because it hides it all in hugepd.

IMHO one thing we can benefit from with such a hugepd rework is: if we can
squash all the hugepds like what Christophe does, then push them one more
layer down, we have a good chance that all things will just work.

So again my Power brain is close to zero, but now I'm referring to what
Christophe shared in the other thread:

https://github.com/linuxppc/wiki/wiki/Huge-pages

Together with:

https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu

Where it has:

--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,6 +98,7 @@ config PPC_BOOK3S_64
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select HAVE_MOVE_PMD
@@ -290,6 +291,7 @@ config PPC_BOOK3S
 config PPC_E500
 	select FSL_EMB_PERFMON
 	bool
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
 	select PPC_SMP_MUXED_IPI
 	select PPC_DOORBELL

So I think it means we have three PowerPC systems that support hugepd
right now: besides the 8xx (where Christophe is trying to drop that
support), we still have book3s_64 and e500.

Let's check one by one:

  - book3s_64

    - hash

      - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It
        means after squashing it'll be a bunch of cont-pmd, all good.

      - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
        it'll be cont-pud. all good.

    - radix

      - 64K: largest 1G @pud, then cont-pmd after squashed. all good.

      - 4K: largest 1G @pud, then cont-pmd, all good.

  - e500 & 8xx

    - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
      @pgd level they become cont-pte. all good.

I think the trick here is that there will be no pgd leaves left after
hugepd is squashed into lower levels; then, since PowerPC never seems to
use p4d, all things fall into pud or lower. We seem to be all good there?

>
> If the goal is to purge hugepd then some of the options might turn out
> to convert hugepd into huge p4d/pgd, as I understand it. It would be
> nice to have certainty on this at least.

Right. I hope the pmd/pud plan I proposed above can already work with such
an ambitious goal too. But reviews are very welcome from either you or
Christophe.

PS: I think I'll also have a closer look at Christophe's series this week
or next.

>
> We have effectively three APIs to parse a single page table and
> currently none of the APIs can return 100% of the data for power.

Thanks,

--
Peter Xu


2024-04-10 16:31:00

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2



On 10/04/2024 at 17:28, Peter Xu wrote:
> On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
>> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
>>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
>>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
>>> not all. I need to look at all the walkers and there can be some tricky
>>> ones, but I believe that applies in general. It's actually similar to what
>>> I did with slow gup here.
>>
>> I think that is the big question, I also haven't done the research to
>> know the answer.
>>
>> At this point focusing on moving what is reasonable to the pXX_* API
>> makes sense to me. Then reviewing what remains and making some
>> decision.
>>
>>> Like this series, for cont-pXd we'll need multiple walks comparing to
>>> before (when with hugetlb_entry()), but for that part I'll provide some
>>> performance tests too, and we also have a fallback plan, which is to detect
>>> cont-pXd existance, which will also work for large folios.
>>
>> I think we can optimize this pretty easy.
>>
>>>> I think if you do the easy places for pXX conversion you will have a
>>>> good idea about what is needed for the hard places.
>>>
>>> Here IMHO we don't need to understand "what is the size of this hugetlb
>>> vma"
>>
>> Yeh, I never really understood why hugetlb was linked to the VMA.. The
>> page table is self describing, obviously.
>
> Attaching to vma still makes sense to me, where we should definitely avoid
> a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> allocated, managed, ... totally differently.
>
> And since hugetlb is designed as file-based (which also makes sense to me,
> at least for now), it's also natural that it's vma-attached.
>
>>
>>> or "which level of pgtable does this hugetlb vma pages locate",
>>
>> Ditto
>>
>>> because we may not need that, e.g., when we only want to collect some smaps
>>> statistics. "whether it's hugetlb" may matter, though. E.g. in the mm
>>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
>>> hugetlb_entry removed), we may need extra check later to put things into
>>> the right bucket, but for the walker itself it doesn't necessarily need
>>> hugetlb_entry().
>>
>> Right, places may still need to know it is part of a huge VMA because we
>> have special stuff linked to that.
>>
>>>> But then again we come back to power and its big list of page sizes
>>>> and variety :( Looks like some there have huge sizes at the pgd level
>>>> at least.
>>>
>>> Yeah this is something I want to be super clear, because I may miss
>>> something: we don't have real pgd pages, right? Powerpc doesn't even
>>> define p4d_leaf(), AFAICT.
>>
>> AFAICT it is because it hides it all in hugepd.
>
> IMHO one thing we can benefit from such hugepd rework is, if we can squash
> all the hugepds like what Christophe does, then we push it one more layer
> down, and we have a good chance all things should just work.
>
> So again my Power brain is close to zero, but now I'm referring to what
> Christophe shared in the other thread:
>
> https://github.com/linuxppc/wiki/wiki/Huge-pages
>
> Together with:
>
> https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu
>
> Where it has:
>
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> select ARCH_ENABLE_SPLIT_PMD_PTLOCK
> select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> + select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> select ARCH_SUPPORTS_HUGETLBFS
> select ARCH_SUPPORTS_NUMA_BALANCING
> select HAVE_MOVE_PMD
> @@ -290,6 +291,7 @@ config PPC_BOOK3S
> config PPC_E500
> select FSL_EMB_PERFMON
> bool
> + select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
> select PPC_SMP_MUXED_IPI
> select PPC_DOORBELL
>
> So I think it means we have three PowerPC systems that supports hugepd
> right now (besides the 8xx which Christophe is trying to drop support
> there), besides 8xx we still have book3s_64 and E500.
>
> Let's check one by one:
>
> - book3s_64
>
> - hash
>
> - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It
> means after squashing it'll be a bunch of cont-pmd, all good.
>
> - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
> it'll be cont-pud. all good.
>
> - radix
>
> - 64K: largest 1G @pud, then cont-pmd after squashed. all good.
>
> - 4K: largest 1G @pud, then cont-pmd, all good.
>
> - e500 & 8xx
>
> - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
> @pgd level they become cont-pte. all good.

e500 has two modes: 32 bits and 64 bits.

For 32 bits:

8xx is the only one handling it through HW-assisted pagetable walk, hence
requiring 2 levels whatever the pagesize is.

On e500 it is all software so pages 2M and larger should be cont-PGD (by
the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD
and PTE, the PGD entries are populated by a function called PMD_populate()).

Current situation for 8xx is illustrated here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx

I also tried to better illustrate e500/32 here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#e500

For 64 bits:
We have PTE/PMD/PUD/PGD, no P4D

See arch/powerpc/include/asm/nohash/64/pgtable-4k.h


>
> I think the trick here is there'll be no pgd leaves after hugepd squashing
> to lower levels, then since PowerPC seems to never have p4d, then all
> things fall into pud or lower. We seem to be all good there?
>
>>
>> If the goal is to purge hugepd then some of the options might turn out
>> to convert hugepd into huge p4d/pgd, as I understand it. It would be
>> nice to have certainty on this at least.
>
> Right. I hope the pmd/pud plan I proposed above can already work too with
> such ambicious goal too. But review very welcomed from either you or
> Christophe.
>
> PS: I think I'll also have a closer look at Christophe's series this week
> or next.
>
>>
>> We have effectively three APIs to parse a single page table and
>> currently none of the APIs can return 100% of the data for power.
>
> Thanks,
>

2024-04-10 20:10:19

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

On Wed, Apr 10, 2024 at 04:30:41PM +0000, Christophe Leroy wrote:
>
>
> Le 10/04/2024 à 17:28, Peter Xu a écrit :
> > On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
> >> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> >>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> >>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> >>> not all. I need to look at all the walkers and there can be some tricky
> >>> ones, but I believe that applies in general. It's actually similar to what
> >>> I did with slow gup here.
> >>
> >> I think that is the big question, I also haven't done the research to
> >> know the answer.
> >>
> >> At this point focusing on moving what is reasonable to the pXX_* API
> >> makes sense to me. Then reviewing what remains and making some
> >> decision.
> >>
> >>> Like this series, for cont-pXd we'll need multiple walks comparing to
> >>> before (when with hugetlb_entry()), but for that part I'll provide some
> >>> performance tests too, and we also have a fallback plan, which is to detect
> >>> cont-pXd existance, which will also work for large folios.
> >>
> >> I think we can optimize this pretty easy.
> >>
> >>>> I think if you do the easy places for pXX conversion you will have a
> >>>> good idea about what is needed for the hard places.
> >>>
> >>> Here IMHO we don't need to understand "what is the size of this hugetlb
> >>> vma"
> >>
> >> Yeh, I never really understood why hugetlb was linked to the VMA.. The
> >> page table is self describing, obviously.
> >
> > Attaching to vma still makes sense to me, where we should definitely avoid
> > a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> > allocated, managed, ... totally differently.
> >
> > And since hugetlb is designed as file-based (which also makes sense to me,
> > at least for now), it's also natural that it's vma-attached.
> >
> >>
> >>> or "which level of pgtable does this hugetlb vma pages locate",
> >>
> >> Ditto
> >>
> >>> because we may not need that, e.g., when we only want to collect some smaps
> >>> statistics. "whether it's hugetlb" may matter, though. E.g. in the mm
> >>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> >>> hugetlb_entry removed), we may need extra check later to put things into
> >>> the right bucket, but for the walker itself it doesn't necessarily need
> >>> hugetlb_entry().
> >>
> >> Right, places may still need to know it is part of a huge VMA because we
> >> have special stuff linked to that.
> >>
> >>>> But then again we come back to power and its big list of page sizes
> >>>> and variety :( Looks like some there have huge sizes at the pgd level
> >>>> at least.
> >>>
> >>> Yeah this is something I want to be super clear, because I may miss
> >>> something: we don't have real pgd pages, right? Powerpc doesn't even
> >>> define p4d_leaf(), AFAICT.
> >>
> >> AFAICT it is because it hides it all in hugepd.
> >
> > IMHO one thing we can benefit from such hugepd rework is, if we can squash
> > all the hugepds like what Christophe does, then we push it one more layer
> > down, and we have a good chance all things should just work.
> >
> > So again my Power brain is close to zero, but now I'm referring to what
> > Christophe shared in the other thread:
> >
> > https://github.com/linuxppc/wiki/wiki/Huge-pages
> >
> > Together with:
> >
> > https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu
> >
> > Where it has:
> >
> > --- a/arch/powerpc/platforms/Kconfig.cputype
> > +++ b/arch/powerpc/platforms/Kconfig.cputype
> > @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
> > select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> > select ARCH_ENABLE_SPLIT_PMD_PTLOCK
> > select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> > select ARCH_SUPPORTS_HUGETLBFS
> > select ARCH_SUPPORTS_NUMA_BALANCING
> > select HAVE_MOVE_PMD
> > @@ -290,6 +291,7 @@ config PPC_BOOK3S
> > config PPC_E500
> > select FSL_EMB_PERFMON
> > bool
> > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> > select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
> > select PPC_SMP_MUXED_IPI
> > select PPC_DOORBELL
> >
> > So I think it means we have three PowerPC systems that supports hugepd
> > right now (besides the 8xx which Christophe is trying to drop support
> > there), besides 8xx we still have book3s_64 and E500.
> >
> > Let's check one by one:
> >
> > - book3s_64
> >
> > - hash
> >
> > - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It
> > means after squashing it'll be a bunch of cont-pmd, all good.
> >
> > - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
> > it'll be cont-pud. all good.
> >
> > - radix
> >
> > - 64K: largest 1G @pud, then cont-pmd after squashed. all good.
> >
> > - 4K: largest 1G @pud, then cont-pmd, all good.
> >
> > - e500 & 8xx
> >
> > - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
> > @pgd level they become cont-pte. all good.
>
> e500 has two modes: 32 bits and 64 bits.
>
> For 32 bits:
>
> 8xx is the only one handling it through HW-assisted pagetable walk hence
> requiring a 2-level whatever the pagesize is.

Hmm I think maybe finally I get it..

I think the confusion came from seeing that there's always such a level-2
table described in Figure 8-5 of the manual:

https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

So I suppose you meant that for 8M, the PowerPC 8xx hardware will be aware
of such an 8M pgtable (from the level-1 entry, where it has bits 28-29 set
to 011b), so it won't ever read anything starting from "Level-2 Descriptor
1" (it will only read the single entry "Level-2 Descriptor 0"), so
fundamentally the hugepd format must look like that for 8xx?

But then perhaps it's still compatible with cont-pte because the rest of
the entries (pte index 1+) will simply be ignored by the hardware?

>
> On e500 it is all software so pages 2M and larger should be cont-PGD (by
> the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD
> and PTE, the PGD entries are populated by a function called PMD_populate()).

Yeah.. I am also wondering whether pgd_populate() could also work there
(perhaps with some trivial changes, or maybe not even needed..), as when
p4d/pud/pmd levels are missing, linux should just do something like an
enforced cast from pgd_t* -> pmd_t* in this case.

I think currently they're already not pgd, as __find_linux_pte() already
skips the pgd unconditionally:

pgdp = pgdir + pgd_index(ea);
p4dp = p4d_offset(pgdp, ea);
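
And with p4d folded away, that p4d_offset() should be a no-op cast, i.e.
something like what the asm-generic nop4d header does (quoting from memory,
please double check):

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	return (p4d_t *)pgd;
}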

>
> Current situation for 8xx is illustrated here:
> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
>
> I also tried to better illustrate e500/32 here:
> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
>
> For 64 bits:
> We have PTE/PMD/PUD/PGD, no P4D
>
> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h

We don't have anything that is above pud in this category, right? That's
what I read from your wiki (and thanks for providing that in the first
place; it helps a lot for me to understand how things work on PowerPC).

I want to make sure whether I can move on without caring about p4d/pgd
leaves like what we do right now, even after we remove hugepd for good; in
that case, since p4d is always missing, it's about whether
"pud|pmd|pte_leaf()" can also cover the pgd ones when that day comes, iiuc.

Thanks,

>
>
> >
> > I think the trick here is there'll be no pgd leaves after hugepd squashing
> > to lower levels, then since PowerPC seems to never have p4d, then all
> > things fall into pud or lower. We seem to be all good there?
> >
> >>
> >> If the goal is to purge hugepd then some of the options might turn out
> >> to convert hugepd into huge p4d/pgd, as I understand it. It would be
> >> nice to have certainty on this at least.
> >
> > Right. I hope the pmd/pud plan I proposed above can already work too with
> > such ambicious goal too. But review very welcomed from either you or
> > Christophe.
> >
> > PS: I think I'll also have a closer look at Christophe's series this week
> > or next.
> >
> >>
> >> We have effectively three APIs to parse a single page table and
> >> currently none of the APIs can return 100% of the data for power.
> >
> > Thanks,
> >

--
Peter Xu


2024-04-12 14:28:46

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2



On 10/04/2024 at 21:58, Peter Xu wrote:
>>
>> e500 has two modes: 32 bits and 64 bits.
>>
>> For 32 bits:
>>
>> 8xx is the only one handling it through HW-assisted pagetable walk hence
>> requiring a 2-level whatever the pagesize is.
>
> Hmm I think maybe finally I get it..
>
> I think the confusion came from when I saw there's always such level-2
> table described in Figure 8-5 of the manual:
>
> https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

Yes indeed that figure is confusing.

Table 8-1 gives a pretty good idea of what is required. We only use
MD_CTR[TWAM] = 1

>
> So I suppose you meant for 8M, the PowerPC 8xx system hardware will be
> aware of such 8M pgtable (from level-1's entry, where it has bit 28-29 set
> 011b), then it won't ever read anything starting from "Level-2 Descriptor
> 1" (but only read the only entry "Level-2 Descriptor 0"), so fundamentally
> hugepd format must look like such for 8xx?
>
> But then perhaps it's still compatible with cont-pte because the rest
> entries (pte index 1+) will simply be ignored by the hardware?

Yes, still compatible with CONT-PTE although things become tricky because
you need two page tables to get the full 8M, so that's a kind of cont-PMD
down to PTE level, as you can see in my RFC series.

>
>>
>> On e500 it is all software so pages 2M and larger should be cont-PGD (by
>> the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD
>> and PTE, the PGD entries are populated by a function called PMD_populate()).
>
> Yeah.. I am also wondering whether pgd_populate() could also work there
> (perhaps with some trivial changes, or maybe not even needed..), as when
> p4d/pud/pmd levels are missing, linux should just do something like an
> enforced cast from pgd_t* -> pmd_t* in this case.
>
> I think currently they're already not pgd, as __find_linux_pte() already
> skipped pgd unconditionally:
>
> pgdp = pgdir + pgd_index(ea);
> p4dp = p4d_offset(pgdp, ea);
>

Yes, that's what is confusing: some parts of the code consider we have
only a PGD and a PT while other parts consider we have only a PMD and a PT.

>>
>> Current situation for 8xx is illustrated here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
>>
>> I also tried to better illustrate e500/32 here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
>>
>> For 64 bits:
>> We have PTE/PMD/PUD/PGD, no P4D
>>
>> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h
>
> We don't have anything that is above pud in this category, right? That's
> what I read from your wiki (and thanks for providing that in the first
> place; helps a lot for me to understand how it works on PowerPC).

Yes thanks to Michael and Aneesh who initiated that Wiki page.

>
> I want to make sure if I can move on without caring on p4d/pgd leafs like
> what we do right now, even after if we can remove hugepd for good, in this
> case since p4d always missing, then it's about whether "pud|pmd|pte_leaf()"
> can also cover the pgd ones when that day comes, iiuc.

I guess so but I'd like Aneesh and/or Michael to confirm as I'm not an
expert on PPC64.

Christophe