2024-03-13 21:47:43

by Peter Xu

[permalink] [raw]
Subject: [PATCH 00/13] mm/treewide: Remove pXd_huge() API

From: Peter Xu <[email protected]>

[based on akpm/mm-unstable latest commit 9af2e4c429b5]

v1:
- Rebase, remove RFC tag
- Fixed powerpc patch build issue, enhancing commit message [Michael]
- Optimize patch 1 & 3 on "none || !present" check [Jason]

In previous work [1], we removed the pXd_large() API, which is arch
specific. This patchset further removes the hugetlb pXd_huge() API.

Hugetlb was never special on creating huge mappings when compared with
other huge mappings. Having a standalone API just to detect such pgtable
entries is more or less redundant, especially after the pXd_leaf() API set
is introduced with/without CONFIG_HUGETLB_PAGE.

When looking at this problem, a few issues are also exposed that we don't
have a clear definition of the *_huge() variance API. This patchset
started by cleaning these issues first, then replace all *_huge() users to
use *_leaf(), then drop all *_huge() code.

On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for
all the rest archs they're reported "false" instead. This part is done in
patch 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll
leave that to hmm experts to decide.

Besides, there are three archs (arm, arm64, powerpc) that have slightly
different definitions between the *_huge() v.s. *_leaf() variances. I
tackled them separately so that it'll be easier for arch experts to chim in
when necessary. This part is done in patch 6-9.

The final patches 10-13 do the rest on the final removal, since *_leaf()
will be the ultimate API in the future, and we seem to have quite some
confusions on how *_huge() APIs can be defined, provide a rich comment for
*_leaf() API set to define them properly to avoid future misuse, and
hopefully that'll also help new archs to start support huge mappings and
avoid traps (like either swap entries, or PROT_NONE entry checks).

The whole series is only lightly tested on x86, while as usual I don't have
the capability to test all archs that it touches.

[1] https://lore.kernel.org/r/[email protected]

Peter Xu (13):
mm/hmm: Process pud swap entry without pud_huge()
mm/gup: Cache p4d in follow_p4d_mask()
mm/gup: Check p4d presence before going on
mm/x86: Change pXd_huge() behavior to exclude swap entries
mm/sparc: Change pXd_huge() behavior to exclude swap entries
mm/arm: Use macros to define pmd/pud helpers
mm/arm: Redefine pmd_huge() with pmd_leaf()
mm/arm64: Merge pXd_huge() and pXd_leaf() definitions
mm/powerpc: Redefine pXd_huge() with pXd_leaf()
mm/gup: Merge pXd huge mapping checks
mm/treewide: Replace pXd_huge() with pXd_leaf()
mm/treewide: Remove pXd_huge()
mm: Document pXd_leaf() API

arch/arm/include/asm/pgtable-2level.h | 4 +--
arch/arm/include/asm/pgtable-3level-hwdef.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 6 ++--
arch/arm/mm/Makefile | 1 -
arch/arm/mm/hugetlbpage.c | 34 -------------------
arch/arm64/include/asm/pgtable.h | 6 +++-
arch/arm64/mm/hugetlbpage.c | 18 ++--------
arch/loongarch/mm/hugetlbpage.c | 12 +------
arch/mips/include/asm/pgtable-32.h | 2 +-
arch/mips/include/asm/pgtable-64.h | 2 +-
arch/mips/mm/hugetlbpage.c | 10 ------
arch/mips/mm/tlb-r4k.c | 2 +-
arch/parisc/mm/hugetlbpage.c | 11 ------
.../include/asm/book3s/64/pgtable-4k.h | 20 -----------
.../include/asm/book3s/64/pgtable-64k.h | 25 --------------
arch/powerpc/include/asm/book3s/64/pgtable.h | 27 +++++++--------
arch/powerpc/include/asm/nohash/pgtable.h | 10 ------
arch/powerpc/mm/pgtable_64.c | 6 ++--
arch/riscv/mm/hugetlbpage.c | 10 ------
arch/s390/mm/hugetlbpage.c | 10 ------
arch/sh/mm/hugetlbpage.c | 10 ------
arch/sparc/mm/hugetlbpage.c | 12 -------
arch/x86/mm/hugetlbpage.c | 26 --------------
arch/x86/mm/pgtable.c | 4 +--
include/linux/hugetlb.h | 24 -------------
include/linux/pgtable.h | 24 ++++++++++---
mm/gup.c | 24 ++++++-------
mm/hmm.c | 9 ++---
mm/memory.c | 2 +-
29 files changed, 68 insertions(+), 284 deletions(-)
delete mode 100644 arch/arm/mm/hugetlbpage.c

--
2.44.0



2024-03-13 21:48:09

by Peter Xu

[permalink] [raw]
Subject: [PATCH 01/13] mm/hmm: Process pud swap entry without pud_huge()

From: Peter Xu <[email protected]>

Swap pud entries do not always return true for pud_huge() for all archs.
x86 and sparc (so far) allow it, but all the rest do not accept a swap
entry to be reported as pud_huge(). So it's not safe to check swap entries
within pud_huge(). Check swap entries before pud_huge(), so it should be
always safe.

This is the only place in the kernel that (IMHO, wrongly) relies on
pud_huge() to return true on pud swap entries. The plan is to cleanup
pXd_huge() to only report non-swap mappings for all archs.

Cc: Alistair Popple <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hmm.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 277ddcab4947..c95b9ec5d95f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -424,7 +424,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
walk->action = ACTION_CONTINUE;

pud = READ_ONCE(*pudp);
- if (pud_none(pud)) {
+ if (!pud_present(pud)) {
spin_unlock(ptl);
return hmm_vma_walk_hole(start, end, -1, walk);
}
@@ -435,11 +435,6 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
unsigned long *hmm_pfns;
unsigned long cpu_flags;

- if (!pud_present(pud)) {
- spin_unlock(ptl);
- return hmm_vma_walk_hole(start, end, -1, walk);
- }
-
i = (addr - range->start) >> PAGE_SHIFT;
npages = (end - addr) >> PAGE_SHIFT;
hmm_pfns = &range->hmm_pfns[i];
--
2.44.0


2024-03-13 21:48:33

by Peter Xu

[permalink] [raw]
Subject: [PATCH 03/13] mm/gup: Check p4d presence before going on

From: Peter Xu <[email protected]>

Currently there should have no p4d swap entries so it may not matter much,
however this may help us to rule out swap entries in pXd_huge() API, which
will include p4d_huge(). The p4d_present() checks make it 100% clear that
we won't rely on p4d_huge() for swap entries.

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 69a777f4fc5c..802987281b2f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -776,7 +776,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,

p4dp = p4d_offset(pgdp, address);
p4d = READ_ONCE(*p4dp);
- if (p4d_none(p4d))
+ if (!p4d_present(p4d))
return no_page_table(vma, flags);
BUILD_BUG_ON(p4d_huge(p4d));
if (unlikely(p4d_bad(p4d)))
@@ -3069,7 +3069,7 @@ static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned lo
p4d_t p4d = READ_ONCE(*p4dp);

next = p4d_addr_end(addr, end);
- if (p4d_none(p4d))
+ if (!p4d_present(p4d))
return 0;
BUILD_BUG_ON(p4d_huge(p4d));
if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
--
2.44.0


2024-03-13 21:48:39

by Peter Xu

[permalink] [raw]
Subject: [PATCH 04/13] mm/x86: Change pXd_huge() behavior to exclude swap entries

From: Peter Xu <[email protected]>

This patch partly reverts below commits:

3a194f3f8ad0 ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry")
cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage")

Right now, pXd_huge() definition across kernel is unclear. We have two
groups that think differently on swap entries:

- x86/sparc: Allow pXd_huge() to accept swap entries
- all the rest: Doesn't allow pXd_huge() to accept swap entries

This is so confusing. Since the sparc helpers seem to be added in 2016,
which is after x86's (2015), so sparc could have followed a trend. x86
proposed such swap handling in 2015 to resolve hugetlb swap entries hit in
GUP, but now GUP guards swap entries with !pXd_present() in all layers so
we should be safe.

We should define this API properly, one way or another, rather than keep
them defined differently across archs.

Gut feeling tells me that pXd_huge() shouldn't include swap entries, and it
turns out that I am not the only one thinking so, the question was raised
when the current pmd_huge() for x86 was proposed by Ville Syrjälä:

https://lore.kernel.org/all/[email protected]/

I might also be missing something obvious, but why is it even necessary
to treat PRESENT==0+PSE==0 as a huge entry?

It is also questioned when Jason Gunthorpe reviewed the other patchset on
swap entry handlings:

https://lore.kernel.org/all/[email protected]/

Revert its meaning back to original. It shouldn't have any functional
change as we should be ready with guards on !pXd_present() explicitly
everywhere.

Note that I also dropped the "#if CONFIG_PGTABLE_LEVELS > 2", it was there
probably because it was breaking things when 3a194f3f8ad0 was proposed,
according to the report here:

https://lore.kernel.org/all/[email protected]/

Now we shouldn't need that.

Instead of reverting to _PAGE_PSE raw check, leverage pXd_leaf().

Cc: Naoya Horiguchi <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/mm/hugetlbpage.c | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..8362953a24ce 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -20,29 +20,19 @@
#include <asm/elf.h>

/*
- * pmd_huge() returns 1 if @pmd is hugetlb related entry, that is normal
- * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
- * Otherwise, returns 0.
+ * pmd_huge() returns 1 if @pmd is hugetlb related entry.
*/
int pmd_huge(pmd_t pmd)
{
- return !pmd_none(pmd) &&
- (pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
+ return pmd_leaf(pmd);
}

/*
- * pud_huge() returns 1 if @pud is hugetlb related entry, that is normal
- * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
- * Otherwise, returns 0.
+ * pud_huge() returns 1 if @pud is hugetlb related entry.
*/
int pud_huge(pud_t pud)
{
-#if CONFIG_PGTABLE_LEVELS > 2
- return !pud_none(pud) &&
- (pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
-#else
- return 0;
-#endif
+ return pud_leaf(pud);
}

#ifdef CONFIG_HUGETLB_PAGE
--
2.44.0


2024-03-13 21:48:45

by Peter Xu

[permalink] [raw]
Subject: [PATCH 02/13] mm/gup: Cache p4d in follow_p4d_mask()

From: Peter Xu <[email protected]>

Add a variable to cache p4d in follow_p4d_mask(). It's a good practise to
make sure all the following checks will have a consistent view of the entry.

Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..69a777f4fc5c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -772,16 +772,17 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
unsigned int flags,
struct follow_page_context *ctx)
{
- p4d_t *p4d;
+ p4d_t *p4dp, p4d;

- p4d = p4d_offset(pgdp, address);
- if (p4d_none(*p4d))
+ p4dp = p4d_offset(pgdp, address);
+ p4d = READ_ONCE(*p4dp);
+ if (p4d_none(p4d))
return no_page_table(vma, flags);
- BUILD_BUG_ON(p4d_huge(*p4d));
- if (unlikely(p4d_bad(*p4d)))
+ BUILD_BUG_ON(p4d_huge(p4d));
+ if (unlikely(p4d_bad(p4d)))
return no_page_table(vma, flags);

- return follow_pud_mask(vma, address, p4d, flags, ctx);
+ return follow_pud_mask(vma, address, p4dp, flags, ctx);
}

/**
--
2.44.0


2024-03-13 21:49:09

by Peter Xu

[permalink] [raw]
Subject: [PATCH 06/13] mm/arm: Use macros to define pmd/pud helpers

From: Peter Xu <[email protected]>

It's already confusing that ARM 2-level v.s. 3-level defines SECT bit
differently on pmd/puds. Always use a macro which is much clearer.

Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm/include/asm/pgtable-2level.h | 4 ++--
arch/arm/include/asm/pgtable-3level-hwdef.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 4 ++--
3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index b0a262566eb9..4245c2e74720 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -213,8 +213,8 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)

#define pmd_pfn(pmd) (__phys_to_pfn(pmd_val(pmd) & PHYS_MASK))

-#define pmd_leaf(pmd) (pmd_val(pmd) & 2)
-#define pmd_bad(pmd) (pmd_val(pmd) & 2)
+#define pmd_leaf(pmd) (pmd_val(pmd) & PMD_TYPE_SECT)
+#define pmd_bad(pmd) pmd_leaf(pmd)
#define pmd_present(pmd) (pmd_val(pmd))

#define copy_pmd(pmdpd,pmdps) \
diff --git a/arch/arm/include/asm/pgtable-3level-hwdef.h b/arch/arm/include/asm/pgtable-3level-hwdef.h
index 2f35b4eddaa8..e7b666cf0060 100644
--- a/arch/arm/include/asm/pgtable-3level-hwdef.h
+++ b/arch/arm/include/asm/pgtable-3level-hwdef.h
@@ -14,6 +14,7 @@
* + Level 1/2 descriptor
* - common
*/
+#define PUD_TABLE_BIT (_AT(pmdval_t, 1) << 1)
#define PMD_TYPE_MASK (_AT(pmdval_t, 3) << 0)
#define PMD_TYPE_FAULT (_AT(pmdval_t, 0) << 0)
#define PMD_TYPE_TABLE (_AT(pmdval_t, 3) << 0)
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 4b1d9eb3908a..e7aecbef75c9 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -112,7 +112,7 @@
#ifndef __ASSEMBLY__

#define pud_none(pud) (!pud_val(pud))
-#define pud_bad(pud) (!(pud_val(pud) & 2))
+#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
#define pud_present(pud) (pud_val(pud))
#define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_TABLE)
@@ -137,7 +137,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
return __va(pud_val(pud) & PHYS_MASK & (s32)PAGE_MASK);
}

-#define pmd_bad(pmd) (!(pmd_val(pmd) & 2))
+#define pmd_bad(pmd) (!(pmd_val(pmd) & PMD_TABLE_BIT))

#define copy_pmd(pmdpd,pmdps) \
do { \
--
2.44.0


2024-03-13 21:49:45

by Peter Xu

[permalink] [raw]
Subject: [PATCH 08/13] mm/arm64: Merge pXd_huge() and pXd_leaf() definitions

From: Peter Xu <[email protected]>

Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly
differently. Redefine the pXd_huge() with pXd_leaf().

There used to be two traps for old aarch64 definitions over these APIs that
I found when reading the code around, they're:

(1) 4797ec2dc83a ("arm64: fix pud_huge() for 2-level pagetables")
(2) 23bc8f69f0ec ("arm64: mm: fix p?d_leaf()")

Define pXd_huge() with the current pXd_leaf() will make sure (2) isn't a
problem (on PROT_NONE checks). To make sure it also works for (1), we move
over the __PAGETABLE_PMD_FOLDED check to pud_leaf(), allowing it to
constantly returning "false" for 2-level pgtables, which looks even safer
to cover both now.

Cc: Muchun Song <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 4 ++++
arch/arm64/mm/hugetlbpage.c | 8 ++------
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 401087e8a43d..14d24c357c7a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -704,7 +704,11 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!pud_table(pud))
#define pud_present(pud) pte_present(pud_pte(pud))
+#ifndef __PAGETABLE_PMD_FOLDED
#define pud_leaf(pud) (pud_present(pud) && !pud_table(pud))
+#else
+#define pud_leaf(pud) false
+#endif
#define pud_valid(pud) pte_valid(pud_pte(pud))
#define pud_user(pud) pte_user(pud_pte(pud))
#define pud_user_exec(pud) pte_user_exec(pud_pte(pud))
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 0f0e10bb0a95..1234bbaef5bf 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -81,16 +81,12 @@ bool arch_hugetlb_migration_supported(struct hstate *h)

int pmd_huge(pmd_t pmd)
{
- return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT);
+ return pmd_leaf(pmd);
}

int pud_huge(pud_t pud)
{
-#ifndef __PAGETABLE_PMD_FOLDED
- return pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
-#else
- return 0;
-#endif
+ return pud_leaf(pud);
}

static int find_num_contig(struct mm_struct *mm, unsigned long addr,
--
2.44.0


2024-03-13 21:49:48

by Peter Xu

[permalink] [raw]
Subject: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

From: Peter Xu <[email protected]>

PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
it will keep returning false.

As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
mappings.

The goal should be that we will have one API pXd_leaf() to detect all kinds
of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.

This helps to simplify a follow up patch to drop pXd_huge() treewide.

NOTE: *_leaf() definition need to be moved before the inclusion of
asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it.

[1] https://lore.kernel.org/r/[email protected]

Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
.../include/asm/book3s/64/pgtable-4k.h | 14 ++--------
arch/powerpc/include/asm/book3s/64/pgtable.h | 27 +++++++++----------
2 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
index 48f21820afe2..92545981bb49 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
@@ -8,22 +8,12 @@
#ifdef CONFIG_HUGETLB_PAGE
static inline int pmd_huge(pmd_t pmd)
{
- /*
- * leaf pte for huge page
- */
- if (radix_enabled())
- return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
- return 0;
+ return pmd_leaf(pmd);
}

static inline int pud_huge(pud_t pud)
{
- /*
- * leaf pte for huge page
- */
- if (radix_enabled())
- return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
- return 0;
+ return pud_leaf(pud);
}

/*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index df66dce8306f..fd7180fded75 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -262,6 +262,18 @@ extern unsigned long __kernel_io_end;

extern struct page *vmemmap;
extern unsigned long pci_io_base;
+
+#define pmd_leaf pmd_leaf
+static inline bool pmd_leaf(pmd_t pmd)
+{
+ return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
+}
+
+#define pud_leaf pud_leaf
+static inline bool pud_leaf(pud_t pud)
+{
+ return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
#endif /* __ASSEMBLY__ */

#include <asm/book3s/64/hash.h>
@@ -1436,20 +1448,5 @@ static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long new_va
return false;
}

-/*
- * Like pmd_huge(), but works regardless of config options
- */
-#define pmd_leaf pmd_leaf
-static inline bool pmd_leaf(pmd_t pmd)
-{
- return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
-}
-
-#define pud_leaf pud_leaf
-static inline bool pud_leaf(pud_t pud)
-{
- return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
-}
-
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
--
2.44.0


2024-03-13 21:50:04

by Peter Xu

[permalink] [raw]
Subject: [PATCH 10/13] mm/gup: Merge pXd huge mapping checks

From: Peter Xu <[email protected]>

Huge mapping checks in GUP are slightly redundant and can be simplified.

pXd_huge() now is the same as pXd_leaf(). pmd_trans_huge() and
pXd_devmap() should both imply pXd_leaf(). Time to merge them into one.

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 802987281b2f..e2415e9789bc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3005,8 +3005,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
if (!pmd_present(pmd))
return 0;

- if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
- pmd_devmap(pmd))) {
+ if (unlikely(pmd_leaf(pmd))) {
/* See gup_pte_range() */
if (pmd_protnone(pmd))
return 0;
@@ -3043,7 +3042,7 @@ static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned lo
next = pud_addr_end(addr, end);
if (unlikely(!pud_present(pud)))
return 0;
- if (unlikely(pud_huge(pud) || pud_devmap(pud))) {
+ if (unlikely(pud_leaf(pud))) {
if (!gup_huge_pud(pud, pudp, addr, next, flags,
pages, nr))
return 0;
@@ -3096,7 +3095,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
return;
- if (unlikely(pgd_huge(pgd))) {
+ if (unlikely(pgd_leaf(pgd))) {
if (!gup_huge_pgd(pgd, pgdp, addr, next, flags,
pages, nr))
return;
--
2.44.0


2024-03-13 21:50:04

by Peter Xu

[permalink] [raw]
Subject: [PATCH 05/13] mm/sparc: Change pXd_huge() behavior to exclude swap entries

From: Peter Xu <[email protected]>

Please refer to the previous patch on the reasoning for x86. Now sparc is
the only architecture that will allow swap entries to be reported as
pXd_huge(). After this patch, all architectures should forbid swap entries
in pXd_huge().

Cc: David S. Miller <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/sparc/mm/hugetlbpage.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index b432500c13a5..d31c2cec35c9 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -409,14 +409,12 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,

int pmd_huge(pmd_t pmd)
{
- return !pmd_none(pmd) &&
- (pmd_val(pmd) & (_PAGE_VALID|_PAGE_PMD_HUGE)) != _PAGE_VALID;
+ return pmd_leaf(pmd);;
}

int pud_huge(pud_t pud)
{
- return !pud_none(pud) &&
- (pud_val(pud) & (_PAGE_VALID|_PAGE_PUD_HUGE)) != _PAGE_VALID;
+ return pud_leaf(pud);
}

static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
--
2.44.0


2024-03-13 21:50:16

by Peter Xu

[permalink] [raw]
Subject: [PATCH 11/13] mm/treewide: Replace pXd_huge() with pXd_leaf()

From: Peter Xu <[email protected]>

Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
reuse it. Luckily, pXd_huge() isn't widely used.

Signed-off-by: Peter Xu <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 2 +-
arch/arm64/include/asm/pgtable.h | 2 +-
arch/arm64/mm/hugetlbpage.c | 4 ++--
arch/loongarch/mm/hugetlbpage.c | 2 +-
arch/mips/mm/tlb-r4k.c | 2 +-
arch/powerpc/mm/pgtable_64.c | 6 +++---
arch/x86/mm/pgtable.c | 4 ++--
mm/gup.c | 4 ++--
mm/hmm.c | 2 +-
mm/memory.c | 2 +-
10 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index e7aecbef75c9..9e3c44f0aea2 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
#define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))

#define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd))
-#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
+#define pmd_thp_or_huge(pmd) (pmd_leaf(pmd) || pmd_trans_huge(pmd))

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_trans_huge(pmd) (pmd_val(pmd) && !pmd_table(pmd))
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 14d24c357c7a..c4efa47fed5f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -512,7 +512,7 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
return pmd;
}

-#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
+#define pmd_thp_or_huge(pmd) (pmd_leaf(pmd) || pmd_trans_huge(pmd))

#define pmd_write(pmd) pte_write(pmd_pte(pmd))

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 1234bbaef5bf..f494fc31201f 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -321,7 +321,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (sz != PUD_SIZE && pud_none(pud))
return NULL;
/* hugepage or swap? */
- if (pud_huge(pud) || !pud_present(pud))
+ if (pud_leaf(pud) || !pud_present(pud))
return (pte_t *)pudp;
/* table; check the next level */

@@ -333,7 +333,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (!(sz == PMD_SIZE || sz == CONT_PMD_SIZE) &&
pmd_none(pmd))
return NULL;
- if (pmd_huge(pmd) || !pmd_present(pmd))
+ if (pmd_leaf(pmd) || !pmd_present(pmd))
return (pte_t *)pmdp;

if (sz == CONT_PTE_SIZE)
diff --git a/arch/loongarch/mm/hugetlbpage.c b/arch/loongarch/mm/hugetlbpage.c
index 1e76fcb83093..a4e78e74aa21 100644
--- a/arch/loongarch/mm/hugetlbpage.c
+++ b/arch/loongarch/mm/hugetlbpage.c
@@ -64,7 +64,7 @@ uint64_t pmd_to_entrylo(unsigned long pmd_val)
{
uint64_t val;
/* PMD as PTE. Must be huge page */
- if (!pmd_huge(__pmd(pmd_val)))
+ if (!pmd_leaf(__pmd(pmd_val)))
panic("%s", __func__);

val = pmd_val ^ _PAGE_HUGE;
diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
index 4106084e57d7..76f3b9c0a9f0 100644
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -326,7 +326,7 @@ void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
idx = read_c0_index();
#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
/* this could be a huge page */
- if (pmd_huge(*pmdp)) {
+ if (pmd_leaf(*pmdp)) {
unsigned long lo;
write_c0_pagemask(PM_HUGE_MASK);
ptep = (pte_t *)pmdp;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 9b99113cb51a..6621cfc3baf8 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -102,7 +102,7 @@ struct page *p4d_page(p4d_t p4d)
{
if (p4d_leaf(p4d)) {
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
- VM_WARN_ON(!p4d_huge(p4d));
+ VM_WARN_ON(!p4d_leaf(p4d));
return pte_page(p4d_pte(p4d));
}
return virt_to_page(p4d_pgtable(p4d));
@@ -113,7 +113,7 @@ struct page *pud_page(pud_t pud)
{
if (pud_leaf(pud)) {
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
- VM_WARN_ON(!pud_huge(pud));
+ VM_WARN_ON(!pud_leaf(pud));
return pte_page(pud_pte(pud));
}
return virt_to_page(pud_pgtable(pud));
@@ -132,7 +132,7 @@ struct page *pmd_page(pmd_t pmd)
* enabled so these checks can't be used.
*/
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
- VM_WARN_ON(!(pmd_leaf(pmd) || pmd_huge(pmd)));
+ VM_WARN_ON(!pmd_leaf(pmd));
return pte_page(pmd_pte(pmd));
}
return virt_to_page(pmd_page_vaddr(pmd));
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ff690ddc2334..d74f0814e086 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -731,7 +731,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
return 0;

/* Bail out if we are we on a populated non-leaf entry: */
- if (pud_present(*pud) && !pud_huge(*pud))
+ if (pud_present(*pud) && !pud_leaf(*pud))
return 0;

set_pte((pte_t *)pud, pfn_pte(
@@ -760,7 +760,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
}

/* Bail out if we are we on a populated non-leaf entry: */
- if (pmd_present(*pmd) && !pmd_huge(*pmd))
+ if (pmd_present(*pmd) && !pmd_leaf(*pmd))
return 0;

set_pte((pte_t *)pmd, pfn_pte(
diff --git a/mm/gup.c b/mm/gup.c
index e2415e9789bc..8e04a04ef138 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -778,7 +778,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
p4d = READ_ONCE(*p4dp);
if (!p4d_present(p4d))
return no_page_table(vma, flags);
- BUILD_BUG_ON(p4d_huge(p4d));
+ BUILD_BUG_ON(p4d_leaf(p4d));
if (unlikely(p4d_bad(p4d)))
return no_page_table(vma, flags);

@@ -3070,7 +3070,7 @@ static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned lo
next = p4d_addr_end(addr, end);
if (!p4d_present(p4d))
return 0;
- BUILD_BUG_ON(p4d_huge(p4d));
+ BUILD_BUG_ON(p4d_leaf(p4d));
if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
P4D_SHIFT, next, flags, pages, nr))
diff --git a/mm/hmm.c b/mm/hmm.c
index c95b9ec5d95f..93aebd9cc130 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
return hmm_vma_walk_hole(start, end, -1, walk);
}

- if (pud_huge(pud) && pud_devmap(pud)) {
+ if (pud_leaf(pud) && pud_devmap(pud)) {
unsigned long i, npages, pfn;
unsigned int required_fault;
unsigned long *hmm_pfns;
diff --git a/mm/memory.c b/mm/memory.c
index 904f70b99498..baee777dcd2d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2765,7 +2765,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
unsigned long next;
int err = 0;

- BUG_ON(pud_huge(*pud));
+ BUG_ON(pud_leaf(*pud));

if (create) {
pmd = pmd_alloc_track(mm, pud, addr, mask);
--
2.44.0


2024-03-13 21:51:31

by Peter Xu

[permalink] [raw]
Subject: [PATCH 07/13] mm/arm: Redefine pmd_huge() with pmd_leaf()

From: Peter Xu <[email protected]>

Most of the archs already define these two APIs the same way. ARM is more
complicated in two aspects:

- For pXd_huge() it's always checking against !PXD_TABLE_BIT, while for
pXd_leaf() it's always checking against PXD_TYPE_SECT.

- SECT/TABLE bits are defined differently on 2-level v.s. 3-level ARM
pgtables, which makes the whole thing even harder to follow.

Luckily, the second complexity should be hidden by the pmd_leaf()
implementation against 2-level v.s. 3-level headers. Invoke pmd_leaf()
directly for pmd_huge(), to remove the first part of complexity. This
prepares to drop pXd_huge() API globally.

When at it, drop the obsolete comments - it's outdated.

Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm/mm/hugetlbpage.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/arm/mm/hugetlbpage.c b/arch/arm/mm/hugetlbpage.c
index dd7a0277c5c0..c2fa643f6bb5 100644
--- a/arch/arm/mm/hugetlbpage.c
+++ b/arch/arm/mm/hugetlbpage.c
@@ -18,11 +18,6 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>

-/*
- * On ARM, huge pages are backed by pmd's rather than pte's, so we do a lot
- * of type casting from pmd_t * to pte_t *.
- */
-
int pud_huge(pud_t pud)
{
return 0;
@@ -30,5 +25,5 @@ int pud_huge(pud_t pud)

int pmd_huge(pmd_t pmd)
{
- return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT);
+ return pmd_leaf(pmd);
}
--
2.44.0


2024-03-13 21:51:48

by Peter Xu

[permalink] [raw]
Subject: [PATCH 12/13] mm/treewide: Remove pXd_huge()

From: Peter Xu <[email protected]>

This API is not used anymore, drop it for the whole tree.

Signed-off-by: Peter Xu <[email protected]>
---
arch/arm/mm/Makefile | 1 -
arch/arm/mm/hugetlbpage.c | 29 -------------------
arch/arm64/mm/hugetlbpage.c | 10 -------
arch/loongarch/mm/hugetlbpage.c | 10 -------
arch/mips/include/asm/pgtable-32.h | 2 +-
arch/mips/include/asm/pgtable-64.h | 2 +-
arch/mips/mm/hugetlbpage.c | 10 -------
arch/parisc/mm/hugetlbpage.c | 11 -------
.../include/asm/book3s/64/pgtable-4k.h | 10 -------
.../include/asm/book3s/64/pgtable-64k.h | 25 ----------------
arch/powerpc/include/asm/nohash/pgtable.h | 10 -------
arch/riscv/mm/hugetlbpage.c | 10 -------
arch/s390/mm/hugetlbpage.c | 10 -------
arch/sh/mm/hugetlbpage.c | 10 -------
arch/sparc/mm/hugetlbpage.c | 10 -------
arch/x86/mm/hugetlbpage.c | 16 ----------
include/linux/hugetlb.h | 24 ---------------
17 files changed, 2 insertions(+), 198 deletions(-)
delete mode 100644 arch/arm/mm/hugetlbpage.c

diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 71b858c9b10c..1779e12db085 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -21,7 +21,6 @@ KASAN_SANITIZE_physaddr.o := n
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o

obj-$(CONFIG_ALIGNMENT_TRAP) += alignment.o
-obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_ARM_PV_FIXUP) += pv-fixup-asm.o

obj-$(CONFIG_CPU_ABRT_NOMMU) += abort-nommu.o
diff --git a/arch/arm/mm/hugetlbpage.c b/arch/arm/mm/hugetlbpage.c
deleted file mode 100644
index c2fa643f6bb5..000000000000
--- a/arch/arm/mm/hugetlbpage.c
+++ /dev/null
@@ -1,29 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * arch/arm/mm/hugetlbpage.c
- *
- * Copyright (C) 2012 ARM Ltd.
- *
- * Based on arch/x86/include/asm/hugetlb.h and Bill Carson's patches
- */
-
-#include <linux/init.h>
-#include <linux/fs.h>
-#include <linux/mm.h>
-#include <linux/hugetlb.h>
-#include <linux/pagemap.h>
-#include <linux/err.h>
-#include <linux/sysctl.h>
-#include <asm/mman.h>
-#include <asm/tlb.h>
-#include <asm/tlbflush.h>
-
-int pud_huge(pud_t pud)
-{
- return 0;
-}
-
-int pmd_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);
-}
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f494fc31201f..ca58210d6c07 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -79,16 +79,6 @@ bool arch_hugetlb_migration_supported(struct hstate *h)
}
#endif

-int pmd_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);
-}
-
-int pud_huge(pud_t pud)
-{
- return pud_leaf(pud);
-}
-
static int find_num_contig(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, size_t *pgsize)
{
diff --git a/arch/loongarch/mm/hugetlbpage.c b/arch/loongarch/mm/hugetlbpage.c
index a4e78e74aa21..12222c56cb59 100644
--- a/arch/loongarch/mm/hugetlbpage.c
+++ b/arch/loongarch/mm/hugetlbpage.c
@@ -50,16 +50,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
return (pte_t *) pmd;
}

-int pmd_huge(pmd_t pmd)
-{
- return (pmd_val(pmd) & _PAGE_HUGE) != 0;
-}
-
-int pud_huge(pud_t pud)
-{
- return (pud_val(pud) & _PAGE_HUGE) != 0;
-}
-
uint64_t pmd_to_entrylo(unsigned long pmd_val)
{
uint64_t val;
diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
index 0e196650f4f4..92b7591aac2a 100644
--- a/arch/mips/include/asm/pgtable-32.h
+++ b/arch/mips/include/asm/pgtable-32.h
@@ -129,7 +129,7 @@ static inline int pmd_none(pmd_t pmd)
static inline int pmd_bad(pmd_t pmd)
{
#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
- /* pmd_huge(pmd) but inline */
+ /* pmd_leaf(pmd) but inline */
if (unlikely(pmd_val(pmd) & _PAGE_HUGE))
return 0;
#endif
diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 20ca48c1b606..7c28510b3768 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -245,7 +245,7 @@ static inline int pmd_none(pmd_t pmd)
static inline int pmd_bad(pmd_t pmd)
{
#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
- /* pmd_huge(pmd) but inline */
+ /* pmd_leaf(pmd) but inline */
if (unlikely(pmd_val(pmd) & _PAGE_HUGE))
return 0;
#endif
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index 7eaff5b07873..0b9e15555b59 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -57,13 +57,3 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
}
return (pte_t *) pmd;
}
-
-int pmd_huge(pmd_t pmd)
-{
- return (pmd_val(pmd) & _PAGE_HUGE) != 0;
-}
-
-int pud_huge(pud_t pud)
-{
- return (pud_val(pud) & _PAGE_HUGE) != 0;
-}
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index a9f7e21f6656..0356199bd9e7 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -180,14 +180,3 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
}
return changed;
}
-
-
-int pmd_huge(pmd_t pmd)
-{
- return 0;
-}
-
-int pud_huge(pud_t pud)
-{
- return 0;
-}
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
index 92545981bb49..baf934578c3a 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
@@ -6,16 +6,6 @@
*/
#ifndef __ASSEMBLY__
#ifdef CONFIG_HUGETLB_PAGE
-static inline int pmd_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);
-}
-
-static inline int pud_huge(pud_t pud)
-{
- return pud_leaf(pud);
-}
-
/*
* With radix , we have hugepage ptes in the pud and pmd entries. We don't
* need to setup hugepage directory for them. Our pte and page directory format
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
index 2fce3498b000..579a7153857f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
@@ -4,31 +4,6 @@

#ifndef __ASSEMBLY__
#ifdef CONFIG_HUGETLB_PAGE
-/*
- * We have PGD_INDEX_SIZ = 12 and PTE_INDEX_SIZE = 8, so that we can have
- * 16GB hugepage pte in PGD and 16MB hugepage pte at PMD;
- *
- * Defined in such a way that we can optimize away code block at build time
- * if CONFIG_HUGETLB_PAGE=n.
- *
- * returns true for pmd migration entries, THP, devmap, hugetlb
- * But compile time dependent on CONFIG_HUGETLB_PAGE
- */
-static inline int pmd_huge(pmd_t pmd)
-{
- /*
- * leaf pte for huge page
- */
- return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
-}
-
-static inline int pud_huge(pud_t pud)
-{
- /*
- * leaf pte for huge page
- */
- return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
-}

/*
* With 64k page size, we have hugepage ptes in the pgd and pmd entries. We don't
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index 427db14292c9..f5f39d4f03c8 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -351,16 +351,6 @@ static inline int hugepd_ok(hugepd_t hpd)
#endif
}

-static inline int pmd_huge(pmd_t pmd)
-{
- return 0;
-}
-
-static inline int pud_huge(pud_t pud)
-{
- return 0;
-}
-
#define is_hugepd(hpd) (hugepd_ok(hpd))
#endif

diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 29c7606414d2..9a4bc4bd2a01 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -399,16 +399,6 @@ static bool is_napot_size(unsigned long size)

#endif /*CONFIG_RISCV_ISA_SVNAPOT*/

-int pud_huge(pud_t pud)
-{
- return pud_leaf(pud);
-}
-
-int pmd_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);
-}
-
static bool __hugetlb_valid_size(unsigned long size)
{
if (size == HPAGE_SIZE)
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index c2e8242bd15d..ca43b6fce71c 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -233,16 +233,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return (pte_t *) pmdp;
}

-int pmd_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);
-}
-
-int pud_huge(pud_t pud)
-{
- return pud_leaf(pud);
-}
-
bool __init arch_hugetlb_valid_size(unsigned long size)
{
if (MACHINE_HAS_EDAT1 && size == PMD_SIZE)
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 6cb0ad73dbb9..ff209b55285a 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -70,13 +70,3 @@ pte_t *huge_pte_offset(struct mm_struct *mm,

return pte;
}
-
-int pmd_huge(pmd_t pmd)
-{
- return 0;
-}
-
-int pud_huge(pud_t pud)
-{
- return 0;
-}
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index d31c2cec35c9..8ed5bdf95d25 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -407,16 +407,6 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
return entry;
}

-int pmd_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);;
-}
-
-int pud_huge(pud_t pud)
-{
- return pud_leaf(pud);
-}
-
static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
unsigned long addr)
{
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8362953a24ce..dab6db288e5b 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -19,22 +19,6 @@
#include <asm/tlbflush.h>
#include <asm/elf.h>

-/*
- * pmd_huge() returns 1 if @pmd is hugetlb related entry.
- */
-int pmd_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);
-}
-
-/*
- * pud_huge() returns 1 if @pud is hugetlb related entry.
- */
-int pud_huge(pud_t pud)
-{
- return pud_leaf(pud);
-}
-
#ifdef CONFIG_HUGETLB_PAGE
static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
unsigned long addr, unsigned long len,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 77b30a8c6076..300de33c6fde 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -272,13 +272,9 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
void hugetlb_vma_lock_release(struct kref *kref);
-
-int pmd_huge(pmd_t pmd);
-int pud_huge(pud_t pud);
long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot,
unsigned long cp_flags);
-
bool is_hugetlb_entry_migration(pte_t pte);
bool is_hugetlb_entry_hwpoisoned(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -399,16 +395,6 @@ static inline void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
{
}

-static inline int pmd_huge(pmd_t pmd)
-{
- return 0;
-}
-
-static inline int pud_huge(pud_t pud)
-{
- return 0;
-}
-
static inline int is_hugepage_only_range(struct mm_struct *mm,
unsigned long addr, unsigned long len)
{
@@ -493,16 +479,6 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }

#endif /* !CONFIG_HUGETLB_PAGE */
-/*
- * hugepages at page global directory. If arch support
- * hugepages at pgd level, they need to define this.
- */
-#ifndef pgd_huge
-#define pgd_huge(x) 0
-#endif
-#ifndef p4d_huge
-#define p4d_huge(x) 0
-#endif

#ifndef pgd_write
static inline int pgd_write(pgd_t pgd)
--
2.44.0


2024-03-13 21:52:14

by Peter Xu

[permalink] [raw]
Subject: [PATCH 13/13] mm: Document pXd_leaf() API

From: Peter Xu <[email protected]>

There's one small section already, but since we're going to remove
pXd_huge(), that comment may start to obsolete.

Rewrite that section with more information, hopefully with that the API is
crystal clear on what it implies.

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/pgtable.h | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..6b0d222a7fad 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1770,11 +1770,25 @@ typedef unsigned int pgtbl_mod_mask;
#endif

/*
- * p?d_leaf() - true if this entry is a final mapping to a physical address.
- * This differs from p?d_huge() by the fact that they are always available (if
- * the architecture supports large pages at the appropriate level) even
- * if CONFIG_HUGETLB_PAGE is not defined.
- * Only meaningful when called on a valid entry.
+ * pXd_leaf() is the API to check whether a pgtable entry is a huge page
+ * mapping. It should work globally across all archs, without any
+ * dependency on CONFIG_* options. For architectures that do not support
+ * huge mappings on specific levels, below fallbacks will be used.
+ *
+ * A leaf pgtable entry should always imply the following:
+ *
+ * - It is a "present" entry. IOW, before using this API, please check it
+ * with pXd_present() first. NOTE: it may not always mean the "present
+ * bit" is set. For example, PROT_NONE entries are always "present".
+ *
+ * - It should _never_ be a swap entry of any type. Above "present" check
+ * should have guarded this, but let's be crystal clear on this.
+ *
+ * - It should contain a huge PFN, which points to a huge page larger than
+ * PAGE_SIZE of the platform. The PFN format isn't important here.
+ *
+ * - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
+ * pXd_devmap(), or hugetlb mappings).
*/
#ifndef pgd_leaf
#define pgd_leaf(x) false
--
2.44.0


2024-03-14 08:48:59

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()



Le 13/03/2024 à 22:47, [email protected] a écrit :
> From: Peter Xu <[email protected]>
>
> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> it will keep returning false.
>
> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> mappings.
>
> The goal should be that we will have one API pXd_leaf() to detect all kinds
> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.

All kinds of huge mappings ?

pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
those huge pages.

>
> This helps to simplify a follow up patch to drop pXd_huge() treewide.
>
> NOTE: *_leaf() definition need to be moved before the inclusion of
> asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it.
>
> [1] https://lore.kernel.org/r/[email protected]
>
> Cc: Michael Ellerman <[email protected]>
> Cc: Nicholas Piggin <[email protected]>
> Cc: Christophe Leroy <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: "Naveen N. Rao" <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Xu <[email protected]>
> ---
> .../include/asm/book3s/64/pgtable-4k.h | 14 ++--------
> arch/powerpc/include/asm/book3s/64/pgtable.h | 27 +++++++++----------
> 2 files changed, 14 insertions(+), 27 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> index 48f21820afe2..92545981bb49 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> @@ -8,22 +8,12 @@
> #ifdef CONFIG_HUGETLB_PAGE
> static inline int pmd_huge(pmd_t pmd)
> {
> - /*
> - * leaf pte for huge page
> - */
> - if (radix_enabled())
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> - return 0;
> + return pmd_leaf(pmd);
> }
>
> static inline int pud_huge(pud_t pud)
> {
> - /*
> - * leaf pte for huge page
> - */
> - if (radix_enabled())
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> - return 0;
> + return pud_leaf(pud);
> }
>
> /*
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index df66dce8306f..fd7180fded75 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -262,6 +262,18 @@ extern unsigned long __kernel_io_end;
>
> extern struct page *vmemmap;
> extern unsigned long pci_io_base;
> +
> +#define pmd_leaf pmd_leaf
> +static inline bool pmd_leaf(pmd_t pmd)
> +{
> + return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> +}
> +
> +#define pud_leaf pud_leaf
> +static inline bool pud_leaf(pud_t pud)
> +{
> + return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> +}
> #endif /* __ASSEMBLY__ */
>
> #include <asm/book3s/64/hash.h>
> @@ -1436,20 +1448,5 @@ static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long new_va
> return false;
> }
>
> -/*
> - * Like pmd_huge(), but works regardless of config options
> - */
> -#define pmd_leaf pmd_leaf
> -static inline bool pmd_leaf(pmd_t pmd)
> -{
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -#define pud_leaf pud_leaf
> -static inline bool pud_leaf(pud_t pud)
> -{
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> #endif /* __ASSEMBLY__ */
> #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */

2024-03-14 08:52:03

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH 11/13] mm/treewide: Replace pXd_huge() with pXd_leaf()



Le 13/03/2024 à 22:47, [email protected] a écrit :
> From: Peter Xu <[email protected]>
>
> Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
> reuse it. Luckily, pXd_huge() isn't widely used.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/arm/include/asm/pgtable-3level.h | 2 +-
> arch/arm64/include/asm/pgtable.h | 2 +-
> arch/arm64/mm/hugetlbpage.c | 4 ++--
> arch/loongarch/mm/hugetlbpage.c | 2 +-
> arch/mips/mm/tlb-r4k.c | 2 +-
> arch/powerpc/mm/pgtable_64.c | 6 +++---
> arch/x86/mm/pgtable.c | 4 ++--
> mm/gup.c | 4 ++--
> mm/hmm.c | 2 +-
> mm/memory.c | 2 +-
> 10 files changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> index e7aecbef75c9..9e3c44f0aea2 100644
> --- a/arch/arm/include/asm/pgtable-3level.h
> +++ b/arch/arm/include/asm/pgtable-3level.h
> @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
> #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
>
> #define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd))
> -#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
> +#define pmd_thp_or_huge(pmd) (pmd_leaf(pmd) || pmd_trans_huge(pmd))

Previous patch said pmd_trans_huge() implies pmd_leaf().

Or is that only for GUP ?

>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define pmd_trans_huge(pmd) (pmd_val(pmd) && !pmd_table(pmd))


> diff --git a/mm/hmm.c b/mm/hmm.c
> index c95b9ec5d95f..93aebd9cc130 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
> return hmm_vma_walk_hole(start, end, -1, walk);
> }
>
> - if (pud_huge(pud) && pud_devmap(pud)) {
> + if (pud_leaf(pud) && pud_devmap(pud)) {

Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?

> unsigned long i, npages, pfn;
> unsigned int required_fault;
> unsigned long *hmm_pfns;


2024-03-14 08:57:13

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH 12/13] mm/treewide: Remove pXd_huge()



Le 13/03/2024 à 22:47, [email protected] a écrit :
> From: Peter Xu <[email protected]>
>
> This API is not used anymore, drop it for the whole tree.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/arm/mm/Makefile | 1 -
> arch/arm/mm/hugetlbpage.c | 29 -------------------
> arch/arm64/mm/hugetlbpage.c | 10 -------
> arch/loongarch/mm/hugetlbpage.c | 10 -------
> arch/mips/include/asm/pgtable-32.h | 2 +-
> arch/mips/include/asm/pgtable-64.h | 2 +-
> arch/mips/mm/hugetlbpage.c | 10 -------
> arch/parisc/mm/hugetlbpage.c | 11 -------
> .../include/asm/book3s/64/pgtable-4k.h | 10 -------
> .../include/asm/book3s/64/pgtable-64k.h | 25 ----------------
> arch/powerpc/include/asm/nohash/pgtable.h | 10 -------
> arch/riscv/mm/hugetlbpage.c | 10 -------
> arch/s390/mm/hugetlbpage.c | 10 -------
> arch/sh/mm/hugetlbpage.c | 10 -------
> arch/sparc/mm/hugetlbpage.c | 10 -------
> arch/x86/mm/hugetlbpage.c | 16 ----------
> include/linux/hugetlb.h | 24 ---------------
> 17 files changed, 2 insertions(+), 198 deletions(-)
> delete mode 100644 arch/arm/mm/hugetlbpage.c
>

> diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
> index 0e196650f4f4..92b7591aac2a 100644
> --- a/arch/mips/include/asm/pgtable-32.h
> +++ b/arch/mips/include/asm/pgtable-32.h
> @@ -129,7 +129,7 @@ static inline int pmd_none(pmd_t pmd)
> static inline int pmd_bad(pmd_t pmd)
> {
> #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> - /* pmd_huge(pmd) but inline */
> + /* pmd_leaf(pmd) but inline */

Shouldn't this comment have been changed in patch 11 ?

> if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Unlike pmd_huge() which is an outline function, pmd_leaf() is a macro so
it could be used here instead of open coping.

> return 0;
> #endif
> diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
> index 20ca48c1b606..7c28510b3768 100644
> --- a/arch/mips/include/asm/pgtable-64.h
> +++ b/arch/mips/include/asm/pgtable-64.h
> @@ -245,7 +245,7 @@ static inline int pmd_none(pmd_t pmd)
> static inline int pmd_bad(pmd_t pmd)
> {
> #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> - /* pmd_huge(pmd) but inline */
> + /* pmd_leaf(pmd) but inline */

Same

> if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Same

> return 0;
> #endif

> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> index 2fce3498b000..579a7153857f 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> @@ -4,31 +4,6 @@
>
> #ifndef __ASSEMBLY__
> #ifdef CONFIG_HUGETLB_PAGE
> -/*
> - * We have PGD_INDEX_SIZ = 12 and PTE_INDEX_SIZE = 8, so that we can have
> - * 16GB hugepage pte in PGD and 16MB hugepage pte at PMD;
> - *
> - * Defined in such a way that we can optimize away code block at build time
> - * if CONFIG_HUGETLB_PAGE=n.
> - *
> - * returns true for pmd migration entries, THP, devmap, hugetlb
> - * But compile time dependent on CONFIG_HUGETLB_PAGE
> - */

Should we keep this comment somewhere for documentation ?

> -static inline int pmd_huge(pmd_t pmd)
> -{
> - /*
> - * leaf pte for huge page
> - */
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -static inline int pud_huge(pud_t pud)
> -{
> - /*
> - * leaf pte for huge page
> - */
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
>
> /*
> * With 64k page size, we have hugepage ptes in the pgd and pmd entries. We don't

2024-03-14 12:54:14

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>
>
> Le 13/03/2024 à 22:47, [email protected] a écrit :
> > From: Peter Xu <[email protected]>
> >
> > PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> > constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
> > it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> > it will keep returning false.
> >
> > As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> > such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
> > pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> > mappings.
> >
> > The goal should be that we will have one API pXd_leaf() to detect all kinds
> > of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
> > pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>
> All kinds of huge mappings ?
>
> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
> those huge pages.

Ah yes, I should always mention this is in the context of leaf huge pages
only. Are the examples you provided all fall into hugepd category? If so
I can reword the commit message, as:

As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to
create such huge mappings for 4K hash MMUs. Meanwhile, the major
powerpc hugetlb pgtable walker __find_linux_pte() already used
pXd_leaf() to check leaf hugetlb mappings.

The goal should be that we will have one API pXd_leaf() to detect
all kinds of huge mappings except hugepd. AFAICT we need to use
the pXd_leaf() impl (rather than pXd_huge() ones) to make sure
ie. THPs on hash MMU will also return true.

Does this look good to you?

Thanks,

--
Peter Xu


2024-03-14 12:59:48

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 11/13] mm/treewide: Replace pXd_huge() with pXd_leaf()

On Thu, Mar 14, 2024 at 08:50:20AM +0000, Christophe Leroy wrote:
>
>
> Le 13/03/2024 à 22:47, [email protected] a écrit :
> > From: Peter Xu <[email protected]>
> >
> > Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
> > reuse it. Luckily, pXd_huge() isn't widely used.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > arch/arm/include/asm/pgtable-3level.h | 2 +-
> > arch/arm64/include/asm/pgtable.h | 2 +-
> > arch/arm64/mm/hugetlbpage.c | 4 ++--
> > arch/loongarch/mm/hugetlbpage.c | 2 +-
> > arch/mips/mm/tlb-r4k.c | 2 +-
> > arch/powerpc/mm/pgtable_64.c | 6 +++---
> > arch/x86/mm/pgtable.c | 4 ++--
> > mm/gup.c | 4 ++--
> > mm/hmm.c | 2 +-
> > mm/memory.c | 2 +-
> > 10 files changed, 15 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> > index e7aecbef75c9..9e3c44f0aea2 100644
> > --- a/arch/arm/include/asm/pgtable-3level.h
> > +++ b/arch/arm/include/asm/pgtable-3level.h
> > @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
> > #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
> >
> > #define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd))
> > -#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
> > +#define pmd_thp_or_huge(pmd) (pmd_leaf(pmd) || pmd_trans_huge(pmd))
>
> Previous patch said pmd_trans_huge() implies pmd_leaf().

Ah here I remember I kept this arm definition there because I think we
should add a patch to drop pmd_thp_or_huge() completely. If you won't mind
I can add one more patch instead of doing it here. Then I keep this patch
purely as a replacement patch without further changes on arch-cleanups.

>
> Or is that only for GUP ?

I think it should apply to all.

>
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > #define pmd_trans_huge(pmd) (pmd_val(pmd) && !pmd_table(pmd))
>
>
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index c95b9ec5d95f..93aebd9cc130 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
> > return hmm_vma_walk_hole(start, end, -1, walk);
> > }
> >
> > - if (pud_huge(pud) && pud_devmap(pud)) {
> > + if (pud_leaf(pud) && pud_devmap(pud)) {
>
> Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?

This is an extra safety check that I didn't remove. Devmap used separate
bits even though I'm not clear on why. It should still imply a leaf though.

Thanks,

>
> > unsigned long i, npages, pfn;
> > unsigned int required_fault;
> > unsigned long *hmm_pfns;
>
>

--
Peter Xu


2024-03-14 13:12:13

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()



Le 14/03/2024 à 13:53, Peter Xu a écrit :
> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>>
>>
>> Le 13/03/2024 à 22:47, [email protected] a écrit :
>>> From: Peter Xu <[email protected]>
>>>
>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
>>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
>>> it will keep returning false.
>>>
>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>> mappings.
>>>
>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
>>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>
>> All kinds of huge mappings ?
>>
>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>> those huge pages.
>
> Ah yes, I should always mention this is in the context of leaf huge pages
> only. Are the examples you provided all fall into hugepd category? If so
> I can reword the commit message, as:

On powerpc 8xx, only the 8M huge pages fall into the hugepd case.

The 512k hugepages are at PTE level, they are handled more or less like
CONT_PTE on ARM. see function set_huge_pte_at() for more context.

You can also look at pte_leaf_size() and pgd_leaf_size().

By the way pgd_leaf_size() looks odd because it is called only when
pgd_leaf_size() returns true, which never happens for 8M pages.

>
> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to
> create such huge mappings for 4K hash MMUs. Meanwhile, the major
> powerpc hugetlb pgtable walker __find_linux_pte() already used
> pXd_leaf() to check leaf hugetlb mappings.
>
> The goal should be that we will have one API pXd_leaf() to detect
> all kinds of huge mappings except hugepd. AFAICT we need to use
> the pXd_leaf() impl (rather than pXd_huge() ones) to make sure
> ie. THPs on hash MMU will also return true.
>
> Does this look good to you?
>
> Thanks,
>

2024-03-14 14:19:29

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 12/13] mm/treewide: Remove pXd_huge()

On Thu, Mar 14, 2024 at 08:56:59AM +0000, Christophe Leroy wrote:
>
>
> Le 13/03/2024 à 22:47, [email protected] a écrit :
> > From: Peter Xu <[email protected]>
> >
> > This API is not used anymore, drop it for the whole tree.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > arch/arm/mm/Makefile | 1 -
> > arch/arm/mm/hugetlbpage.c | 29 -------------------
> > arch/arm64/mm/hugetlbpage.c | 10 -------
> > arch/loongarch/mm/hugetlbpage.c | 10 -------
> > arch/mips/include/asm/pgtable-32.h | 2 +-
> > arch/mips/include/asm/pgtable-64.h | 2 +-
> > arch/mips/mm/hugetlbpage.c | 10 -------
> > arch/parisc/mm/hugetlbpage.c | 11 -------
> > .../include/asm/book3s/64/pgtable-4k.h | 10 -------
> > .../include/asm/book3s/64/pgtable-64k.h | 25 ----------------
> > arch/powerpc/include/asm/nohash/pgtable.h | 10 -------
> > arch/riscv/mm/hugetlbpage.c | 10 -------
> > arch/s390/mm/hugetlbpage.c | 10 -------
> > arch/sh/mm/hugetlbpage.c | 10 -------
> > arch/sparc/mm/hugetlbpage.c | 10 -------
> > arch/x86/mm/hugetlbpage.c | 16 ----------
> > include/linux/hugetlb.h | 24 ---------------
> > 17 files changed, 2 insertions(+), 198 deletions(-)
> > delete mode 100644 arch/arm/mm/hugetlbpage.c
> >
>
> > diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
> > index 0e196650f4f4..92b7591aac2a 100644
> > --- a/arch/mips/include/asm/pgtable-32.h
> > +++ b/arch/mips/include/asm/pgtable-32.h
> > @@ -129,7 +129,7 @@ static inline int pmd_none(pmd_t pmd)
> > static inline int pmd_bad(pmd_t pmd)
> > {
> > #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> > - /* pmd_huge(pmd) but inline */
> > + /* pmd_leaf(pmd) but inline */
>
> Shouldn't this comment have been changed in patch 11 ?

IMHO it's fine to be here, as this is the patch to finally drop _huge().
Patch 11 only converts the callers to use _leaf()s. So this comment is
still valid until this patch, because this patch removes that definition.

>
> > if (unlikely(pmd_val(pmd) & _PAGE_HUGE))
>
> Unlike pmd_huge() which is an outline function, pmd_leaf() is a macro so
> it could be used here instead of open coping.

I worry it will break things as pmd_leaf() can sometimes be defined after
arch *pgtable.h headers. So I avoided touching it except what I think I'm
confident. I had a feeling it's inlined just because of a similar reason
for the old _huge().

>
> > return 0;
> > #endif
> > diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
> > index 20ca48c1b606..7c28510b3768 100644
> > --- a/arch/mips/include/asm/pgtable-64.h
> > +++ b/arch/mips/include/asm/pgtable-64.h
> > @@ -245,7 +245,7 @@ static inline int pmd_none(pmd_t pmd)
> > static inline int pmd_bad(pmd_t pmd)
> > {
> > #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> > - /* pmd_huge(pmd) but inline */
> > + /* pmd_leaf(pmd) but inline */
>
> Same
>
> > if (unlikely(pmd_val(pmd) & _PAGE_HUGE))
>
> Same
>
> > return 0;
> > #endif
>
> > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> > index 2fce3498b000..579a7153857f 100644
> > --- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> > @@ -4,31 +4,6 @@
> >
> > #ifndef __ASSEMBLY__
> > #ifdef CONFIG_HUGETLB_PAGE
> > -/*
> > - * We have PGD_INDEX_SIZ = 12 and PTE_INDEX_SIZE = 8, so that we can have
> > - * 16GB hugepage pte in PGD and 16MB hugepage pte at PMD;
> > - *
> > - * Defined in such a way that we can optimize away code block at build time
> > - * if CONFIG_HUGETLB_PAGE=n.
> > - *
> > - * returns true for pmd migration entries, THP, devmap, hugetlb
> > - * But compile time dependent on CONFIG_HUGETLB_PAGE
> > - */
>
> Should we keep this comment somewhere for documentation ?

The 2nd/3rd paragraphs are definitely obsolete, so should be dropped.

OTOH, I'm not sure how much that will help if e.g. I move that over to
pmd_leaf(): a check over cpu_to_be64(_PAGE_PTE) is an implementation as
simple as it could be to explain itself with even no comment to me..

I also don't fully digest why that 1st paragraph discusses PGD entries: for
example, there's no pgd_huge() defined. It may not mean that the comment
is wrong, perhaps it means that I may lack some knowledge around this area
on Power..

Would you suggest how I should move paragraph 1 (and help to explain what
it is describing)? Or maybe we can provide a separate patch for Power's
huge page sizes but posted separately (and very possibly I'm not the best
candidate then..).

>
> > -static inline int pmd_huge(pmd_t pmd)
> > -{
> > - /*
> > - * leaf pte for huge page
> > - */
> > - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> > -}
> > -
> > -static inline int pud_huge(pud_t pud)
> > -{
> > - /*
> > - * leaf pte for huge page
> > - */
> > - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> > -}
> >
> > /*
> > * With 64k page size, we have hugepage ptes in the pgd and pmd entries. We don't

--
Peter Xu


2024-03-18 16:15:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
>
>
> Le 14/03/2024 à 13:53, Peter Xu a écrit :
> > On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
> >>
> >>
> >> Le 13/03/2024 à 22:47, [email protected] a écrit :
> >>> From: Peter Xu <[email protected]>
> >>>
> >>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> >>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
> >>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> >>> it will keep returning false.
> >>>
> >>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> >>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
> >>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> >>> mappings.
> >>>
> >>> The goal should be that we will have one API pXd_leaf() to detect all kinds
> >>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
> >>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
> >>
> >> All kinds of huge mappings ?
> >>
> >> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
> >> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
> >> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
> >> those huge pages.
> >
> > Ah yes, I should always mention this is in the context of leaf huge pages
> > only. Are the examples you provided all fall into hugepd category? If so
> > I can reword the commit message, as:
>
> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
>
> The 512k hugepages are at PTE level, they are handled more or less like
> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>
> You can also look at pte_leaf_size() and pgd_leaf_size().

IMHO leaf should return false if the thing is pointing to a next level
page table, even if that next level is fully populated with contiguous
pages.

This seems more aligned with the contig page direction that hugepd
should be moved over to..

> By the way pgd_leaf_size() looks odd because it is called only when
> pgd_leaf_size() returns true, which never happens for 8M pages.

Like this, you should reach the actual final leaf that the HW will
load and leaf_size() should say it is greater size than the current
table level. Other levels should return 0.

If necessary the core MM code should deal with this by iterating over
adjacent tables.

Jason

2024-03-18 16:16:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 11/13] mm/treewide: Replace pXd_huge() with pXd_leaf()

On Thu, Mar 14, 2024 at 08:59:22AM -0400, Peter Xu wrote:

> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
> > > return hmm_vma_walk_hole(start, end, -1, walk);
> > > }
> > >
> > > - if (pud_huge(pud) && pud_devmap(pud)) {
> > > + if (pud_leaf(pud) && pud_devmap(pud)) {
> >
> > Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?
>
> This is an extra safety check that I didn't remove. Devmap used separate
> bits even though I'm not clear on why. It should still imply a leaf though.

Yes, something is very wrong if devmap is true on non-leaf..

Jason

2024-03-19 23:13:03

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()



Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
> On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 14/03/2024 à 13:53, Peter Xu a écrit :
>>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 13/03/2024 à 22:47, [email protected] a écrit :
>>>>> From: Peter Xu <[email protected]>
>>>>>
>>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
>>>>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
>>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
>>>>> it will keep returning false.
>>>>>
>>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>>>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
>>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>>>> mappings.
>>>>>
>>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
>>>>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
>>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>>>
>>>> All kinds of huge mappings ?
>>>>
>>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>>>> those huge pages.
>>>
>>> Ah yes, I should always mention this is in the context of leaf huge pages
>>> only. Are the examples you provided all fall into hugepd category? If so
>>> I can reword the commit message, as:
>>
>> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
>>
>> The 512k hugepages are at PTE level, they are handled more or less like
>> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>>
>> You can also look at pte_leaf_size() and pgd_leaf_size().
>
> IMHO leaf should return false if the thing is pointing to a next level
> page table, even if that next level is fully populated with contiguous
> pages.
>
> This seems more aligned with the contig page direction that hugepd
> should be moved over to..

Should hugepd be moved to the contig page direction, really ?

Would it be acceptable that a 8M hugepage requires 2048 contig entries
in 2 page tables, when the hugepd allows a single entry ? Would it be
acceptable performancewise ?

>
>> By the way pgd_leaf_size() looks odd because it is called only when
>> pgd_leaf_size() returns true, which never happens for 8M pages.
>
> Like this, you should reach the actual final leaf that the HW will
> load and leaf_size() should say it is greater size than the current
> table level. Other levels should return 0.
>
> If necessary the core MM code should deal with this by iterating over
> adjacent tables.
>
> Jason

2024-03-19 23:27:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

On Tue, Mar 19, 2024 at 11:07:08PM +0000, Christophe Leroy wrote:
>
>
> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
> > On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
> >>
> >>
> >> Le 14/03/2024 à 13:53, Peter Xu a écrit :
> >>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
> >>>>
> >>>>
> >>>> Le 13/03/2024 à 22:47, [email protected] a écrit :
> >>>>> From: Peter Xu <[email protected]>
> >>>>>
> >>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> >>>>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
> >>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> >>>>> it will keep returning false.
> >>>>>
> >>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> >>>>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
> >>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> >>>>> mappings.
> >>>>>
> >>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
> >>>>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
> >>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
> >>>>
> >>>> All kinds of huge mappings ?
> >>>>
> >>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
> >>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
> >>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
> >>>> those huge pages.
> >>>
> >>> Ah yes, I should always mention this is in the context of leaf huge pages
> >>> only. Are the examples you provided all fall into hugepd category? If so
> >>> I can reword the commit message, as:
> >>
> >> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
> >>
> >> The 512k hugepages are at PTE level, they are handled more or less like
> >> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
> >>
> >> You can also look at pte_leaf_size() and pgd_leaf_size().
> >
> > IMHO leaf should return false if the thing is pointing to a next level
> > page table, even if that next level is fully populated with contiguous
> > pages.
> >
> > This seems more aligned with the contig page direction that hugepd
> > should be moved over to..
>
> Should hugepd be moved to the contig page direction, really ?

Sure? Is there any downside for the reading side to do so?

> Would it be acceptable that a 8M hugepage requires 2048 contig entries
> in 2 page tables, when the hugepd allows a single entry ?

? I thought we agreed the only difference would be that something new
is needed to merge the two identical sibling page tables into one, ie
you pay 2x the page table memory if that isn't fixed. That is write
side only change and I imagine it could be done with a single PPC
special API.

Honestly not totally sure that is a big deal, it is already really
memory inefficient compared to every other arch's huge page by needing
the child page table in the first place.

> Would it be acceptable performancewise ?

Isn't this particular PPC sub platform ancient? Are there current real
users that are going to have hugetlbfs special code and care about
this performance detail on a 6.20 era kernel?

In today's world wouldn't it be performance better if these platforms
could support THP by aligning to the contig API instead of being
special?

Am I wrong to question why we are polluting the core code for this
special optimization?

Jason

2024-03-20 06:16:58

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()



Le 20/03/2024 à 00:26, Jason Gunthorpe a écrit :
> On Tue, Mar 19, 2024 at 11:07:08PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
>>> On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 14/03/2024 à 13:53, Peter Xu a écrit :
>>>>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>>>>>>
>>>>>>
>>>>>> Le 13/03/2024 à 22:47, [email protected] a écrit :
>>>>>>> From: Peter Xu <[email protected]>
>>>>>>>
>>>>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
>>>>>>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1],
>>>>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
>>>>>>> it will keep returning false.
>>>>>>>
>>>>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>>>>>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb
>>>>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>>>>>> mappings.
>>>>>>>
>>>>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
>>>>>>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than
>>>>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>>>>>
>>>>>> All kinds of huge mappings ?
>>>>>>
>>>>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>>>>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>>>>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>>>>>> those huge pages.
>>>>>
>>>>> Ah yes, I should always mention this is in the context of leaf huge pages
>>>>> only. Are the examples you provided all fall into hugepd category? If so
>>>>> I can reword the commit message, as:
>>>>
>>>> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
>>>>
>>>> The 512k hugepages are at PTE level, they are handled more or less like
>>>> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>>>>
>>>> You can also look at pte_leaf_size() and pgd_leaf_size().
>>>
>>> IMHO leaf should return false if the thing is pointing to a next level
>>> page table, even if that next level is fully populated with contiguous
>>> pages.
>>>
>>> This seems more aligned with the contig page direction that hugepd
>>> should be moved over to..
>>
>> Should hugepd be moved to the contig page direction, really ?
>
> Sure? Is there any downside for the reading side to do so?

Probably not.

>
>> Would it be acceptable that a 8M hugepage requires 2048 contig entries
>> in 2 page tables, when the hugepd allows a single entry ?
>
> ? I thought we agreed the only difference would be that something new
> is needed to merge the two identical sibling page tables into one, ie
> you pay 2x the page table memory if that isn't fixed. That is write
> side only change and I imagine it could be done with a single PPC
> special API.
>
> Honestly not totally sure that is a big deal, it is already really
> memory inefficient compared to every other arch's huge page by needing
> the child page table in the first place.
>
>> Would it be acceptable performancewise ?
>
> Isn't this particular PPC sub platform ancient? Are there current real
> users that are going to have hugetlbfs special code and care about
> this performance detail on a 6.20 era kernel?

Ancient yes but still widely in use and with the emergence of voice over
IP in Air Trafic Control, performance becomes more and more challenge
with those old boards that have another 10 years in front of them.

>
> In today's world wouldn't it be performance better if these platforms
> could support THP by aligning to the contig API instead of being
> special?

Indeed, if we can promote THP that'd be even better.

>
> Am I wrong to question why we are polluting the core code for this
> special optimization?

At the first place that was to get a close fit between hardware
pagetable topology and linux pagetable topology. But obviously we
already stepped back for 512k pages, so let's go one more step aside and
do similar with 8M pages.

I'll give it a try and see how it goes.

Christophe

2024-03-20 16:18:01

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
> At the first place that was to get a close fit between hardware
> pagetable topology and linux pagetable topology. But obviously we
> already stepped back for 512k pages, so let's go one more step aside and
> do similar with 8M pages.
>
> I'll give it a try and see how it goes.

So you're talking about 8M only for 8xx, am I right?

There seem to be other PowerPC systems use hugepd. Is it possible that we
convert all hugepd into cont_pte form?

Thanks,

--
Peter Xu


2024-03-20 17:46:38

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()



Le 20/03/2024 à 17:09, Peter Xu a écrit :
> On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
>> At the first place that was to get a close fit between hardware
>> pagetable topology and linux pagetable topology. But obviously we
>> already stepped back for 512k pages, so let's go one more step aside and
>> do similar with 8M pages.
>>
>> I'll give it a try and see how it goes.
>
> So you're talking about 8M only for 8xx, am I right?

Yes I am.

>
> There seem to be other PowerPC systems use hugepd. Is it possible that we
> convert all hugepd into cont_pte form?

Indeed.

Seems like we have hugepd for book3s/64 and for nohash.

For book3s I don't know, may Aneesh can answer.

For nohash I think it should be possible because TLB misses are handled
by software. Even the e6500 which has a hardware tablewalk falls back on
software walk when it is a hugepage IIUC.

Christophe

2024-03-20 20:24:40

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

On Wed, Mar 20, 2024 at 05:40:39PM +0000, Christophe Leroy wrote:
>
>
> Le 20/03/2024 à 17:09, Peter Xu a écrit :
> > On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
> >> At the first place that was to get a close fit between hardware
> >> pagetable topology and linux pagetable topology. But obviously we
> >> already stepped back for 512k pages, so let's go one more step aside and
> >> do similar with 8M pages.
> >>
> >> I'll give it a try and see how it goes.
> >
> > So you're talking about 8M only for 8xx, am I right?
>
> Yes I am.
>
> >
> > There seem to be other PowerPC systems use hugepd. Is it possible that we
> > convert all hugepd into cont_pte form?
>
> Indeed.
>
> Seems like we have hugepd for book3s/64 and for nohash.
>
> For book3s I don't know, may Aneesh can answer.
>
> For nohash I think it should be possible because TLB misses are handled
> by software. Even the e6500 which has a hardware tablewalk falls back on
> software walk when it is a hugepage IIUC.

It'll be great if I can get some answer here, and then I know the path for
hugepd in general. I don't want to add any new code into core mm to
something destined to fade away soon.

One option for me is I can check a macro of hugepd existance, so all new
code will only work when hugepd is not supported on such arch. However
that'll start to make some PowerPC systems special (which I still tried
hard to avoid, if that wasn't proved in the past..), meanwhile we'll also
need to keep some generic-mm paths (that I can already remove along with
the new code) only for these hugepd systems. But it's still okay to me,
it'll be just a matter of when to drop those codes, sooner or later.

Thanks,

--
Peter Xu