2024-03-25 16:55:58

by Christophe Leroy

[permalink] [raw]
Subject: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

This series reimplements hugepages with hugepd on powerpc 8xx.

Unlike most architectures, powerpc 8xx HW requires a two-level
pagetable topology for all page sizes. So a leaf PMD-contig approach
is not feasible as such.

Possible sizes are 4k, 16k, 512k and 8M.

First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
must point to a single entry level-2 page table. Until now that was
done using hugepd. This series changes it to use standard page tables
where the entry is replicated 1024 times on each of the two pagetables
refered by the two associated PMD entries for that 8M page.

At the moment it has to look into each helper to know if the
hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
a lower size. I hope this can me handled by core-mm in the future.

There are probably several ways to implement stuff, so feedback is
very welcome.

Christophe Leroy (8):
mm: Provide pagesize to pmd_populate()
mm: Provide page size to pte_alloc_huge()
mm: Provide pmd to pte_leaf_size()
mm: Provide mm_struct and address to huge_ptep_get()
powerpc/mm: Allow hugepages without hugepd
powerpc/8xx: Fix size given to set_huge_pte_at()
powerpc/8xx: Remove support for 8M pages
powerpc/8xx: Add back support for 8M pages using contiguous PTE
entries

arch/arm64/include/asm/hugetlb.h | 2 +-
arch/arm64/include/asm/pgtable.h | 2 +-
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/parisc/mm/hugetlbpage.c | 2 +-
arch/powerpc/Kconfig | 1 -
arch/powerpc/include/asm/hugetlb.h | 13 +++-
.../include/asm/nohash/32/hugetlb-8xx.h | 54 ++++++++---------
arch/powerpc/include/asm/nohash/32/pgalloc.h | 2 +
arch/powerpc/include/asm/nohash/32/pte-8xx.h | 59 +++++++++++++------
arch/powerpc/include/asm/nohash/pgtable.h | 12 ++--
arch/powerpc/include/asm/page.h | 5 --
arch/powerpc/include/asm/pgtable.h | 1 +
arch/powerpc/kernel/head_8xx.S | 10 +---
arch/powerpc/mm/hugetlbpage.c | 23 +++++++-
arch/powerpc/mm/nohash/8xx.c | 46 +++++++--------
arch/powerpc/mm/pgtable.c | 26 +++++---
arch/powerpc/mm/pgtable_32.c | 2 +-
arch/powerpc/platforms/Kconfig.cputype | 2 +
arch/riscv/include/asm/pgtable.h | 2 +-
arch/riscv/mm/hugetlbpage.c | 2 +-
arch/sh/mm/hugetlbpage.c | 2 +-
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/mm/hugetlbpage.c | 4 +-
fs/hugetlbfs/inode.c | 2 +-
fs/proc/task_mmu.c | 8 +--
fs/userfaultfd.c | 2 +-
include/asm-generic/hugetlb.h | 2 +-
include/linux/hugetlb.h | 4 +-
include/linux/mm.h | 12 ++--
include/linux/pgtable.h | 2 +-
include/linux/swapops.h | 2 +-
kernel/events/core.c | 2 +-
mm/damon/vaddr.c | 6 +-
mm/filemap.c | 2 +-
mm/gup.c | 2 +-
mm/hmm.c | 2 +-
mm/hugetlb.c | 46 +++++++--------
mm/internal.h | 2 +-
mm/memory-failure.c | 2 +-
mm/memory.c | 19 +++---
mm/mempolicy.c | 2 +-
mm/migrate.c | 4 +-
mm/mincore.c | 2 +-
mm/pgalloc-track.h | 2 +-
mm/userfaultfd.c | 6 +-
45 files changed, 229 insertions(+), 180 deletions(-)

--
2.43.0



2024-03-25 17:36:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> This series reimplements hugepages with hugepd on powerpc 8xx.
>
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
>
> Possible sizes are 4k, 16k, 512k and 8M.
>
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> refered by the two associated PMD entries for that 8M page.
>
> At the moment it has to look into each helper to know if the
> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> a lower size. I hope this can me handled by core-mm in the future.
>
> There are probably several ways to implement stuff, so feedback is
> very welcome.

I thought it looks pretty good!

Thanks,
Jason

2024-03-25 20:43:48

by Christophe Leroy

[permalink] [raw]
Subject: [RFC PATCH 7/8] powerpc/8xx: Remove support for 8M pages

Remove support for 8M pages in order to stop using hugepd.

Support for 8M pages will be added back later using the same
approach as for 512k pages, in extenso using contiguous page
entries in the regular page table.

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/Kconfig | 1 -
.../include/asm/nohash/32/hugetlb-8xx.h | 30 -------------------
arch/powerpc/include/asm/nohash/32/pte-8xx.h | 14 +--------
arch/powerpc/include/asm/nohash/pgtable.h | 4 ---
arch/powerpc/include/asm/page.h | 5 ----
arch/powerpc/kernel/head_8xx.S | 9 +-----
arch/powerpc/mm/hugetlbpage.c | 3 --
arch/powerpc/mm/nohash/8xx.c | 28 ++---------------
arch/powerpc/mm/nohash/tlb.c | 3 --
arch/powerpc/platforms/Kconfig.cputype | 2 ++
10 files changed, 7 insertions(+), 92 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a68b9e637eda..74c038cf770c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,7 +135,6 @@ config PPC
select ARCH_HAS_DMA_MAP_DIRECT if PPC_PSERIES
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
- select ARCH_HAS_HUGEPD if HUGETLB_PAGE
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_MEMBARRIER_SYNC_CORE
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 92df40c6cc6b..178ed9fdd353 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -4,42 +4,12 @@

#define PAGE_SHIFT_8M 23

-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
- BUG_ON(!hugepd_ok(hpd));
-
- return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
- return PAGE_SHIFT_8M;
-}
-
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
- unsigned int pdshift)
-{
- unsigned long idx = (addr & (SZ_4M - 1)) >> PAGE_SHIFT;
-
- return hugepd_page(hpd) + idx;
-}
-
static inline void flush_hugetlb_page(struct vm_area_struct *vma,
unsigned long vmaddr)
{
flush_tlb_page(vma, vmaddr);
}

-static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
-{
- *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
-}
-
-static inline void hugepd_populate_kernel(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
-{
- *hpdp = __hugepd(__pa(new) | _PMD_PRESENT | _PMD_PAGE_8M);
-}
-
static inline int check_and_get_huge_psize(int shift)
{
return shift_to_mmu_psize(shift);
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 07df6b664861..004d7e825af2 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -142,15 +142,6 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *pt
}
#define __ptep_set_access_flags __ptep_set_access_flags

-static inline unsigned long pgd_leaf_size(pgd_t pgd)
-{
- if (pgd_val(pgd) & _PMD_PAGE_8M)
- return SZ_8M;
- return SZ_4M;
-}
-
-#define pgd_leaf_size pgd_leaf_size
-
static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
{
pte_basic_t val = pte_val(pte);
@@ -171,14 +162,11 @@ static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
* For other page sizes, we have a single entry in the table.
*/
static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
-static int hugepd_ok(hugepd_t hpd);

static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge)
{
if (!huge)
return PAGE_SIZE / SZ_4K;
- else if (hugepd_ok(*((hugepd_t *)pmd)))
- return 1;
else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE))
return SZ_16K / SZ_4K;
else
@@ -198,7 +186,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p

for (i = 0; i < num; i += PAGE_SIZE / SZ_4K, new += PAGE_SIZE) {
*entry++ = new;
- if (IS_ENABLED(CONFIG_PPC_16K_PAGES) && num != 1) {
+ if (IS_ENABLED(CONFIG_PPC_16K_PAGES)) {
*entry++ = new;
*entry++ = new;
*entry++ = new;
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index ac3353f7f2ac..c4be7754e96f 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -343,12 +343,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
#ifdef CONFIG_ARCH_HAS_HUGEPD
static inline int hugepd_ok(hugepd_t hpd)
{
-#ifdef CONFIG_PPC_8xx
- return ((hpd_val(hpd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M);
-#else
/* We clear the top bit to indicate hugepd */
return (hpd_val(hpd) && (hpd_val(hpd) & PD_HUGE) == 0);
-#endif
}

#define is_hugepd(hpd) (hugepd_ok(hpd))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index e411e5a70ea3..018c3d55232c 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -293,13 +293,8 @@ static inline const void *pfn_to_kaddr(unsigned long pfn)
/*
* Some number of bits at the level of the page table that points to
* a hugepte are used to encode the size. This masks those bits.
- * On 8xx, HW assistance requires 4k alignment for the hugepte.
*/
-#ifdef CONFIG_PPC_8xx
-#define HUGEPD_SHIFT_MASK 0xfff
-#else
#define HUGEPD_SHIFT_MASK 0x3f
-#endif

#ifndef __ASSEMBLY__

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..b53af565b132 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -416,13 +416,11 @@ FixupDAR:/* Entry point for dcbx workaround. */
3:
lwz r11, (swapper_pg_dir-PAGE_OFFSET)@l(r11) /* Get the level 1 entry */
mtspr SPRN_MD_TWC, r11
- mtcrf 0x01, r11
mfspr r11, SPRN_MD_TWC
lwz r11, 0(r11) /* Get the pte */
- bt 28,200f /* bit 28 = Large page (8M) */
/* concat physical page address(r11) and page offset(r10) */
rlwimi r11, r10, 0, 32 - PAGE_SHIFT, 31
-201: lwz r11,0(r11)
+ lwz r11,0(r11)
/* Check if it really is a dcbx instruction. */
/* dcbt and dcbtst does not generate DTLB Misses/Errors,
* no need to include them here */
@@ -441,11 +439,6 @@ FixupDAR:/* Entry point for dcbx workaround. */
141: mfspr r10,SPRN_M_TW
b DARFixed /* Nope, go back to normal TLB processing */

-200:
- /* concat physical page address(r11) and page offset(r10) */
- rlwimi r11, r10, 0, 32 - PAGE_SHIFT_8M, 31
- b 201b
-
144: mfspr r10, SPRN_DSISR
rlwinm r10, r10,0,7,5 /* Clear store bit for buggy dcbst insn */
mtspr SPRN_DSISR, r10
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index db73ad845a2a..4e9fbd5b895d 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,9 +183,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (!hpdp)
return NULL;

- if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
- return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
-
BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));

if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr,
diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 70b4d807fda5..fc10e08bcb85 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -48,42 +48,22 @@ unsigned long p_block_mapped(phys_addr_t pa)
return 0;
}

-static pte_t __init *early_hugepd_alloc_kernel(hugepd_t *pmdp, unsigned long va)
-{
- if (hpd_val(*pmdp) == 0) {
- pte_t *ptep = memblock_alloc(sizeof(pte_basic_t), SZ_4K);
-
- if (!ptep)
- return NULL;
-
- hugepd_populate_kernel((hugepd_t *)pmdp, ptep, PAGE_SHIFT_8M);
- hugepd_populate_kernel((hugepd_t *)pmdp + 1, ptep, PAGE_SHIFT_8M);
- }
- return hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
-}
-
static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
pgprot_t prot, int psize, bool new)
{
pmd_t *pmdp = pmd_off_k(va);
pte_t *ptep;

- if (WARN_ON(psize != MMU_PAGE_512K && psize != MMU_PAGE_8M))
+ if (WARN_ON(psize != MMU_PAGE_512K))
return -EINVAL;

if (new) {
if (WARN_ON(slab_is_available()))
return -EINVAL;

- if (psize == MMU_PAGE_512K)
- ptep = early_pte_alloc_kernel(pmdp, va);
- else
- ptep = early_hugepd_alloc_kernel((hugepd_t *)pmdp, va);
+ ptep = early_pte_alloc_kernel(pmdp, va);
} else {
- if (psize == MMU_PAGE_512K)
- ptep = pte_offset_kernel(pmdp, va);
- else
- ptep = hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
+ ptep = pte_offset_kernel(pmdp, va);
}

if (WARN_ON(!ptep))
@@ -130,8 +110,6 @@ static void mmu_mapin_ram_chunk(unsigned long offset, unsigned long top,

for (; p < ALIGN(p, SZ_8M) && p < top; p += SZ_512K, v += SZ_512K)
__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
- for (; p < ALIGN_DOWN(top, SZ_8M) && p < top; p += SZ_8M, v += SZ_8M)
- __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
for (; p < ALIGN_DOWN(top, SZ_512K) && p < top; p += SZ_512K, v += SZ_512K)
__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);

diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 5ffa0af4328a..cb2afe39cee5 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -104,9 +104,6 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
[MMU_PAGE_512K] = {
.shift = 19,
},
- [MMU_PAGE_8M] = {
- .shift = 23,
- },
};
#endif

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index b2d8c0da2ad9..fa4bb096b3ae 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,6 +98,7 @@ config PPC_BOOK3S_64
select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
select ARCH_ENABLE_SPLIT_PMD_PTLOCK
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_HAS_HUGEPD if HUGETLB_PAGE
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_NUMA_BALANCING
select HAVE_MOVE_PMD
@@ -290,6 +291,7 @@ config PPC_BOOK3S
config PPC_E500
select FSL_EMB_PERFMON
bool
+ select ARCH_HAS_HUGEPD if HUGETLB_PAGE
select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL
--
2.43.0


2024-04-11 17:44:50

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> > This series reimplements hugepages with hugepd on powerpc 8xx.
> >
> > Unlike most architectures, powerpc 8xx HW requires a two-level
> > pagetable topology for all page sizes. So a leaf PMD-contig approach
> > is not feasible as such.
> >
> > Possible sizes are 4k, 16k, 512k and 8M.
> >
> > First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> > must point to a single entry level-2 page table. Until now that was
> > done using hugepd. This series changes it to use standard page tables
> > where the entry is replicated 1024 times on each of the two pagetables
> > refered by the two associated PMD entries for that 8M page.
> >
> > At the moment it has to look into each helper to know if the
> > hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> > a lower size. I hope this can me handled by core-mm in the future.
> >
> > There are probably several ways to implement stuff, so feedback is
> > very welcome.
>
> I thought it looks pretty good!

I second it.

I saw the discussions in patch 1. Christophe, I suppose you're exploring
the big hammer over hugepd, and perhaps went already with the 32bit pmd
solution for nohash/32bit challenge you mentioned?

I'm trying to position my next step; it seems like at least I should not
adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
or you're going to have an RFC soon then I can base on top?

Thanks,

--
Peter Xu


2024-04-12 14:08:15

by Christophe Leroy

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx



Le 11/04/2024 à 18:15, Peter Xu a écrit :
> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>>
>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>> is not feasible as such.
>>>
>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>
>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>> must point to a single entry level-2 page table. Until now that was
>>> done using hugepd. This series changes it to use standard page tables
>>> where the entry is replicated 1024 times on each of the two pagetables
>>> refered by the two associated PMD entries for that 8M page.
>>>
>>> At the moment it has to look into each helper to know if the
>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>>> a lower size. I hope this can me handled by core-mm in the future.
>>>
>>> There are probably several ways to implement stuff, so feedback is
>>> very welcome.
>>
>> I thought it looks pretty good!
>
> I second it.
>
> I saw the discussions in patch 1. Christophe, I suppose you're exploring
> the big hammer over hugepd, and perhaps went already with the 32bit pmd
> solution for nohash/32bit challenge you mentioned?
>
> I'm trying to position my next step; it seems like at least I should not
> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
> or you're going to have an RFC soon then I can base on top?

Depends on what you expect by "soon".

I sure won't be able to send any RFC before end of April.

Should be possible to have something during May.

Christophe

2024-04-12 14:31:10

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>
>
> Le 11/04/2024 à 18:15, Peter Xu a écrit :
> > On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>> This series reimplements hugepages with hugepd on powerpc 8xx.
> >>>
> >>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>> is not feasible as such.
> >>>
> >>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>
> >>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> >>> must point to a single entry level-2 page table. Until now that was
> >>> done using hugepd. This series changes it to use standard page tables
> >>> where the entry is replicated 1024 times on each of the two pagetables
> >>> refered by the two associated PMD entries for that 8M page.
> >>>
> >>> At the moment it has to look into each helper to know if the
> >>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> >>> a lower size. I hope this can me handled by core-mm in the future.
> >>>
> >>> There are probably several ways to implement stuff, so feedback is
> >>> very welcome.
> >>
> >> I thought it looks pretty good!
> >
> > I second it.
> >
> > I saw the discussions in patch 1. Christophe, I suppose you're exploring
> > the big hammer over hugepd, and perhaps went already with the 32bit pmd
> > solution for nohash/32bit challenge you mentioned?
> >
> > I'm trying to position my next step; it seems like at least I should not
> > adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
> > or you're going to have an RFC soon then I can base on top?
>
> Depends on what you expect by "soon".
>
> I sure won't be able to send any RFC before end of April.
>
> Should be possible to have something during May.

That's good enough, thanks. I'll see what is the best I can do.

Then do you think I can leave p4d/pgd leaves alone? Please check the other
email where I'm not sure whether pgd leaves ever existed for any of
PowerPC. That's so far what I plan to do, on teaching pgtable walkers
recognize pud and lower for all leaves. Then if Power can switch from
hugepd to this it should just work.

Even if pgd exists (then something I overlooked..), I'm wondering whether
we can push that downwards to be either pud/pmd (and looks like we all
agree p4d is never used on Power). That may involve some pgtable
operations moving from pgd level to lower, e.g. my pure imagination would
look like starting with:

#define PTE_INDEX_SIZE PTE_SHIFT
#define PMD_INDEX_SIZE 0
#define PUD_INDEX_SIZE 0
#define PGD_INDEX_SIZE (32 - PGDIR_SHIFT)

To:

#define PTE_INDEX_SIZE PTE_SHIFT
#define PMD_INDEX_SIZE (32 - PMD_SHIFT)
#define PUD_INDEX_SIZE 0
#define PGD_INDEX_SIZE 0

And the rest will need care too. I hope moving downward is easier
(e.g. the walker should always exist for lower levels but not always for
higher levels), but I actually have little idea on whether there's any
other implications, so please bare with me on stupid mistakes.

I just hope pgd leaves don't exist already, then I think it'll be simpler.

Thanks,

--
Peter Xu


2024-04-15 19:12:42

by Christophe Leroy

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx



Le 12/04/2024 à 16:30, Peter Xu a écrit :
> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>>>>
>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>> is not feasible as such.
>>>>>
>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>
>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>>>> must point to a single entry level-2 page table. Until now that was
>>>>> done using hugepd. This series changes it to use standard page tables
>>>>> where the entry is replicated 1024 times on each of the two pagetables
>>>>> refered by the two associated PMD entries for that 8M page.
>>>>>
>>>>> At the moment it has to look into each helper to know if the
>>>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>>>>> a lower size. I hope this can me handled by core-mm in the future.
>>>>>
>>>>> There are probably several ways to implement stuff, so feedback is
>>>>> very welcome.
>>>>
>>>> I thought it looks pretty good!
>>>
>>> I second it.
>>>
>>> I saw the discussions in patch 1. Christophe, I suppose you're exploring
>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>> solution for nohash/32bit challenge you mentioned?
>>>
>>> I'm trying to position my next step; it seems like at least I should not
>>> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
>>> or you're going to have an RFC soon then I can base on top?
>>
>> Depends on what you expect by "soon".
>>
>> I sure won't be able to send any RFC before end of April.
>>
>> Should be possible to have something during May.
>
> That's good enough, thanks. I'll see what is the best I can do.
>
> Then do you think I can leave p4d/pgd leaves alone? Please check the other
> email where I'm not sure whether pgd leaves ever existed for any of
> PowerPC. That's so far what I plan to do, on teaching pgtable walkers
> recognize pud and lower for all leaves. Then if Power can switch from
> hugepd to this it should just work.

Well, if I understand correctly, something with no PMD will include
<asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd
entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is.
pgd_populate(), p4d_populate(), pud_populate() are all "do { } while
(0)" and only pmd_populate exists. So only pmd_leaf() will exist in that
case.

And therefore including <asm-generic/pgtable-nop4d.h> means .... you
have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be
checked.


>
> Even if pgd exists (then something I overlooked..), I'm wondering whether
> we can push that downwards to be either pud/pmd (and looks like we all
> agree p4d is never used on Power). That may involve some pgtable
> operations moving from pgd level to lower, e.g. my pure imagination would
> look like starting with:

Yes I think there is no doubt that p4d is never used:

arch/powerpc/include/asm/book3s/32/pgtable.h:#include
<asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/book3s/64/pgtable.h:#include
<asm-generic/pgtable-nop4d.h>
arch/powerpc/include/asm/nohash/32/pgtable.h:#include
<asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include
<asm-generic/pgtable-nop4d.h>

But that means that PPC32 have pmd entries and PPC64 have p4d entries ...

>
> #define PTE_INDEX_SIZE PTE_SHIFT
> #define PMD_INDEX_SIZE 0
> #define PUD_INDEX_SIZE 0
> #define PGD_INDEX_SIZE (32 - PGDIR_SHIFT)
>
> To:
>
> #define PTE_INDEX_SIZE PTE_SHIFT
> #define PMD_INDEX_SIZE (32 - PMD_SHIFT)
> #define PUD_INDEX_SIZE 0
> #define PGD_INDEX_SIZE 0

But then you can't anymore have #define PTRS_PER_PMD 1 from
<asm-generic/pgtable-nop4d.h>

>
> And the rest will need care too. I hope moving downward is easier
> (e.g. the walker should always exist for lower levels but not always for
> higher levels), but I actually have little idea on whether there's any
> other implications, so please bare with me on stupid mistakes.
>
> I just hope pgd leaves don't exist already, then I think it'll be simpler.
>
> Thanks,
>

2024-04-16 10:58:46

by Christophe Leroy

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx



Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
>
>
> Le 12/04/2024 à 16:30, Peter Xu a écrit :
>> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>>
>>>
>>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
>>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>>>>>
>>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>>> is not feasible as such.
>>>>>>
>>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>>
>>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD
>>>>>> entries
>>>>>> must point to a single entry level-2 page table. Until now that was
>>>>>> done using hugepd. This series changes it to use standard page tables
>>>>>> where the entry is replicated 1024 times on each of the two
>>>>>> pagetables
>>>>>> refered by the two associated PMD entries for that 8M page.
>>>>>>
>>>>>> At the moment it has to look into each helper to know if the
>>>>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>>>>>> a lower size. I hope this can me handled by core-mm in the future.
>>>>>>
>>>>>> There are probably several ways to implement stuff, so feedback is
>>>>>> very welcome.
>>>>>
>>>>> I thought it looks pretty good!
>>>>
>>>> I second it.
>>>>
>>>> I saw the discussions in patch 1.  Christophe, I suppose you're
>>>> exploring
>>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>>> solution for nohash/32bit challenge you mentioned?
>>>>
>>>> I'm trying to position my next step; it seems like at least I should
>>>> not
>>>> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD
>>>> checks,
>>>> or you're going to have an RFC soon then I can base on top?
>>>
>>> Depends on what you expect by "soon".
>>>
>>> I sure won't be able to send any RFC before end of April.
>>>
>>> Should be possible to have something during May.
>>
>> That's good enough, thanks.  I'll see what is the best I can do.
>>
>> Then do you think I can leave p4d/pgd leaves alone?  Please check the
>> other
>> email where I'm not sure whether pgd leaves ever existed for any of
>> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
>> recognize pud and lower for all leaves.  Then if Power can switch from
>> hugepd to this it should just work.
>
> Well, if I understand correctly, something with no PMD will include
> <asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd
> entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is.
> pgd_populate(), p4d_populate(), pud_populate() are all "do { } while
> (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that
> case.
>
> And therefore including <asm-generic/pgtable-nop4d.h> means .... you
> have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be
> checked.
>
>
>>
>> Even if pgd exists (then something I overlooked..), I'm wondering whether
>> we can push that downwards to be either pud/pmd (and looks like we all
>> agree p4d is never used on Power).  That may involve some pgtable
>> operations moving from pgd level to lower, e.g. my pure imagination would
>> look like starting with:
>
> Yes I think there is no doubt that p4d is never used:
>
> arch/powerpc/include/asm/book3s/32/pgtable.h:#include
> <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/book3s/64/pgtable.h:#include
> <asm-generic/pgtable-nop4d.h>
> arch/powerpc/include/asm/nohash/32/pgtable.h:#include
> <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include
> <asm-generic/pgtable-nop4d.h>
>
> But that means that PPC32 have pmd entries and PPC64 have p4d entries ...
>
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    0
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
>>
>> To:
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    0
>
> But then you can't anymore have #define PTRS_PER_PMD 1 from
> <asm-generic/pgtable-nop4d.h>
>
>>
>> And the rest will need care too.  I hope moving downward is easier
>> (e.g. the walker should always exist for lower levels but not always for
>> higher levels), but I actually have little idea on whether there's any
>> other implications, so please bare with me on stupid mistakes.
>>
>> I just hope pgd leaves don't exist already, then I think it'll be
>> simpler.
>>
>> Thanks,
>>

Digging into asm-generic/pgtable-nopmd.h, I see a definition of
pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5
("riscv/mm: fix two page table check related issues")

So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf()
and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf() ?

Christophe

2024-04-16 19:45:23

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

On Tue, Apr 16, 2024 at 10:58:33AM +0000, Christophe Leroy wrote:
>
>
> Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
> >
> >
> > Le 12/04/2024 à 16:30, Peter Xu a écrit :
> >> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
> >>>
> >>>
> >>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
> >>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>>>>> This series reimplements hugepages with hugepd on powerpc 8xx.
> >>>>>>
> >>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>>>>> is not feasible as such.
> >>>>>>
> >>>>>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>>>>
> >>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD
> >>>>>> entries
> >>>>>> must point to a single entry level-2 page table. Until now that was
> >>>>>> done using hugepd. This series changes it to use standard page tables
> >>>>>> where the entry is replicated 1024 times on each of the two
> >>>>>> pagetables
> >>>>>> refered by the two associated PMD entries for that 8M page.
> >>>>>>
> >>>>>> At the moment it has to look into each helper to know if the
> >>>>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> >>>>>> a lower size. I hope this can me handled by core-mm in the future.
> >>>>>>
> >>>>>> There are probably several ways to implement stuff, so feedback is
> >>>>>> very welcome.
> >>>>>
> >>>>> I thought it looks pretty good!
> >>>>
> >>>> I second it.
> >>>>
> >>>> I saw the discussions in patch 1.  Christophe, I suppose you're
> >>>> exploring
> >>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
> >>>> solution for nohash/32bit challenge you mentioned?
> >>>>
> >>>> I'm trying to position my next step; it seems like at least I should
> >>>> not
> >>>> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD
> >>>> checks,
> >>>> or you're going to have an RFC soon then I can base on top?
> >>>
> >>> Depends on what you expect by "soon".
> >>>
> >>> I sure won't be able to send any RFC before end of April.
> >>>
> >>> Should be possible to have something during May.
> >>
> >> That's good enough, thanks.  I'll see what is the best I can do.
> >>
> >> Then do you think I can leave p4d/pgd leaves alone?  Please check the
> >> other
> >> email where I'm not sure whether pgd leaves ever existed for any of
> >> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
> >> recognize pud and lower for all leaves.  Then if Power can switch from
> >> hugepd to this it should just work.
> >
> > Well, if I understand correctly, something with no PMD will include
> > <asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd
> > entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is.
> > pgd_populate(), p4d_populate(), pud_populate() are all "do { } while
> > (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that
> > case.
> >
> > And therefore including <asm-generic/pgtable-nop4d.h> means .... you
> > have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be
> > checked.
> >
> >
> >>
> >> Even if pgd exists (then something I overlooked..), I'm wondering whether
> >> we can push that downwards to be either pud/pmd (and looks like we all
> >> agree p4d is never used on Power).  That may involve some pgtable
> >> operations moving from pgd level to lower, e.g. my pure imagination would
> >> look like starting with:
> >
> > Yes I think there is no doubt that p4d is never used:
> >
> > arch/powerpc/include/asm/book3s/32/pgtable.h:#include
> > <asm-generic/pgtable-nopmd.h>
> > arch/powerpc/include/asm/book3s/64/pgtable.h:#include
> > <asm-generic/pgtable-nop4d.h>
> > arch/powerpc/include/asm/nohash/32/pgtable.h:#include
> > <asm-generic/pgtable-nopmd.h>
> > arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include
> > <asm-generic/pgtable-nop4d.h>
> >
> > But that means that PPC32 have pmd entries and PPC64 have p4d entries ...
> >
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    0
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
> >>
> >> To:
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    0
> >
> > But then you can't anymore have #define PTRS_PER_PMD 1 from
> > <asm-generic/pgtable-nop4d.h>
> >
> >>
> >> And the rest will need care too.  I hope moving downward is easier
> >> (e.g. the walker should always exist for lower levels but not always for
> >> higher levels), but I actually have little idea on whether there's any
> >> other implications, so please bare with me on stupid mistakes.
> >>
> >> I just hope pgd leaves don't exist already, then I think it'll be
> >> simpler.
> >>
> >> Thanks,
> >>
>
> Digging into asm-generic/pgtable-nopmd.h, I see a definition of
> pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5
> ("riscv/mm: fix two page table check related issues")
>
> So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf()

I think should be yes for this, and..

> and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf() ?

. probably no for this? It seems pgd is just slightly special.

Firstly, I notice that the -nopmd.h actually includes -nopud.h already, and
further that includes -nop4d.h. It means we can only have below options:

a) nopmd + nopud + nop4d
b) nopud + nop4d
c) nop4d

It should also mean we can't randomly choose which layer to skip with the
current header arrangements.. at the meantime, all 32bit PowerPC should
perhaps fall into a), while 64 bits fall into c). That looks all fine for
now.

Above p4d_leaf()==false [1] should be fine when -nopud.h included, because
that already included nop4d.h, it means "p4d level is skipped" in the
pgtable. Then it doesn't make sense if p4d_leaf() can ever return true.
That's the same when pud_leaf()==false looks sane when an arch included
-nopmd.h as that in turn implies -nopud too.

pgd seems different, because -nop4d.h didn't include anything else like
"-nopgd.h".. so I don't see further implication on pgd sololy from the
headers. I guess that's also why 32bit Power uses pgd+pte for the two
levels; it looks like pgd is special among the others.

However I think it still didn't mean that we can't push pgd to pmd, making
pgd+pte into pmd+pte. Though here we may want to move from:

#include <asm-generic/pgtable-nopmd.h>
(pmd/pud/p4d not used)

Into:

#include <asm-generic/pgtable-nopud.h>
(pud/p4d not used)

Then we may need to provide something similar to what's in -nopXd.h for
pgds.

But let's loop back to the very beginning: I don't think we have either pgd
leaves or p4d leaves for PowerPC. Note that hugepd looks possible to exist
in pgd entries (per wiki page [1]), however I don't even think it's true in
reality, as I mentioned elsewhere on reading __find_linux_pte() and it
always go directly into p4d. I highly doubt in reality the "pgd hugepd
entries" are actually processed by the p4d layer here:

if (is_hugepd(__hugepd(p4d_val(p4d)))) {
hpdp = (hugepd_t *)&p4d;
goto out_huge;
}

Because when with "nop4d" it doesn't mean "there is no p4d" but what it
really meant is "we have only one p4d entry (which is actually exactly the
pgd entry..)". After all, is_hugepd() works for all levels.

Taking E500 nohash 32 as example, it has these:

"pgd entry covering 2M" -> "hugepd with one hugepte covering one 4M hugepage"
^
"pgd entry covering 2M" ---------+

I suspect this "pgd covering 2M" is processed by the p4d code above, rather
than pgd level. Then in the future world where there's no hugepd, it'll
naturally become a pgtable page at pte level.

Thanks,

[1] https://github.com/linuxppc/wiki/wiki/Huge-pages

--
Peter Xu