2024-05-20 13:03:55

by Nicholas Piggin

Subject: Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD

On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
> On book3s/64, the only user of hugepd is hash in 4k mode.
>
> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>
> Rework hash-4k to use contiguous PMD and PUD instead.
>
> In that setup there are only two huge page sizes: 16M and 16G.
>
> 16M sits at PMD level and 16G at PUD level.
>
> pte_update doesn't know the page size; let's use the same trick as
> hpte_need_flush() to get the page size from the segment properties.
> That's not the most efficient way, but let's do that until callers
> of pte_update() provide the page size instead of just a huge flag.
>
> Signed-off-by: Christophe Leroy <[email protected]>
> ---
> arch/powerpc/include/asm/book3s/64/hash-4k.h | 15 --------
> arch/powerpc/include/asm/book3s/64/hash.h | 38 +++++++++++++++----
> arch/powerpc/include/asm/book3s/64/hugetlb.h | 38 -------------------
> .../include/asm/book3s/64/pgtable-4k.h | 34 -----------------
> .../include/asm/book3s/64/pgtable-64k.h | 20 ----------
> arch/powerpc/include/asm/hugetlb.h | 4 ++
> .../include/asm/nohash/32/hugetlb-8xx.h | 4 --
> .../powerpc/include/asm/nohash/hugetlb-e500.h | 4 --
> arch/powerpc/include/asm/page.h | 8 ----
> arch/powerpc/mm/book3s64/hash_utils.c | 11 ++++--
> arch/powerpc/mm/book3s64/pgtable.c | 12 ------
> arch/powerpc/mm/hugetlbpage.c | 19 ----------
> arch/powerpc/mm/pgtable.c | 2 +-
> arch/powerpc/platforms/Kconfig.cputype | 1 -
> 14 files changed, 43 insertions(+), 167 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index 6472b08fa1b0..c654c376ef8b 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -74,21 +74,6 @@
> #define remap_4k_pfn(vma, addr, pfn, prot) \
> remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
>
> -#ifdef CONFIG_HUGETLB_PAGE
> -static inline int hash__hugepd_ok(hugepd_t hpd)
> -{
> - unsigned long hpdval = hpd_val(hpd);
> - /*
> - * if it is not a pte and have hugepd shift mask
> - * set, then it is a hugepd directory pointer
> - */
> - if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
> - ((hpdval & HUGEPD_SHIFT_MASK) != 0))
> - return true;
> - return false;
> -}
> -#endif
> -
> /*
> * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
> * a matter of returning the PTE bits that need to be modified. On 64K PTE,
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> index faf3e3b4e4b2..509811ca7695 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -4,6 +4,7 @@
> #ifdef __KERNEL__
>
> #include <asm/asm-const.h>
> +#include <asm/book3s/64/slice.h>
>
> /*
> * Common bits between 4K and 64K pages in a linux-style PTE.
> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned long pte, int huge);
> unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
> /* Atomic PTE updates */
> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
> - unsigned long addr,
> - pte_t *ptep, unsigned long clr,
> - unsigned long set,
> - int huge)
> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
> + unsigned long set)
> {
> __be64 old_be, tmp_be;
> - unsigned long old;
>
> __asm__ __volatile__(
> "1: ldarx %0,0,%3 # pte_update\n\
> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
> : "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
> "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
> : "cc" );
> +
> + return be64_to_cpu(old_be);
> +}
> +
> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
> + unsigned long addr,
> + pte_t *ptep, unsigned long clr,
> + unsigned long set,
> + int huge)
> +{
> + unsigned long old;
> +
> + old = hash__pte_update_one(ptep, clr, set);
> +
> + if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> + unsigned int psize = get_slice_psize(mm, addr);
> + int nb, i;
> +
> + if (psize == MMU_PAGE_16M)
> + nb = SZ_16M / PMD_SIZE;
> + else if (psize == MMU_PAGE_16G)
> + nb = SZ_16G / PUD_SIZE;
> + else
> + nb = 1;
> +
> + for (i = 1; i < nb; i++)
> + hash__pte_update_one(ptep + i, clr, set);
> + }
> /* huge pages use the old page table lock */
> if (!huge)
> assert_pte_locked(mm, addr);
>
> - old = be64_to_cpu(old_be);
> if (old & H_PAGE_HASHPTE)
> hpte_need_flush(mm, addr, ptep, old, huge);
>

Nice series, I don't know this hugepd code very well but I'll try.
Why do you have to replicate the PTE entry here? The hash table refill
should always be working on the first PTE of the page, otherwise we have
bigger problems.

What paths look at the N > 0 PTEs of a contiguous page entry?

Thanks,
Nick


2024-05-20 16:43:30

by Christophe Leroy

Subject: Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD



Le 20/05/2024 à 14:54, Nicholas Piggin a écrit :
> On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
>> [...]
>
> Nice series, I don't know this hugepd code very well but I'll try.
> Why do you have to replicate the PTE entry here? The hash table refill
> should always be working on the first PTE of the page, otherwise we have
> bigger problems.

I don't know how book3s/64 works exactly, but on nohash, when you get a
TLB miss exception the only thing you have is the address and you don't
know yet whether it is a hugepage, so you get the PTE as if it was a 4k
page, and it is only when you read that PTE that you know it is a
hugepage.

Ok, on book3s/64 the page size seems to be encoded inside the segment,
so maybe it is a bit different, but anyway the TLB miss exception (or
DSI?) can happen at any address.

>
> What paths look at the N > 0 PTEs of a contiguous page entry?
>

pte_offset_kernel() or pte_offset_map_lock() will land on any of the
contiguous PTEs based on the address handed to pte_index(), as if it
was a standard (4k or 64k) page.

pte_index() doesn't know it is a hugepage; that's why we need to
duplicate the entry.
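
To make this concrete, here is a minimal sketch (illustrative only, not
kernel code; the shift and mask are assumptions for a 16M page built
from 2M entries with a 9-bit index):

	/*
	 * A size-unaware walker only has the faulting address, so it
	 * computes the same index a normal page would get and can land
	 * on any entry of the contiguous group, not just the first.
	 * That is why every entry must carry the mapping.
	 */
	static unsigned long entry_index_sketch(unsigned long addr)
	{
		return (addr >> 21) & 0x1ff;	/* in the spirit of pte_index() */
	}

	/*
	 * For a 16M page at 'base' made of eight 2M entries:
	 *   entry_index_sketch(base + 0M)  -> i
	 *   entry_index_sketch(base + 2M)  -> i + 1
	 *   ...
	 *   entry_index_sketch(base + 14M) -> i + 7
	 * Eight distinct slots, one huge page.
	 */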

2024-05-22 01:28:10

by Nicholas Piggin

Subject: Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD

On Tue May 21, 2024 at 2:43 AM AEST, Christophe Leroy wrote:
>
>
> Le 20/05/2024 à 14:54, Nicholas Piggin a écrit :
> > On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
> >> [...]
> >
> > Nice series, I don't know this hugepd code very well but I'll try.
> > Why do you have to replicate the PTE entry here? The hash table refill
> > should always be working on the first PTE of the page, otherwise we have
> > bigger problems.
>
> I don't know how book3s/64 works exactly, but on nohash, when you get a
> TLB miss exception the only thing you have is the address and you don't
> know yet whether it is a hugepage, so you get the PTE as if it was a 4k
> page, and it is only when you read that PTE that you know it is a
> hugepage.
>
> Ok, on book3s/64 the page size seems to be encoded inside the segment,
> so maybe it is a bit different, but anyway the TLB miss exception (or
> DSI?) can happen at any address.

Right.

If you think of the hash page table as a software-loaded TLB (which
is how Linux kind of thinks of it), then a DSI is a TLB miss. The
hash_page_x calls find the Linux PTE and load that translation into
the hash page table.

One of the hard parts is keeping them coherent with low overhead. This
requires the PTE bits H_PAGE_BUSY, used as a lock, and H_PAGE_HASHPTE,
which means the entry might be in the hash table. So the Linux PTE and
the hash PTE have to be 1:1 in general.

There are probably cases where we could get away from 1:1, but I would
much prefer not to. Maybe read-only access would be okay though. But
hash_page will always have to operate on the 0th PTE, which I think we
get via segment size masking; the same goes for any set / update /
clear of the PTE.
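
As a rough C rendering of what the ldarx/stdcx. loop in the patch does
(a sketch only, with cmpxchg() standing in for the real inline asm):

	/*
	 * Sketch: H_PAGE_BUSY acts as a per-PTE lock. An update spins
	 * while another updater holds the bit, then atomically clears
	 * 'clr' and sets 'set', returning the old PTE value.
	 */
	static unsigned long pte_update_sketch(__be64 *ptep,
					       unsigned long clr,
					       unsigned long set)
	{
		__be64 old, new;

		for (;;) {
			old = READ_ONCE(*ptep);
			if (be64_to_cpu(old) & H_PAGE_BUSY) {
				cpu_relax();	/* lock bit held, retry */
				continue;
			}
			new = cpu_to_be64((be64_to_cpu(old) & ~clr) | set);
			if (cmpxchg(ptep, old, new) == old)
				break;		/* update committed */
		}
		return be64_to_cpu(old);
	}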

> >
> > What paths look at the N > 0 PTEs of a contiguous page entry?
> >
>
> pte_offset_kernel() or pte_offset_map_lock() will land on any of the
> contiguous PTEs based on the address handed to pte_index(), as if it
> was a standard (4k or 64k) page.
>
> pte_index() doesn't know it is a hugepage; that's why we need to
> duplicate the entry.

From the mm/ side of things, hugetlb page tables are always walked via
the huge vma which knows the page size and could align address... I
guess except for fast gup? Which should be read-only. So okay you do
need to replicate huge ptes for fast gup at least. Any others?

There's going to need to be a little more to it. __hash_page_huge sets
PTE accessed and dirty for example, so if we allow any PTE readers to
check the non-0th pte we would have to do something about that.

How do you deal with dirty/accessed bits for other subarchs?

We could just remove the hash_page setting of those bits and just cause
a fault and require Linux mm to set them. At least for hugepages we
could do that probably without any real performance worry.

Thanks,
Nick

2024-05-22 09:32:31

by Christophe Leroy

Subject: Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD



Le 22/05/2024 à 03:13, Nicholas Piggin a écrit :
> On Tue May 21, 2024 at 2:43 AM AEST, Christophe Leroy wrote:
>>
>>
>> Le 20/05/2024 à 14:54, Nicholas Piggin a écrit :
>>> On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
>>>> [...]
>>>
>>> Nice series, I don't know this hugepd code very well but I'll try.
>>> Why do you have to replicate the PTE entry here? The hash table refill
>>> should always be working on the first PTE of the page, otherwise we have
>>> bigger problems.
>>
>> I don't know how book3s/64 works exactly, but on nohash, when you get a
>> TLB miss exception the only thing you have is the address and you don't
>> know yet whether it is a hugepage, so you get the PTE as if it was a 4k
>> page, and it is only when you read that PTE that you know it is a
>> hugepage.
>>
>> Ok, on book3s/64 the page size seems to be encoded inside the segment,
>> so maybe it is a bit different, but anyway the TLB miss exception (or
>> DSI?) can happen at any address.
>
> Right.
>
> If you think of the hash page table as a software-loaded TLB (which
> is how Linux kind of thinks of it), then a DSI is a TLB miss. The
> hash_page_x calls find the Linux PTE and load that translation into
> the hash page table.
>
> One of the hard parts is keeping them coherent with low overhead. This
> requires the PTE bits H_PAGE_BUSY, used as a lock, and H_PAGE_HASHPTE,
> which means the entry might be in the hash table. So the Linux PTE and
> the hash PTE have to be 1:1 in general.
>
> There are probably cases where we could get away from 1:1, but I would
> much prefer not to. Maybe read-only access would be okay though. But
> hash_page will always have to operate on the 0th PTE, which I think we
> get via segment size masking; the same goes for any set / update /
> clear of the PTE.
>
>>>
>>> What paths look at the N > 0 PTEs of a contiguous page entry?
>>>
>>
>> pte_offset_kernel() or pte_offset_map_lock() will land on any of the
>> contiguous PTEs based on the address handed to pte_index(), as if it
>> was a standard (4k or 64k) page.
>>
>> pte_index() doesn't know it is a hugepage; that's why we need to
>> duplicate the entry.
>
> From the mm/ side of things, hugetlb page tables are always walked via
> the huge vma which knows the page size and could align address... I
> guess except for fast gup? Which should be read-only. So okay you do
> need to replicate huge ptes for fast gup at least. Any others?
>
> There's going to need to be a little more to it. __hash_page_huge sets
> PTE accessed and dirty for example, so if we allow any PTE readers to
> check the non-0th pte we would have to do something about that.
>
> How do you deal with dirty/accessed bits for other subarchs?

All nohash variants bail out of the TLB miss handler when accessing a
page which doesn't have the ACCESSED bit or writing to a page which
doesn't have the DIRTY bit, see commit 2c74e2586bb9 ("powerpc/40x:
Rework 40x PTE access and TLB miss") and the other commits it refers
to.

Same for the 603, which is the nohash version of book3s/32, see commits
f8b58c64eaef ("powerpc/603: let's handle PAGE_DIRTY directly") and
84de6ab0e904 ("powerpc/603: don't handle PAGE_ACCESSED in TLB miss
handlers.").

Only the hash version of book3s/32 still updates the PTE in the miss
handler, see
https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/mm/book3s32/hash_low.S#L146
but there are no hugepages on book3s/32.
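
In pseudo-C the policy is roughly this (a sketch only, the real
handlers are asm):

	/*
	 * Sketch, not the actual 8xx/603 code: refuse to load a
	 * translation that lacks ACCESSED, or a write translation
	 * that lacks DIRTY, and fall back to the page fault path,
	 * which sets those bits under the page table lock.
	 */
	static bool tlb_miss_load_sketch(pte_t pte, bool is_write)
	{
		if (!(pte_val(pte) & _PAGE_ACCESSED))
			return false;	/* bail: take a page fault */
		if (is_write && !(pte_val(pte) & _PAGE_DIRTY))
			return false;	/* bail: the fault sets DIRTY */
		/* load the translation into the TLB (MMU-specific) */
		return true;
	}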


>
> We could just remove the hash_page setting of those bits and just cause
> a fault and require Linux mm to set them. At least for hugepages we
> could do that probably without any real performance worry.
>
> Thanks,
> Nick

2024-05-22 12:24:27

by Jason Gunthorpe

Subject: Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD

On Wed, May 22, 2024 at 11:13:53AM +1000, Nicholas Piggin wrote:

> From the mm/ side of things, hugetlb page tables are always walked via
> the huge vma which knows the page size and could align address... I
> guess except for fast gup? Which should be read-only. So okay you do
> need to replicate huge ptes for fast gup at least. Any others?

We are trying to get away from this. We want all content in the page
table to be walkable via the normal pud/pmd/pte/etc functions, with
the special huge VMA walk limited to the weird hugetlbfs internals
only. It should not leak into the arch.

> There's going to need to be a little more to it. __hash_page_huge sets
> PTE accessed and dirty for example, so if we allow any PTE readers to
> check the non-0th pte we would have to do something about that.

Ryan added a special function to get the access and dirty flags from a
CONTIG PTE, so the arch can do the right thing here. The case where a
CONTIG PTE spans two PMD entries might be some trouble though.

> How do you deal with dirty/accessed bits for other subarchs?

The ARM and RISC-V versions will combine the access flags from every
sub-PTE. Their HW is allowed to set dirty/accessed bits on any PTE in
a contiguous set.
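
Roughly, in the spirit of the arm64 contpte code (a sketch under that
assumption, not the actual implementation):

	/*
	 * Sketch: since HW may set young/dirty on any entry of a
	 * contiguous set, a reader ORs the flags from all of them
	 * into the value it reports for the mapping.
	 */
	static pte_t contpte_collect_sketch(pte_t *ptep, int nr)
	{
		pte_t pte = ptep_get(ptep);
		int i;

		for (i = 1; i < nr; i++) {
			pte_t sub = ptep_get(ptep + i);

			if (pte_young(sub))
				pte = pte_mkyoung(pte);
			if (pte_dirty(sub))
				pte = pte_mkdirty(pte);
		}
		return pte;
	}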

Jason