2024-06-11 09:34:38

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry

On Mon, Jun 10, 2024 at 07:54:47AM +0200, Christophe Leroy wrote:
> On powerpc 8xx, when a page is 8M size, the information is in the PMD
> entry. So allow architectures to provide __pte_leaf_size() instead of
> pte_leaf_size() and provide the PMD entry to that function.
>
> When __pte_leaf_size() is not defined, define it as a pte_leaf_size()
> so that architectures not interested in the PMD arguments are not
> impacted.
>
> Only define a default pte_leaf_size() when __pte_leaf_size() is not
> defined to make sure nobody adds new calls to pte_leaf_size() in the
> core.

Hi Christophe,

Now I am going to give you a hard time, so sorry in advance.
I should have raised this before, but I was not fully aware of it.

There is an ongoing effort to unify pagewalkers [1], so hugetlb does not have
to be special-cased anymore, and the operations we do for THP on a page-table
basis work for hugetlb as well.

The most special bit about this is huge_ptep_get().
huge_ptep_get() gets special handling on arm/arm64/riscv and s390.

On arm64 and riscv it is about cont-pmd/pte and propagating the dirty/young
bits, so that is fine, as walkers can already understand that.
s390 is a funny one because it converts pud/pmd to pte and vice versa, because
hugetlb *works* with ptes, so before returning the pte it has to transfer all
bits from PUD/PMD level into something that PTE level can understand.
As you can imagine, this can go away, as we already have all the
information at PUD/PMD level, and that is all pagewalkers need.
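As an aside, the kind of repacking s390 does can be modeled roughly like this. The bit layouts below are hypothetical, chosen purely for illustration; they are not s390's real formats (the real conversion helpers live in arch/s390/mm/hugetlbpage.c):

```c
#include <stdint.h>

/*
 * Illustrative model only: made-up PMD-level and PTE-level bit layouts.
 * It just shows the kind of bit transfer huge_ptep_get() has to do when
 * hugetlb insists on working with ptes while the data lives in the PMD.
 */
#define PMD_F_DIRTY 0x1u
#define PMD_F_YOUNG 0x2u

#define PTE_F_DIRTY 0x10u
#define PTE_F_YOUNG 0x20u

/* Repack PMD-level dirty/young bits into the PTE-level encoding. */
static uint32_t pmd_bits_to_pte_bits(uint32_t pmd_bits)
{
	uint32_t pte_bits = 0;

	if (pmd_bits & PMD_F_DIRTY)
		pte_bits |= PTE_F_DIRTY;
	if (pmd_bits & PMD_F_YOUNG)
		pte_bits |= PTE_F_YOUNG;
	return pte_bits;
}
```

Once pagewalkers consume PUD/PMD entries directly, this translation step has nothing left to do.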

But we are left with the one you will introduce in patch#8.

8MB pages get mapped as cont-pte, but all the information is stored in
the PMD entries (size, dirtiness, presence, etc.).
huge_ptep_get() will return the PMD for 8MB, and so all operations hugetlb
code performs with what huge_ptep_get() returns will be performed on those PMDs.

Which brings me to this point:

I do not think __pte_leaf_size() is needed. AFAICS, it should be enough to
define pmd_leaf() on 8xx and have it return true for an 8MB hugepage.

#define pmd_leaf pmd_leaf
static inline bool pmd_leaf(pmd_t pmd)
{
	return pmd_val(pmd) & _PMD_PAGE_8M;
}

and then define pmd_leaf_size() to return SZ_8M for such entries.
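A possible shape for those helpers, sketched standalone so it compiles outside the kernel. The pmd_t, pmd_val(), _PMD_PAGE_8M, and SZ_8M definitions below are stub placeholders, not the real powerpc ones:

```c
#include <stdbool.h>

/* Stand-ins for the powerpc types/macros, for illustration only. */
typedef struct { unsigned long val; } pmd_t;
#define pmd_val(x)    ((x).val)
#define _PMD_PAGE_8M  0x0002UL   /* placeholder, not the real bit value */
#define SZ_8M         0x00800000UL

/* pmd_leaf(): an 8xx PMD is a leaf only when it maps an 8M hugepage. */
static inline bool pmd_leaf(pmd_t pmd)
{
	return pmd_val(pmd) & _PMD_PAGE_8M;
}

/* pmd_leaf_size(): any leaf PMD on 8xx would be 8M. */
static inline unsigned long pmd_leaf_size(pmd_t pmd)
{
	return pmd_leaf(pmd) ? SZ_8M : 0;
}
```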

This will help because, in the ongoing effort of unifying hugetlb and
getting rid of huge_ptep_get() [1], pagewalkers will stumble upon the
8MB PMD as they do for regular PMDs.

Which means that they would be caught in the following code:

ptl = pmd_huge_lock(pmd, vma);
if (ptl) {
	/* 8MB hugepages will be handled here */
	smaps_pmd_entry(pmd, addr, walk);
	spin_unlock(ptl);
}
/* pte stuff */
...

where pmd_huge_lock is:

static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
	spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

	if (pmd_leaf(*pmd) || pmd_devmap(*pmd))
		return ptl;
	spin_unlock(ptl);
	return NULL;
}

So, since pmd_leaf() will return true for 8MB hugepages, we are fine,
because we want to perform pagetable operations on *that* PMD anyway and
not on the ptes that are cont-mapped, which is different from e.g. 512K
hugepages, where we perform them at pte level.

So I would suggest that, instead of this patch, we have one implementing
pmd_leaf() and pmd_leaf_size() for 8MB hugepages on powerpc 8xx, as that
takes us closer to our goal of unifying hugetlb.

[1] https://github.com/leberus/linux/tree/hugetlb-pagewalk-v2


--
Oscar Salvador
SUSE Labs


2024-06-11 14:18:34

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry

Oscar,

On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
> Which means that they would be caught in the following code:
>
> ptl = pmd_huge_lock(pmd, vma);
> if (ptl) {
> - 8MB hugepages will be handled here
> smaps_pmd_entry(pmd, addr, walk);
> spin_unlock(ptl);
> }
> /* pte stuff */
> ...

Just one quick comment: I think there's one challenge though as this is
also not a generic "pmd leaf", but a pgtable page underneath. I think it
means smaps_pmd_entry() won't trivially work here, e.g., it will start to
do this:

if (pmd_present(*pmd)) {
	page = vm_normal_page_pmd(vma, addr, *pmd);

Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
definition as:

* - It should contain a huge PFN, which points to a huge page larger than
* PAGE_SIZE of the platform. The PFN format isn't important here.

But now it's a pgtable page, containing cont-ptes. Similarly, I think most
pmd_*() helpers will stop working there if we report it as a leaf.

Thanks,

--
Peter Xu


2024-06-11 14:51:16

by LEROY Christophe

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry



On 11/06/2024 at 11:34, Oscar Salvador wrote:
>
> On Mon, Jun 10, 2024 at 07:54:47AM +0200, Christophe Leroy wrote:
>> On powerpc 8xx, when a page is 8M size, the information is in the PMD
>> entry. So allow architectures to provide __pte_leaf_size() instead of
>> pte_leaf_size() and provide the PMD entry to that function.
>>
>> When __pte_leaf_size() is not defined, define it as a pte_leaf_size()
>> so that architectures not interested in the PMD arguments are not
>> impacted.
>>
>> Only define a default pte_leaf_size() when __pte_leaf_size() is not
>> defined to make sure nobody adds new calls to pte_leaf_size() in the
>> core.
>
> Hi Christophe,
>
> Now I am going to give you a hard time, so sorry in advance.
> I should have raised this before, but I was not fully aware of it.
>
> There is an ongoing effort of unifying pagewalkers [1], so hugetlb does not have
> to be special-cased anymore, and the operations we do for THP on page-table basis
> work for hugetlb as well.
>
> The most special bit about this is huge_ptep_get.
> huge_ptep_get() gets special handling on arm/arm64/riscv and s390.
>
> On arm64 and riscv it is about cont-pmd/pte and propagating the dirty/young
> bits, so that is fine, as walkers can already understand that.
> s390 is a funny one because it converts pud/pmd to pte and vice versa, because
> hugetlb *works* with ptes, so before returning the pte it has to transfer all
> bits from PUD/PMD level into something that PTE level can understand.
> As you can imagine, this can go away, as we already have all the
> information at PUD/PMD level, and that is all pagewalkers need.
>
> But we are left with the one you will introduce in patch#8.
>
> 8MB pages get mapped as cont-pte, but all the information is stored in
> the PMD entries (size, dirtiness, present etc).

I'm not sure I understand what you mean.

In my case, the PMD entry is almost standard; the only extra thing it
contains is a bit telling that the PTEs it points to are to be mapped 8M.

> huge_ptep_get() will return the PMD for 8MB, and so all operations hugetlb
> code performs with what huge_ptep_get returns will be performed on those PMDs.

Indeed no, my huge_ptep_get() doesn't return the PMD but the PTE.

>
> Which brings me to this point:
>
> I do not think __pte_leaf_size is needed. AFAICS, it should be enough to define
> pmd_leaf on 8xx, and return 8MB if it is a 8MB hugepage.

If I declare it as a PMD leaf, then many places will expect the PTE
entry to be the PMD entry, which is not the case here. As far as I
understand, for the walker to walk down to the page table, we need the
entry flagged as non-leaf by pmd_leaf().

>
> #define pmd_leaf pmd_leaf
> static inline bool pmd_leaf(pmd_t pmd)
> {
> 	return pmd_val(pmd) & _PMD_PAGE_8M;
> }
>
> and then pmd_leaf_size to return _PMD_PAGE_8M.
>
> This will help because on the ongoing effort of unifying hugetlb and
> getting rid of huge_ptep_get() [1], pagewalkers will stumble upon the
> 8mb-PMD as they do for regular PMDs.

But AFAIU, it won't be that simple, because *pmd is definitely not a
PTE but still a pointer to a page table which contains the PTEs.

>
> Which means that they would be caught in the following code:
>
> ptl = pmd_huge_lock(pmd, vma);
> if (ptl) {
> 	/* 8MB hugepages will be handled here */
> 	smaps_pmd_entry(pmd, addr, walk);
> 	spin_unlock(ptl);
> }
> /* pte stuff */
> ...
>
> where pmd_huge_lock is:
>
> static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> {
> 	spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
>
> 	if (pmd_leaf(*pmd) || pmd_devmap(*pmd))
> 		return ptl;
> 	spin_unlock(ptl);
> 	return NULL;
> }
>
> So, since pmd_leaf() will return true for 8MB hugepages, we are fine,
> because anyway we want to perform pagetable operations on *that* PMD and
> not the ptes that are cont-mapped, which is different for e.g: 512K
> hugepages, where we perform it on pte level.

We still want to do the operation on the cont-PTEs, in fact in both 4M
page tables so that we cover the 8M. There is no operation to do on the
PMD entry itself except telling that there is an 8M page table underneath.

>
> So I would suggest that instead of this patch, we have one implementing pmd_leaf
> and pmd_leaf_size for 8Mb hugepages on power8xx, as that takes us closer to our goal of
> unifying hugetlb.

But then, how will it work to go down the PTE road?

Christophe

2024-06-11 15:19:00

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry

On Tue, Jun 11, 2024 at 10:17:30AM -0400, Peter Xu wrote:
> Oscar,
>
> On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
> > Which means that they would be caught in the following code:
> >
> > ptl = pmd_huge_lock(pmd, vma);
> > if (ptl) {
> > - 8MB hugepages will be handled here
> > smaps_pmd_entry(pmd, addr, walk);
> > spin_unlock(ptl);
> > }
> > /* pte stuff */
> > ...
>
> Just one quick comment: I think there's one challenge though as this is
> also not a generic "pmd leaf", but a pgtable page underneath. I think it
> means smaps_pmd_entry() won't trivially work here, e.g., it will start to
> do this:
>
> if (pmd_present(*pmd)) {
> 	page = vm_normal_page_pmd(vma, addr, *pmd);
>
> Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
> definition as:
>
> * - It should contain a huge PFN, which points to a huge page larger than
> * PAGE_SIZE of the platform. The PFN format isn't important here.
>
> But now it's a pgtable page, containing cont-ptes. Similarly, I think most
> pmd_*() helpers will stop working there if we report it as a leaf.

Heh, I think I managed to confuse myself.
I do not know why, but I thought that

static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
	if (ptep_is_8m_pmdp(mm, addr, ptep))
		ptep = pte_offset_kernel((pmd_t *)ptep, 0);
	return ptep_get(ptep);
}

would return the address of the pmd for 8MB hugepages, but it will
return the address of the first pte?

Then yeah, this will not work as I thought.

The problem is that we do not have spare bits on 8xx to mark these ptes
as cont-ptes or mark them as 8MB ptes, so I do not see a clear path for
how we could remove huge_ptep_get() for 8xx.

I am really curious, though, how we handle that for THP. Or does THP on 8xx
not support that size?


--
Oscar Salvador
SUSE Labs

2024-06-11 15:20:25

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry

On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
> The problem is that we do not have spare bits for 8xx to mark these ptes
> as cont-ptes or mark them pte as 8MB, so I do not see a clear path on how
> we could remove huge_ptep_get for 8xx.

Right, I remember I thought about this too when I initially looked at a
previous version of the series. I didn't come up with a good solution yet,
but I guess we probably need to get rid of hugepd first anyway. We may
somehow still need to identify this as an 8M large leaf, and I guess this is
again the only special case where contpte can span more than one PMD.

>
> I am really curious though how we handle that for THP? Or THP on 8xx
> does not support that size?

I'll leave this to Christophe, but IIUC THP is only PMD_ORDER sized, so it
shouldn't apply to the 8MB pages.

Thanks,

--
Peter Xu


2024-06-11 16:14:11

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry

On Tue, Jun 11, 2024 at 11:20:01AM -0400, Peter Xu wrote:
> On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
> > The problem is that we do not have spare bits for 8xx to mark these ptes
> > as cont-ptes or mark them pte as 8MB, so I do not see a clear path on how
> > we could remove huge_ptep_get for 8xx.
>
> Right, I remember I thought about this too when I initially looked at one
> previous version of the series, I didn't come up yet with a good solution,
> but I guess we probably need to get rid of hugepd first anyway. We may
> somehow still need to identify this is a 8M large leaf, and I guess this is
> again the only special case where contpte can go over >1 pmds.

Yes, we definitely need to get rid of hugepd first, which is a huge
step and one step closer to our goal, but at some point we will have to
see what we can do about 8MB cont-ptes for 8xx and how to mark them,
so ptep_get() can work the same way as e.g. arm64
(ptep_get()->contpte_ptep_get()).
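For reference, arm64's contpte_ptep_get() folds the per-entry dirty/young bits of the whole contiguous run into the pte it returns. A standalone model of that folding, with stub types and made-up flag values (arm64's real implementation, in arch/arm64/mm/contpte.c, is considerably more careful):

```c
#include <stddef.h>

/* Stub pte and flag values, for illustration only. */
typedef struct { unsigned long val; } pte_t;
#define PTE_DIRTY 0x1UL
#define PTE_YOUNG 0x2UL
#define CONT_PTES 4   /* entries in one contiguous block (made up) */

/* Return the first pte of the block with dirty/young ORed in from all
 * members, so a caller sees the aggregate state of the whole mapping. */
static pte_t contpte_get(const pte_t *block)
{
	pte_t ret = block[0];

	for (int i = 1; i < CONT_PTES; i++)
		ret.val |= block[i].val & (PTE_DIRTY | PTE_YOUNG);
	return ret;
}
```

The 8xx case is harder precisely because there is no spare pte bit to tell ptep_get() that an entry belongs to such a block.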

@Christophe: Can you think of a way to flag those ptes? (e.g. a
combination of bits, etc.)

> I'll leave this to Christophe, but IIUC thp is only PMD_ORDER sized, so
> shouldn't apply to the 8MB pages.

That might be it, yes.


--
Oscar Salvador
SUSE Labs

2024-06-11 16:54:26

by LEROY Christophe

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry



On 11/06/2024 at 17:08, Oscar Salvador wrote:
>
> On Tue, Jun 11, 2024 at 10:17:30AM -0400, Peter Xu wrote:
>> Oscar,
>>
>> On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
>>> Which means that they would be caught in the following code:
>>>
>>> ptl = pmd_huge_lock(pmd, vma);
>>> if (ptl) {
>>> 	/* 8MB hugepages will be handled here */
>>> 	smaps_pmd_entry(pmd, addr, walk);
>>> 	spin_unlock(ptl);
>>> }
>>> /* pte stuff */
>>> ...
>>
>> Just one quick comment: I think there's one challenge though as this is
>> also not a generic "pmd leaf", but a pgtable page underneath. I think it
>> means smaps_pmd_entry() won't trivially work here, e.g., it will start to
>> do this:
>>
>> if (pmd_present(*pmd)) {
>> 	page = vm_normal_page_pmd(vma, addr, *pmd);
>>
>> Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
>> definition as:
>>
>> * - It should contain a huge PFN, which points to a huge page larger than
>> * PAGE_SIZE of the platform. The PFN format isn't important here.
>>
>> But now it's a pgtable page, containing cont-ptes. Similarly, I think most
>> pmd_*() helpers will stop working there if we report it as a leaf.
>
> Heh, I think I managed to confuse myself.
> I do not why but I thought that
>
> static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> {
> 	if (ptep_is_8m_pmdp(mm, addr, ptep))
> 		ptep = pte_offset_kernel((pmd_t *)ptep, 0);
> 	return ptep_get(ptep);
> }
>
> would return the address of the pmd for 8MB hugepages, but it will
> return the address of the first pte?
>
> Then yeah, this will not work as I thought.
>
> The problem is that we do not have spare bits for 8xx to mark these ptes
> as cont-ptes or mark them pte as 8MB, so I do not see a clear path on how
> we could remove huge_ptep_get for 8xx.
>
> I am really curious though how we handle that for THP? Or THP on 8xx
> does not support that size?

8xx doesn't support THP. As far as I know, THP is only supported on
single leaf PMD/PUD, not on cont-PUD/PMD, although there is some work in
progress on arm64 to add that.

But honestly I'm not too interested in 8M transparent pages; what
I'd like to have first is 512k transparent pages.

Christophe

2024-06-12 17:26:46

by LEROY Christophe

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry



On 11/06/2024 at 18:10, Oscar Salvador wrote:
>
> On Tue, Jun 11, 2024 at 11:20:01AM -0400, Peter Xu wrote:
>> On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
>>> The problem is that we do not have spare bits for 8xx to mark these ptes
>>> as cont-ptes or mark them pte as 8MB, so I do not see a clear path on how
>>> we could remove huge_ptep_get for 8xx.
>>
>> Right, I remember I thought about this too when I initially looked at one
>> previous version of the series, I didn't come up yet with a good solution,
>> but I guess we probably need to get rid of hugepd first anyway. We may
>> somehow still need to identify this is a 8M large leaf, and I guess this is
>> again the only special case where contpte can go over >1 pmds.
>
> Yes, we definitely need first to get rid of hugepd, which is a huge
> step, and one step closer to our goal, but at some point we will have to
> see what can we do about 8MB cont-ptes for 8xx and how to mark them,
> so ptep_get can work the same way as e.g: arm64
> (ptep_get->contpte_ptep_get).
>
> @Christophe: Can you think of a way to flag those ptes? (e.g: a
> combination of bits etc)

We have space available in PMD if we need more flags, but in PTE I can't
see anything possible without additional churn that would require
additional instructions in TLB miss handlers, which is what I want to
avoid most.

Bits mapped to HW PTE:

#define _PAGE_PRESENT 0x0001 /* V: Page is valid */
#define _PAGE_NO_CACHE 0x0002 /* CI: cache inhibit */
#define _PAGE_SH 0x0004 /* SH: No ASID (context) compare */
#define _PAGE_SPS 0x0008 /* SPS: Small Page Size (1 if 16k, 512k or 8M)*/
#define _PAGE_DIRTY 0x0100 /* C: page changed */
#define _PAGE_NA 0x0200 /* Supervisor NA, User no access */
#define _PAGE_RO 0x0600 /* Supervisor RO, User no access */

SW bits masked out in TLB miss handler:

#define _PAGE_GUARDED 0x0010 /* Copied to L1 G entry in DTLB */
#define _PAGE_ACCESSED 0x0020 /* Copied to L1 APG 1 entry in I/DTLB */
#define _PAGE_EXEC 0x0040 /* Copied to PP (bit 21) in ITLB */
#define _PAGE_SPECIAL 0x0080 /* SW entry */
#define _PAGE_HUGE 0x0800 /* Copied to L1 PS bit 29 */

All bits are used. The only thing we could do, but that would have a
performance cost, is to retrieve _PAGE_SH from the PMD and use that bit
for something else.

But maybe I was thinking of another way. Let's take the example of the
pmd_write() helper:

#define pmd_write(pmd) pte_write(pmd_pte(pmd))

At the time being we have

static inline pte_t pmd_pte(pmd_t pmd)
{
	return __pte(pmd_val(pmd));
}

But what about something like

static inline pte_t pmd_pte(pmd_t pmd)
{
	return *(pte_t *)pmd_page_vaddr(pmd);
}

Would it do the trick?

Of course it would require carefully making sure all accesses are done
through pmd_pte().

Would that work?
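A toy standalone model of that indirection, with made-up types, just to illustrate the proposal (the real pmd_page_vaddr(), __pte(), and pte helpers come from arch/powerpc):

```c
#include <stdbool.h>

/* Toy stand-ins, for illustration only. */
typedef struct { unsigned long val; } pte_t;
typedef struct { pte_t *table; } pmd_t;   /* a pmd points at a pte table */
#define PTE_RW 0x1UL                      /* made-up write-permission bit */

static inline pte_t *pmd_page_vaddr(pmd_t pmd)
{
	return pmd.table;
}

/* The proposal: pmd_pte() reads the first pte of the table the pmd
 * points to, so pte_*() helpers see the cont-pte's bits rather than
 * the raw pmd value. */
static inline pte_t pmd_pte(pmd_t pmd)
{
	return *pmd_page_vaddr(pmd);
}

static inline bool pte_write(pte_t pte)
{
	return pte.val & PTE_RW;
}

#define pmd_write(pmd) pte_write(pmd_pte(pmd))
```

With this shape, pmd_write() transparently reports the bits of the first cont-pte under the 8M PMD, which seems to be what the proposal is after.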



Christophe

2024-06-13 07:20:14

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry

On Tue, Jun 11, 2024 at 07:00:14PM +0000, LEROY Christophe wrote:
> We have space available in PMD if we need more flags, but in PTE I can't
> see anything possible without additional churn that would require
> additional instructions in TLB miss handlers, which is what I want to
> avoid most.
>
> Bits mapped to HW PTE:
>
> #define _PAGE_PRESENT 0x0001 /* V: Page is valid */
> #define _PAGE_NO_CACHE 0x0002 /* CI: cache inhibit */
> #define _PAGE_SH 0x0004 /* SH: No ASID (context) compare */
> #define _PAGE_SPS 0x0008 /* SPS: Small Page Size (1 if 16k, 512k or 8M)*/
> #define _PAGE_DIRTY 0x0100 /* C: page changed */
> #define _PAGE_NA 0x0200 /* Supervisor NA, User no access */
> #define _PAGE_RO 0x0600 /* Supervisor RO, User no access */
>
> SW bits masked out in TLB miss handler:
>
> #define _PAGE_GUARDED 0x0010 /* Copied to L1 G entry in DTLB */
> #define _PAGE_ACCESSED 0x0020 /* Copied to L1 APG 1 entry in I/DTLB */
> #define _PAGE_EXEC 0x0040 /* Copied to PP (bit 21) in ITLB */
> #define _PAGE_SPECIAL 0x0080 /* SW entry */
> #define _PAGE_HUGE 0x0800 /* Copied to L1 PS bit 29 */
>
> All bits are used. The only thing we could do, but that would have a
> performance cost, is to retrieve _PAGE_SH from the PMD and use that bit
> for something else.

I guess that this would be the last resort if we run out of options.
But at least it is good to know that there is a plan B (or Z if you will
:-))

> But maybe I was thinking of another way. Let's take the example of the
> pmd_write() helper:
>
> #define pmd_write(pmd) pte_write(pmd_pte(pmd))
>
> At the time being we have
>
> static inline pte_t pmd_pte(pmd_t pmd)
> {
> 	return __pte(pmd_val(pmd));
> }
>
> But what about something like
>
> static inline pte_t pmd_pte(pmd_t pmd)
> {
> 	return *(pte_t *)pmd_page_vaddr(pmd);
> }

I think this could work, yes.

So, we should define all pmd_*(pmd) operations for 8xx the way they are defined
in include/asm/book3s/64/pgtable.h.

Other page sizes would not interfere because they can already perform
operations at pte level.

Ok, I think we might have a shot here.

I would help with testing, but I do not have 8xx hardware and QEMU does not
support 8xx emulation. Still, I think that if we are careful enough, this
can work.

Actually, as a smoketest it would be enough to have a task with an 8MB
hugepage mapped, and then do:

static const struct mm_walk_ops test_walk_ops = {
	.pmd_entry = test_8mbp_hugepage,
	.pte_entry = test_16k_and_512k_hugepage,
	.hugetlb_entry = check_hugetlb_entry,
	.walk_lock = PGWALK_RDLOCK,
};

static int test(void)
{
	int ret;

	pr_info("%s: %s [0 - %lx]\n", __func__, current->comm, TASK_SIZE);
	mmap_read_lock(current->mm);
	ret = walk_page_range(current->mm, 0, TASK_SIZE, &test_walk_ops, NULL);
	mmap_read_unlock(current->mm);

	pr_info("%s: %s ret: %d\n", __func__, current->comm, ret);

	return 0;
}

This is an extract of a debugging mechanism I have to check that I am
not going off rails when unifying hugetlb and normal walkers.

test_8mbp_hugepage() could do some checks with pmd_*() operations, print
the results, and then compare them with those that check_hugetlb_entry()
would give us.
If everything is alright, both results should be the same.

I can write the tests up, so we run some sort of smoketests.

So yes, I do think that this is a good initiative.

Thanks a lot, Christophe

--
Oscar Salvador
SUSE Labs