2024-04-09 08:32:59

by Barry Song

Subject: [PATCH v2 0/5] large folios swap-in: handle refault cases first

From: Barry Song <[email protected]>

This patchset is extracted from the large folio swapin series[1] and primarily
addresses the handling of large folios that are hit in the swap cache. At
present it focuses on the refault of mTHP which is still undergoing
reclamation. Splitting it out this way aims to streamline code review and
expedite the integration of this part into the MM tree.

It relies on Ryan's swap-out series v7[2], leveraging the helper function
swap_pte_batch() introduced by that series.
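
For reference, the core check from patch 4/5 below, condensed into a small
helper for illustration only (the helper name is made up and the final code
may differ):

static bool swap_entries_batched(struct vm_fault *vmf, struct folio *folio,
                                 struct page *page, bool *any_swap_shared)
{
        int nr = folio_nr_pages(folio);
        int idx = folio_page_idx(folio, page);
        pte_t *folio_ptep = vmf->pte - idx;     /* pte of the folio's first page */
        pte_t folio_pte = ptep_get(folio_ptep);

        /* the first entry must still be a plain swap pte */
        if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)))
                return false;

        /* swap_pte_batch() counts consecutive swap ptes with matching sw bits */
        return swap_pte_batch(folio_ptep, nr, folio_pte, any_swap_shared) == nr;
}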

Presently, do_swap_page only encounters a large folio in the swap
cache before the large folio is released by vmscan. However, the code
should remain equally useful once we support large folio swap-in via
swapin_readahead(). This approach can effectively reduce page faults
and eliminate most of the redundant checks and early exits for MTE
restoration in the recent MTE patchset[3].

The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead()
will be split into separate patch sets and sent at a later time.

-v2:
- rebase on top of mm-unstable in which Ryan's swap_pte_batch() has changed
a lot.
- remove folio_add_new_anon_rmap() for !folio_test_anon(),
as currently large folios are always anon (refault).
- add mTHP swpin refault counters

-v1:
Link: https://lore.kernel.org/linux-mm/[email protected]/

Differences from the original large folios swap-in series:
- collected Reviewed-by and Acked-by tags;
- renamed swap_nr_free to swap_free_nr, per Ryan;
- limited the maximum kernel stack usage of swap_free_nr, per Ryan;
- added an output argument to swap_pte_batch to expose whether all entries
are exclusive;
- many cleanups and refinements; handled the corner case where a folio's
virtual address might not be naturally aligned.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/

Barry Song (2):
mm: swap_pte_batch: add an output argument to return if all swap
entries are exclusive
mm: add per-order mTHP swpin_refault counter

Chuanhua Han (3):
mm: swap: introduce swap_free_nr() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: swap: entirely map large folios found in swapcache

include/linux/huge_mm.h | 1 +
include/linux/swap.h | 5 +++
mm/huge_memory.c | 2 ++
mm/internal.h | 9 +++++-
mm/madvise.c | 2 +-
mm/memory.c | 69 ++++++++++++++++++++++++++++++++---------
mm/swapfile.c | 51 ++++++++++++++++++++++++++++++
7 files changed, 123 insertions(+), 16 deletions(-)

Appendix:

The following program can generate numerous instances where large folios
are hit in the swap cache if we enable 64KiB mTHP:

#echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define DATA_SIZE (128UL * 1024)
#define PAGE_SIZE (4UL * 1024)
#define LARGE_FOLIO_SIZE (64UL * 1024)

static void *write_data(void *addr)
{
        unsigned long i;

        for (i = 0; i < DATA_SIZE; i += PAGE_SIZE)
                memset(addr + i, (char)i, PAGE_SIZE);
        return NULL;
}

static void *read_data(void *addr)
{
        unsigned long i;

        for (i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
                if (*((char *)addr + i) != (char)i) {
                        perror("mismatched data");
                        _exit(-1);
                }
        }
        return NULL;
}

static void *pgout_data(void *addr)
{
        /* push the freshly written 64KiB folios out to swap */
        madvise(addr, DATA_SIZE, MADV_PAGEOUT);
        return NULL;
}

int main(int argc, char **argv)
{
        for (int i = 0; i < 10000; i++) {
                pthread_t tid1, tid2;
                void *addr = mmap(NULL, DATA_SIZE * 2, PROT_READ | PROT_WRITE,
                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                unsigned long aligned_addr;

                if (addr == MAP_FAILED) {
                        perror("fail to mmap");
                        return -1;
                }

                /* align to the 64KiB mTHP size so we fault in large folios */
                aligned_addr = ((unsigned long)addr + LARGE_FOLIO_SIZE) &
                               ~(LARGE_FOLIO_SIZE - 1);

                write_data((void *)aligned_addr);

                /* reclaim and re-read the same folios concurrently */
                if (pthread_create(&tid1, NULL, pgout_data, (void *)aligned_addr)) {
                        perror("fail to pthread_create");
                        return -1;
                }

                if (pthread_create(&tid2, NULL, read_data, (void *)aligned_addr)) {
                        perror("fail to pthread_create");
                        return -1;
                }

                pthread_join(tid1, NULL);
                pthread_join(tid2, NULL);
                munmap(addr, DATA_SIZE * 2);
        }

        return 0;
}

# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_swpout
932
# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_swpin_refault
1488

--
2.34.1



2024-04-09 08:33:51

by Barry Song

Subject: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

From: Barry Song <[email protected]>

Currently, we handle the scenario where we hit a large folio in the
swapcache while the reclamation of that large folio is still ongoing.
Add a per-order mTHP swpin_refault counter so this path is visible to users.

Signed-off-by: Barry Song <[email protected]>
---
include/linux/huge_mm.h | 1 +
mm/huge_memory.c | 2 ++
mm/memory.c | 1 +
3 files changed, 4 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c8256af83e33..b67294d5814f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -269,6 +269,7 @@ enum mthp_stat_item {
MTHP_STAT_ANON_ALLOC_FALLBACK,
MTHP_STAT_ANON_SWPOUT,
MTHP_STAT_ANON_SWPOUT_FALLBACK,
+ MTHP_STAT_ANON_SWPIN_REFAULT,
__MTHP_STAT_COUNT
};

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8d2ed80b0bf..fb95345b0bde 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);

static struct attribute *stats_attrs[] = {
&anon_alloc_attr.attr,
&anon_alloc_fallback_attr.attr,
&anon_swpout_attr.attr,
&anon_swpout_fallback_attr.attr,
+ &anon_swpin_refault_attr.attr,
NULL,
};

diff --git a/mm/memory.c b/mm/memory.c
index 9818dc1893c8..acc023795a4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
nr_pages = nr;
entry = folio->swap;
page = &folio->page;
+ count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
}

check_pte:
--
2.34.1


2024-04-09 08:50:01

by Barry Song

Subject: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

From: Chuanhua Han <[email protected]>

When a large folio is found in the swapcache, the current implementation
requires calling do_swap_page() nr_pages times, resulting in nr_pages
page faults. This patch opts to map the entire large folio at once to
minimize page faults. Additionally, redundant checks and early exits
for ARM64 MTE restoring are removed.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c4a52e8d740a..9818dc1893c8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages = 1;
+ unsigned long start_address = vmf->address;
+ pte_t *start_pte = vmf->pte;
+ bool any_swap_shared = false;

if (!pte_unmap_same(vmf))
goto out;
@@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+
+ /* We hit large folios in swapcache */
+ if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
+ int nr = folio_nr_pages(folio);
+ int idx = folio_page_idx(folio, page);
+ unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
+ unsigned long folio_end = folio_start + nr * PAGE_SIZE;
+ pte_t *folio_ptep;
+ pte_t folio_pte;
+
+ if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
+ goto check_pte;
+ if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
+ goto check_pte;
+
+ folio_ptep = vmf->pte - idx;
+ folio_pte = ptep_get(folio_ptep);
+ if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
+ swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
+ goto check_pte;
+
+ start_address = folio_start;
+ start_pte = folio_ptep;
+ nr_pages = nr;
+ entry = folio->swap;
+ page = &folio->page;
+ }
+
+check_pte:
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

@@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
exclusive = false;
}
+
+ /* Reuse the whole large folio iff all entries are exclusive */
+ if (nr_pages > 1 && any_swap_shared)
+ exclusive = false;
}

/*
@@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_free_nr(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
pte = mk_pte(page, vma->vm_page_prot);

/*
@@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || folio_ref_count(folio) == 1)) {
+ (exclusive || (folio_ref_count(folio) == nr_pages &&
+ folio_nr_pages(folio) == nr_pages))) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
pte = pte_mkuffd_wp(pte);
- vmf->orig_pte = pte;

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_add_new_anon_rmap(folio, vma, start_address);
folio_add_lru_vma(folio, vma);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
- rmap_flags);
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
+ rmap_flags);
}

VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+ vmf->orig_pte = ptep_get(vmf->pte);
+ arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.34.1


2024-04-10 23:15:49

by SeongJae Park

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Hi Barry,

On Tue, 9 Apr 2024 20:26:31 +1200 Barry Song <[email protected]> wrote:

> From: Barry Song <[email protected]>
>
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/huge_mm.h | 1 +
> mm/huge_memory.c | 2 ++
> mm/memory.c | 1 +
> 3 files changed, 4 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_ALLOC_FALLBACK,
> MTHP_STAT_ANON_SWPOUT,
> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> + MTHP_STAT_ANON_SWPIN_REFAULT,
> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>
> static struct attribute *stats_attrs[] = {
> &anon_alloc_attr.attr,
> &anon_alloc_fallback_attr.attr,
> &anon_swpout_attr.attr,
> &anon_swpout_fallback_attr.attr,
> + &anon_swpin_refault_attr.attr,
> NULL,
> };
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> nr_pages = nr;
> entry = folio->swap;
> page = &folio->page;
> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> }

From the latest mm-unstable tree, I get the below kunit build failure, and
'git bisect' points to this patch.

$ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
[16:07:40] Configuring KUnit Kernel ...
[16:07:40] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=../kunit.out/ olddefconfig
Building with:
$ make ARCH=um O=../kunit.out/ --jobs=36
ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
.../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
| ^~~~~~~~~~~~~~~
.../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
.../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors

My kunit build config doesn't have CONFIG_TRANSPARENT_HUGEPAGE. Maybe that's
the reason, and this patch, or the patch that introduced the function and the
enum, needs to take care of the case?


Thanks,
SJ

[...]

2024-04-11 01:47:04

by Barry Song

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

>> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>> }
>>
> From the latest mm-unstable tree, I get below kunit build failure and
> 'git bisect' points this patch.
>
> $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> [16:07:40] Configuring KUnit Kernel ...
> [16:07:40] Building KUnit Kernel ...
> Populating config with:
> $ make ARCH=um O=../kunit.out/ olddefconfig
> Building with:
> $ make ARCH=um O=../kunit.out/ --jobs=36
> ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
> .../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
> 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> | ^~~~~~~~~~~~~~~
> .../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
> 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> .../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
> cc1: some warnings being treated as errors
>
> My kunit build config doesn't have CONFIG_TRANSPARE_HUGEPAGE. Maybe that's the
> reason and this patch, or the patch that introduced the function and the enum
> need to take care of the case?

Hi SeongJae,
Thanks very much. Can you check if the below fixes the build? If so, I will
include this fix when sending v3.

Subject: [PATCH] mm: fix build errors on CONFIG_TRANSPARENT_HUGEPAGE=N

Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index acc023795a4d..1d587d1eb432 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4142,6 +4142,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* We hit large folios in swapcache */
if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
int nr = folio_nr_pages(folio);
@@ -4171,6 +4172,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

check_pte:
+#endif
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

--
2.34.1


2024-04-11 15:57:13

by Ryan Roberts

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On 09/04/2024 09:26, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 52 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c4a52e8d740a..9818dc1893c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address = vmf->address;
> + pte_t *start_pte = vmf->pte;

possible bug?: there are code paths that assign to vmf->pte below in this
function, so couldn't start_pte be stale in some cases? I'd just do the
assignment (all 4 of these variables in fact) in an else clause below, after any
messing about with them is complete.

nit: rename start_pte -> start_ptep ?

> + bool any_swap_shared = false;

Suggest you defer initialization of this to your "We hit large folios in
swapcache" block below, and init it to:

any_swap_shared = !pte_swp_exclusive(vmf->orig_pte);

Then the any_shared semantic in swap_pte_batch() can be the same as for
folio_pte_batch().

>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);

bug: vmf->pte may be NULL and you are not checking it until check_pte:. But you
are using it in this block. It also seems odd to do all the work in the below
block under the PTL but before checking if the pte has changed. Suggest moving
both checks here.
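
Something like the below, perhaps (untested sketch):

        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                                       &vmf->ptl);
        /* validate the pte before doing any of the batching work */
        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;

        if (folio_test_large(folio) && folio_test_swapcache(folio)) {
                /* ... batch detection as below; start_pte check no longer needed ... */
        }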

> +
> + /* We hit large folios in swapcache */
> + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {

What's the start_pte check protecting?

> + int nr = folio_nr_pages(folio);
> + int idx = folio_page_idx(folio, page);
> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> + pte_t *folio_ptep;
> + pte_t folio_pte;
> +
> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> + goto check_pte;
> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> + goto check_pte;
> +
> + folio_ptep = vmf->pte - idx;
> + folio_pte = ptep_get(folio_ptep);
> + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> + goto check_pte;
> +
> + start_address = folio_start;
> + start_pte = folio_ptep;
> + nr_pages = nr;
> + entry = folio->swap;
> + page = &folio->page;
> + }
> +
> +check_pte:
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> exclusive = false;
> }
> +
> + /* Reuse the whole large folio iff all entries are exclusive */
> + if (nr_pages > 1 && any_swap_shared)
> + exclusive = false;

If you init any_shared with the first pte as I suggested then you could just set
exclusive = !any_shared at the top of this if block without needing this
separate fixup.
> }
>
> /*
> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_free_nr(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> + folio_nr_pages(folio) == nr_pages))) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> pte = pte_mkuffd_wp(pte);

I'm not sure about all this... you are smearing these SW bits from the faulting
PTE across all the ptes you are mapping. Although I guess actually that's ok
because swap_pte_batch() only returns a batch with all these bits the same?

> - vmf->orig_pte = pte;

Instead of doing a readback below, perhaps:

vmf->orig_pte = pte_advance_pfn(pte, nr_pages);

>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, start_address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> - rmap_flags);
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> + rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> + vmf->orig_pte = ptep_get(vmf->pte);
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);


2024-04-11 17:27:18

by Ryan Roberts

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On 09/04/2024 09:26, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/huge_mm.h | 1 +
> mm/huge_memory.c | 2 ++
> mm/memory.c | 1 +
> 3 files changed, 4 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_ALLOC_FALLBACK,
> MTHP_STAT_ANON_SWPOUT,
> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> + MTHP_STAT_ANON_SWPIN_REFAULT,

I don't see any equivalent counter for small folios. Is there an analogue?

> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>
> static struct attribute *stats_attrs[] = {
> &anon_alloc_attr.attr,
> &anon_alloc_fallback_attr.attr,
> &anon_swpout_attr.attr,
> &anon_swpout_fallback_attr.attr,
> + &anon_swpin_refault_attr.attr,
> NULL,
> };
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> nr_pages = nr;
> entry = folio->swap;
> page = &folio->page;
> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);

I don't think this is the point of no return yet? There's the pte_same() check
immediately below (although I've suggested that needs to be moved to earlier),
but also the folio_test_uptodate() check. Perhaps this should go after that?

> }
>
> check_pte:


2024-04-11 17:42:48

by SeongJae Park

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Hi Barry,

On Thu, 11 Apr 2024 13:46:36 +1200 Barry Song <[email protected]> wrote:

> >> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >> }
> >>
> > From the latest mm-unstable tree, I get below kunit build failure and
> > 'git bisect' points this patch.
> >
> > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> > [16:07:40] Configuring KUnit Kernel ...
> > [16:07:40] Building KUnit Kernel ...
> > Populating config with:
> > $ make ARCH=um O=../kunit.out/ olddefconfig
> > Building with:
> > $ make ARCH=um O=../kunit.out/ --jobs=36
> > ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
> > .../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
> > 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> > | ^~~~~~~~~~~~~~~
> > .../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
> > 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > .../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
> > cc1: some warnings being treated as errors
> >
> > My kunit build config doesn't have CONFIG_TRANSPARE_HUGEPAGE. Maybe that's the
> > reason and this patch, or the patch that introduced the function and the enum
> > need to take care of the case?
>
> Hi SeongJae,
> Thanks very much, can you check if the below fix the build? If yes, I will
> include this fix while sending v3.

Thank you for the quick and kind reply :) I confirmed this fixes the build failure.

>
> Subject: [PATCH] mm: fix build errors on CONFIG_TRANSPARENT_HUGEPAGE=N
>
> Signed-off-by: Barry Song <[email protected]>
Tested-by: SeongJae Park <[email protected]>


Thanks,
SJ

> ---
> mm/memory.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index acc023795a4d..1d587d1eb432 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4142,6 +4142,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> /* We hit large folios in swapcache */
> if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> int nr = folio_nr_pages(folio);
> @@ -4171,6 +4172,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> check_pte:
> +#endif
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> --
> 2.34.1
>
>

2024-04-11 23:02:42

by Barry Song

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On Fri, Apr 12, 2024 at 3:53 AM Ryan Roberts <[email protected]> wrote:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > Currently, we are handling the scenario where we've hit a
> > large folio in the swapcache, and the reclaiming process
> > for this large folio is still ongoing.
> >
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/huge_mm.h | 1 +
> > mm/huge_memory.c | 2 ++
> > mm/memory.c | 1 +
> > 3 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index c8256af83e33..b67294d5814f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> > MTHP_STAT_ANON_ALLOC_FALLBACK,
> > MTHP_STAT_ANON_SWPOUT,
> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
> > + MTHP_STAT_ANON_SWPIN_REFAULT,
>
> I don't see any equivalent counter for small folios. Is there an analogue?

Indeed, we don't count refaults for small folios, as their refault mechanism
is much simpler than that of large folios. Implementing this counter enhances
the system's visibility for users.

Personally, having this counter and observing a non-zero value greatly enhances
my confidence when debugging this refault series. Otherwise, it feels like being
blind to what's happening inside the system :-)

>
> > __MTHP_STAT_COUNT
> > };
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d8d2ed80b0bf..fb95345b0bde 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >
> > static struct attribute *stats_attrs[] = {
> > &anon_alloc_attr.attr,
> > &anon_alloc_fallback_attr.attr,
> > &anon_swpout_attr.attr,
> > &anon_swpout_fallback_attr.attr,
> > + &anon_swpin_refault_attr.attr,
> > NULL,
> > };
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9818dc1893c8..acc023795a4d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > nr_pages = nr;
> > entry = folio->swap;
> > page = &folio->page;
> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>
> I don't think this is the point of no return yet? There's the pte_same() check
> immediately below (although I've suggested that needs to be moved to earlier),
> but also the folio_test_uptodate() check. Perhaps this should go after that?
>

If swap_pte_batch() == nr_pages, the pte_same() test should already pass.
!folio_test_uptodate(folio) should also be unlikely to be true, as we are not
reading from swap devices in the refault case.

But I agree we can move all the refault handling after those two "goto
out_nomap"s.
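
i.e. roughly (untested sketch):

        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;

        if (unlikely(!folio_test_uptodate(folio))) {
                ret = VM_FAULT_SIGBUS;
                goto out_nomap;
        }

        /*
         * The whole large-folio batching block, including the
         * MTHP_STAT_ANON_SWPIN_REFAULT count, would move here.
         */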

> > }
> >
> > check_pte:
>

Thanks
Barry

2024-04-11 23:30:31

by Barry Song

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Fri, Apr 12, 2024 at 3:33 AM Ryan Roberts <[email protected]> wrote:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> > 1 file changed, 52 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c4a52e8d740a..9818dc1893c8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address = vmf->address;
> > + pte_t *start_pte = vmf->pte;
>
> possible bug?: there are code paths that assign to vmf-pte below in this
> function, so couldn't start_pte be stale in some cases? I'd just do the
> assignment (all 4 of these variables in fact) in an else clause below, after any
> messing about with them is complete.
>
> nit: rename start_pte -> start_ptep ?

Agreed.

>
> > + bool any_swap_shared = false;
>
> Suggest you defer initialization of this to your "We hit large folios in
> swapcache" block below, and init it to:
>
> any_swap_shared = !pte_swp_exclusive(vmf->pte);
>
> Then the any_shared semantic in swap_pte_batch() can be the same as for
> folio_pte_batch().
>
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
>
> bug: vmf->pte may be NULL and you are not checking it until check_pte:. Byt you
> are using it in this block. It also seems odd to do all the work in the below
> block under the PTL but before checking if the pte has changed. Suggest moving
> both checks here.

agreed.

>
> > +
> > + /* We hit large folios in swapcache */
> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>
> What's the start_pte check protecting?

This is exactly protecting against the case vmf->pte == NULL, but for some
reason start_pte was incorrectly assigned at the beginning of the function.
The intention of the code was actually to do start_pte = vmf->pte after
"vmf->pte = pte_offset_map_lock()".
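
i.e. the intended flow was roughly (sketch only):

        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                                       &vmf->ptl);
        start_address = vmf->address;
        start_pte = vmf->pte;           /* may be NULL, hence the check below */

        if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
                /* batch detection as in the patch */
        }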

>
> > + int nr = folio_nr_pages(folio);
> > + int idx = folio_page_idx(folio, page);
> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > + pte_t *folio_ptep;
> > + pte_t folio_pte;
> > +
> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > + goto check_pte;
> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > + goto check_pte;
> > +
> > + folio_ptep = vmf->pte - idx;
> > + folio_pte = ptep_get(folio_ptep);
> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > + goto check_pte;
> > +
> > + start_address = folio_start;
> > + start_pte = folio_ptep;
> > + nr_pages = nr;
> > + entry = folio->swap;
> > + page = &folio->page;
> > + }
> > +
> > +check_pte:
> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > goto out_nomap;
> >
> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > exclusive = false;
> > }
> > +
> > + /* Reuse the whole large folio iff all entries are exclusive */
> > + if (nr_pages > 1 && any_swap_shared)
> > + exclusive = false;
>
> If you init any_shared with the firt pte as I suggested then you could just set
> exclusive = !any_shared at the top of this if block without needing this
> separate fixup.

Since your swap_pte_batch() function checks that all PTEs have the same
exclusive bits, I'll be removing any_shared first in version 3 per David's
suggestion. We could potentially develop "any_shared" as an incremental
patchset later on.

> > }
> >
> > /*
> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free(entry);
> > + swap_free_nr(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> > + folio_nr_pages(folio) == nr_pages))) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > pte = pte_mkuffd_wp(pte);
>
> I'm not sure about all this... you are smearing these SW bits from the faulting
> PTE across all the ptes you are mapping. Although I guess actually that's ok
> because swap_pte_batch() only returns a batch with all these bits the same?

Initially, I didn't recognize the issue at all because the tested
architecture, arm64, doesn't include these bits. However, after reviewing your
latest swpout series, which verifies that the soft_dirty and uffd_wp bits are
consistent across a batch, I now feel it is safe even for platforms with these
bits.

>
> > - vmf->orig_pte = pte;
>
> Instead of doing a readback below, perhaps:
>
> vmf->orig_pte = pte_advance_pfn(pte, nr_pages);

Nice !

>
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> > + folio_add_new_anon_rmap(folio, vma, start_address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > - rmap_flags);
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > + rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > + vmf->orig_pte = ptep_get(vmf->pte);
> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> >
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>

Thanks
Barry

2024-04-12 11:33:51

by Ryan Roberts

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On 12/04/2024 00:30, Barry Song wrote:
> On Fri, Apr 12, 2024 at 3:33 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 09/04/2024 09:26, Barry Song wrote:
>>> From: Chuanhua Han <[email protected]>
>>>
>>> When a large folio is found in the swapcache, the current implementation
>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>> page faults. This patch opts to map the entire large folio at once to
>>> minimize page faults. Additionally, redundant checks and early exits
>>> for ARM64 MTE restoring are removed.
>>>
>>> Signed-off-by: Chuanhua Han <[email protected]>
>>> Co-developed-by: Barry Song <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> ---
>>> mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
>>> 1 file changed, 52 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index c4a52e8d740a..9818dc1893c8 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> pte_t pte;
>>> vm_fault_t ret = 0;
>>> void *shadow = NULL;
>>> + int nr_pages = 1;
>>> + unsigned long start_address = vmf->address;
>>> + pte_t *start_pte = vmf->pte;
>>
>> possible bug?: there are code paths that assign to vmf-pte below in this
>> function, so couldn't start_pte be stale in some cases? I'd just do the
>> assignment (all 4 of these variables in fact) in an else clause below, after any
>> messing about with them is complete.
>>
>> nit: rename start_pte -> start_ptep ?
>
> Agreed.
>
>>
>>> + bool any_swap_shared = false;
>>
>> Suggest you defer initialization of this to your "We hit large folios in
>> swapcache" block below, and init it to:
>>
>> any_swap_shared = !pte_swp_exclusive(vmf->pte);
>>
>> Then the any_shared semantic in swap_pte_batch() can be the same as for
>> folio_pte_batch().
>>
>>>
>>> if (!pte_unmap_same(vmf))
>>> goto out;
>>> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> */
>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>> &vmf->ptl);
>>
>> bug: vmf->pte may be NULL and you are not checking it until check_pte:. Byt you
>> are using it in this block. It also seems odd to do all the work in the below
>> block under the PTL but before checking if the pte has changed. Suggest moving
>> both checks here.
>
> agreed.
>
>>
>>> +
>>> + /* We hit large folios in swapcache */
>>> + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>>
>> What's the start_pte check protecting?
>
> This is exactly protecting the case vmf->pte==NULL but for some reason it was
> assigned in the beginning of the function incorrectly. The intention of the code
> was actually doing start_pte = vmf->pte after "vmf->pte = pte_offset_map_lock".
>
>>
>>> + int nr = folio_nr_pages(folio);
>>> + int idx = folio_page_idx(folio, page);
>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>> + pte_t *folio_ptep;
>>> + pte_t folio_pte;
>>> +
>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>> + goto check_pte;
>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>> + goto check_pte;
>>> +
>>> + folio_ptep = vmf->pte - idx;
>>> + folio_pte = ptep_get(folio_ptep);
>>> + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>>> + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>>> + goto check_pte;
>>> +
>>> + start_address = folio_start;
>>> + start_pte = folio_ptep;
>>> + nr_pages = nr;
>>> + entry = folio->swap;
>>> + page = &folio->page;
>>> + }
>>> +
>>> +check_pte:
>>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>> goto out_nomap;
>>>
>>> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> */
>>> exclusive = false;
>>> }
>>> +
>>> + /* Reuse the whole large folio iff all entries are exclusive */
>>> + if (nr_pages > 1 && any_swap_shared)
>>> + exclusive = false;
>>
>> If you init any_shared with the firt pte as I suggested then you could just set
>> exclusive = !any_shared at the top of this if block without needing this
>> separate fixup.
>
> Since your swap_pte_batch() function checks that all PTEs have the same
> exclusive bits, I'll be removing any_shared first in version 3 per David's
> suggestions. We could potentially develop "any_shared" as an incremental
> patchset later on .

Ahh yes, good point. I'll admit that your conversation about this went over my
head at the time since I hadn't yet looked at this.

>
>>> }
>>>
>>> /*
>>> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * We're already holding a reference on the page but haven't mapped it
>>> * yet.
>>> */
>>> - swap_free(entry);
>>> + swap_free_nr(entry, nr_pages);
>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>> folio_free_swap(folio);
>>>
>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> + folio_ref_add(folio, nr_pages - 1);
>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>> +
>>> pte = mk_pte(page, vma->vm_page_prot);
>>>
>>> /*
>>> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * exclusivity.
>>> */
>>> if (!folio_test_ksm(folio) &&
>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>> + folio_nr_pages(folio) == nr_pages))) {
>>> if (vmf->flags & FAULT_FLAG_WRITE) {
>>> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>> vmf->flags &= ~FAULT_FLAG_WRITE;
>>> }
>>> rmap_flags |= RMAP_EXCLUSIVE;
>>> }
>>> - flush_icache_page(vma, page);
>>> + flush_icache_pages(vma, page, nr_pages);
>>> if (pte_swp_soft_dirty(vmf->orig_pte))
>>> pte = pte_mksoft_dirty(pte);
>>> if (pte_swp_uffd_wp(vmf->orig_pte))
>>> pte = pte_mkuffd_wp(pte);
>>
>> I'm not sure about all this... you are smearing these SW bits from the faulting
>> PTE across all the ptes you are mapping. Although I guess actually that's ok
>> because swap_pte_batch() only returns a batch with all these bits the same?
>
> Initially, I didn't recognize the issue at all because the tested
> architecture arm64
> didn't include these bits. However, after reviewing your latest swpout series,
> which verifies the consistent bits for soft_dirty and uffd_wp, I now
> feel its safety
> even for platforms with these bits.

Yep, agreed.

>
>>
>>> - vmf->orig_pte = pte;
>>
>> Instead of doing a readback below, perhaps:
>>
>> vmf->orig_pte = pte_advance_pfn(pte, nr_pages);
>
> Nice !
>
>>
>>>
>>> /* ksm created a completely new copy */
>>> if (unlikely(folio != swapcache && swapcache)) {
>>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> + folio_add_new_anon_rmap(folio, vma, start_address);
>>> folio_add_lru_vma(folio, vma);
>>> } else {
>>> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>> - rmap_flags);
>>> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>>> + rmap_flags);
>>> }
>>>
>>> VM_BUG_ON(!folio_test_anon(folio) ||
>>> (pte_write(pte) && !PageAnonExclusive(page)));
>>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>>> + vmf->orig_pte = ptep_get(vmf->pte);
>>> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>>>
>>> folio_unlock(folio);
>>> if (folio != swapcache && swapcache) {
>>> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>
>>> /* No need to invalidate - it was non-present before */
>>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>>> unlock:
>>> if (vmf->pte)
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>
>
> Thanks
> Barry


2024-04-15 08:40:06

by Huang, Ying

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

Barry Song <[email protected]> writes:

> From: Chuanhua Han <[email protected]>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 52 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c4a52e8d740a..9818dc1893c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address = vmf->address;
> + pte_t *start_pte = vmf->pte;

IMHO, it's better to rename the above 2 local variables to "address" and
"ptep". Just my personal opinion. Feel free to ignore the comments.

> + bool any_swap_shared = false;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);

We should move the pte check here. That is,

if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
        goto out_nomap;

This will simplify the situation for large folios.

> +
> + /* We hit large folios in swapcache */

The comment seems unnecessary because the code already says that.

> + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> + int nr = folio_nr_pages(folio);
> + int idx = folio_page_idx(folio, page);
> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> + pte_t *folio_ptep;
> + pte_t folio_pte;
> +
> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> + goto check_pte;
> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> + goto check_pte;
> +
> + folio_ptep = vmf->pte - idx;
> + folio_pte = ptep_get(folio_ptep);

It's better to construct the pte based on the faulting PTE by generalizing
pte_next_swp_offset() (maybe as pte_move_swp_offset()). Then we can find
inconsistent PTEs more quickly.
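
For example (a hypothetical pte_move_swp_offset() modelled on
pte_next_swp_offset(); untested):

/*
 * Return a pte with the swap offset moved by delta, preserving the swp
 * pte software bits, so the expected pte of any entry in the folio can
 * be constructed from the faulting pte without re-reading it.
 */
static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
        swp_entry_t entry = pte_to_swp_entry(pte);
        pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
                                                   swp_offset(entry) + delta));

        if (pte_swp_soft_dirty(pte))
                new = pte_swp_mksoft_dirty(new);
        if (pte_swp_exclusive(pte))
                new = pte_swp_mkexclusive(new);
        if (pte_swp_uffd_wp(pte))
                new = pte_swp_mkuffd_wp(new);

        return new;
}

Then the expected first pte could be compared against something like
pte_move_swp_offset(vmf->orig_pte, -idx) before walking the batch.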

> + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> + goto check_pte;
> +
> + start_address = folio_start;
> + start_pte = folio_ptep;
> + nr_pages = nr;
> + entry = folio->swap;
> + page = &folio->page;
> + }
> +
> +check_pte:
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> exclusive = false;
> }
> +
> + /* Reuse the whole large folio iff all entries are exclusive */
> + if (nr_pages > 1 && any_swap_shared)
> + exclusive = false;
> }
>
> /*
> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_free_nr(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> + folio_nr_pages(folio) == nr_pages))) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> pte = pte_mkuffd_wp(pte);
> - vmf->orig_pte = pte;
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, start_address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> - rmap_flags);
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> + rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> + vmf->orig_pte = ptep_get(vmf->pte);
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);

Do we need to call arch_do_swap_page() for each subpage? IIUC, the
corresponding arch_unmap_one() will be called for each subpage.

> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);

--
Best Regards,
Huang, Ying

2024-04-15 09:11:51

by Barry Song

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > From: Chuanhua Han <[email protected]>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> > 1 file changed, 52 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c4a52e8d740a..9818dc1893c8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address = vmf->address;
> > + pte_t *start_pte = vmf->pte;
>
> IMHO, it's better to rename the above 2 local variables to "address" and
> "ptep". Just my personal opinion. Feel free to ignore the comments.

fine.

>
> > + bool any_swap_shared = false;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
>
> We should move pte check here. That is,
>
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> This will simplify the situation for large folio.

The plan is to move the whole code block

if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))

after

if (unlikely(!folio_test_uptodate(folio))) {
        ret = VM_FAULT_SIGBUS;
        goto out_nomap;
}

Though !folio_test_uptodate(folio) can't actually be true when we hit the
swapcache, it seems logically better for future use.

>
> > +
> > + /* We hit large folios in swapcache */
>
> The comments seems unnecessary because the code tells that already.
>
> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> > + int nr = folio_nr_pages(folio);
> > + int idx = folio_page_idx(folio, page);
> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > + pte_t *folio_ptep;
> > + pte_t folio_pte;
> > +
> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > + goto check_pte;
> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > + goto check_pte;
> > +
> > + folio_ptep = vmf->pte - idx;
> > + folio_pte = ptep_get(folio_ptep);
>
> It's better to construct pte based on fault PTE via generalizing
> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> inconsistent PTEs quicker.

It seems your point is to derive the pte of page 0 via pte_next_swp_offset();
unfortunately, pte_next_swp_offset() can't go backwards. On the other hand, we
have to check the real pte value of the 0th entry right now, because
swap_pte_batch() only really reads ptes from the 1st entry onwards; it assumes
the pte argument is the real value of the 0th pte entry.

static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
{
        pte_t expected_pte = pte_next_swp_offset(pte);
        const pte_t *end_ptep = start_ptep + max_nr;
        pte_t *ptep = start_ptep + 1;

        VM_WARN_ON(max_nr < 1);
        VM_WARN_ON(!is_swap_pte(pte));
        VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));

        while (ptep < end_ptep) {
                pte = ptep_get(ptep);

                if (!pte_same(pte, expected_pte))
                        break;

                expected_pte = pte_next_swp_offset(expected_pte);
                ptep++;
        }

        return ptep - start_ptep;
}

>
> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > + goto check_pte;
> > +
> > + start_address = folio_start;
> > + start_pte = folio_ptep;
> > + nr_pages = nr;
> > + entry = folio->swap;
> > + page = &folio->page;
> > + }
> > +
> > +check_pte:
> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > goto out_nomap;
> >
> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > exclusive = false;
> > }
> > +
> > + /* Reuse the whole large folio iff all entries are exclusive */
> > + if (nr_pages > 1 && any_swap_shared)
> > + exclusive = false;
> > }
> >
> > /*
> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free(entry);
> > + swap_free_nr(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> > + folio_nr_pages(folio) == nr_pages))) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > pte = pte_mkuffd_wp(pte);
> > - vmf->orig_pte = pte;
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> > + folio_add_new_anon_rmap(folio, vma, start_address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > - rmap_flags);
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > + rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > + vmf->orig_pte = ptep_get(vmf->pte);
> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>
> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
> corresponding arch_unmap_one() will be called for each subpage.

i actually thought about this very carefully. right now, the only arch that
needs this is sparc, and it doesn't support THP_SWAPOUT at all; and there is
no proof that doing the restoration one by one wouldn't break sparc either.
so i'd like to defer this to when sparc really needs THP_SWAPOUT.
on the other hand, it seems really bad that we have both

arch_swap_restore - for this, arm64 has moved to using folio
and
arch_do_swap_page

we should somehow unify them later if sparc wants THP_SWAPOUT.

>
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> --
> Best Regards,
> Huang, Ying

2024-04-16 02:27:42

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache


Added Khalid for arch_do_swap_page().

Barry Song <[email protected]> writes:

> On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:

[snip]

>>
>> > + bool any_swap_shared = false;
>> >
>> > if (!pte_unmap_same(vmf))
>> > goto out;
>> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > */
>> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> > &vmf->ptl);
>>
>> We should move pte check here. That is,
>>
>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> goto out_nomap;
>>
>> This will simplify the situation for large folio.
>
> the plan is moving the whole code block
>
> if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>
> after
> if (unlikely(!folio_test_uptodate(folio))) {
> ret = VM_FAULT_SIGBUS;
> goto out_nomap;
> }
>
> though we couldn't be !folio_test_uptodate(folio)) for hitting
> swapcache but it seems
> logically better for future use.

LGTM, Thanks!

>>
>> > +
>> > + /* We hit large folios in swapcache */
>>
>> The comments seems unnecessary because the code tells that already.
>>
>> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> > + int nr = folio_nr_pages(folio);
>> > + int idx = folio_page_idx(folio, page);
>> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> > + pte_t *folio_ptep;
>> > + pte_t folio_pte;
>> > +
>> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> > + goto check_pte;
>> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> > + goto check_pte;
>> > +
>> > + folio_ptep = vmf->pte - idx;
>> > + folio_pte = ptep_get(folio_ptep);
>>
>> It's better to construct pte based on fault PTE via generalizing
>> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
>> inconsistent PTEs quicker.
>
> it seems your point is getting the pte of page0 by pte_next_swp_offset()
> unfortunately pte_next_swp_offset can't go back. on the other hand,
> we have to check the real pte value of the 0nd entry right now because
> swap_pte_batch() only really reads pte from the 1st entry. it assumes
> pte argument is the real value for the 0nd pte entry.
>
> static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> {
> pte_t expected_pte = pte_next_swp_offset(pte);
> const pte_t *end_ptep = start_ptep + max_nr;
> pte_t *ptep = start_ptep + 1;
>
> VM_WARN_ON(max_nr < 1);
> VM_WARN_ON(!is_swap_pte(pte));
> VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>
> while (ptep < end_ptep) {
> pte = ptep_get(ptep);
>
> if (!pte_same(pte, expected_pte))
> break;
>
> expected_pte = pte_next_swp_offset(expected_pte);
> ptep++;
> }
>
> return ptep - start_ptep;
> }

Yes. You are right.

But we may check whether the pte of page0 is the same as
"vmf->orig_pte - folio_page_idx()" (fake code).

You need to check the pte of page 0 anyway.

>>
>> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> > + goto check_pte;
>> > +
>> > + start_address = folio_start;
>> > + start_pte = folio_ptep;
>> > + nr_pages = nr;
>> > + entry = folio->swap;
>> > + page = &folio->page;
>> > + }
>> > +
>> > +check_pte:
>> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> > goto out_nomap;
>> >
>> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > */
>> > exclusive = false;
>> > }
>> > +
>> > + /* Reuse the whole large folio iff all entries are exclusive */
>> > + if (nr_pages > 1 && any_swap_shared)
>> > + exclusive = false;
>> > }
>> >
>> > /*
>> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > * We're already holding a reference on the page but haven't mapped it
>> > * yet.
>> > */
>> > - swap_free(entry);
>> > + swap_free_nr(entry, nr_pages);
>> > if (should_try_to_free_swap(folio, vma, vmf->flags))
>> > folio_free_swap(folio);
>> >
>> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>> > + folio_ref_add(folio, nr_pages - 1);
>> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>> > +
>> > pte = mk_pte(page, vma->vm_page_prot);
>> >
>> > /*
>> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > * exclusivity.
>> > */
>> > if (!folio_test_ksm(folio) &&
>> > - (exclusive || folio_ref_count(folio) == 1)) {
>> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
>> > + folio_nr_pages(folio) == nr_pages))) {
>> > if (vmf->flags & FAULT_FLAG_WRITE) {
>> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>> > vmf->flags &= ~FAULT_FLAG_WRITE;
>> > }
>> > rmap_flags |= RMAP_EXCLUSIVE;
>> > }
>> > - flush_icache_page(vma, page);
>> > + flush_icache_pages(vma, page, nr_pages);
>> > if (pte_swp_soft_dirty(vmf->orig_pte))
>> > pte = pte_mksoft_dirty(pte);
>> > if (pte_swp_uffd_wp(vmf->orig_pte))
>> > pte = pte_mkuffd_wp(pte);
>> > - vmf->orig_pte = pte;
>> >
>> > /* ksm created a completely new copy */
>> > if (unlikely(folio != swapcache && swapcache)) {
>> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
>> > + folio_add_new_anon_rmap(folio, vma, start_address);
>> > folio_add_lru_vma(folio, vma);
>> > } else {
>> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>> > - rmap_flags);
>> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>> > + rmap_flags);
>> > }
>> >
>> > VM_BUG_ON(!folio_test_anon(folio) ||
>> > (pte_write(pte) && !PageAnonExclusive(page)));
>> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>> > + vmf->orig_pte = ptep_get(vmf->pte);
>> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>>
>> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
>> corresponding arch_unmap_one() will be called for each subpage.
>
> i actually thought about this very carefully, right now, the only one who
> needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> there is no proof doing restoration one by one won't really break sparc.
> so i'd like to defer this to when sparc really needs THP_SWAPOUT.

Let's ask SPARC developer (Cced) for this.

IMHO, even if we cannot get help, we need to change the code based on our
own understanding instead of deferring it.

> on the other hand, it seems really bad we have both
> arch_swap_restore - for this, arm64 has moved to using folio
> and
> arch_do_swap_page
>
> we should somehow unify them later if sparc wants THP_SWPOUT.
>
>>
>> > folio_unlock(folio);
>> > if (folio != swapcache && swapcache) {
>> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > }
>> >
>> > /* No need to invalidate - it was non-present before */
>> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>> > unlock:
>> > if (vmf->pte)
>> > pte_unmap_unlock(vmf->pte, vmf->ptl);

--
Best Regards,
Huang, Ying

2024-04-16 02:36:31

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
>
>
> Added Khalid for arch_do_swap_page().
>
> Barry Song <[email protected]> writes:
>
> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
>
> [snip]
>
> >>
> >> > + bool any_swap_shared = false;
> >> >
> >> > if (!pte_unmap_same(vmf))
> >> > goto out;
> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > */
> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> > &vmf->ptl);
> >>
> >> We should move pte check here. That is,
> >>
> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> goto out_nomap;
> >>
> >> This will simplify the situation for large folio.
> >
> > the plan is moving the whole code block
> >
> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >
> > after
> > if (unlikely(!folio_test_uptodate(folio))) {
> > ret = VM_FAULT_SIGBUS;
> > goto out_nomap;
> > }
> >
> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> > swapcache but it seems
> > logically better for future use.
>
> LGTM, Thanks!
>
> >>
> >> > +
> >> > + /* We hit large folios in swapcache */
> >>
> >> The comments seems unnecessary because the code tells that already.
> >>
> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> > + int nr = folio_nr_pages(folio);
> >> > + int idx = folio_page_idx(folio, page);
> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> > + pte_t *folio_ptep;
> >> > + pte_t folio_pte;
> >> > +
> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> > + goto check_pte;
> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> > + goto check_pte;
> >> > +
> >> > + folio_ptep = vmf->pte - idx;
> >> > + folio_pte = ptep_get(folio_ptep);
> >>
> >> It's better to construct pte based on fault PTE via generalizing
> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> >> inconsistent PTEs quicker.
> >
> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> > we have to check the real pte value of the 0nd entry right now because
> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> > pte argument is the real value for the 0nd pte entry.
> >
> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> > {
> > pte_t expected_pte = pte_next_swp_offset(pte);
> > const pte_t *end_ptep = start_ptep + max_nr;
> > pte_t *ptep = start_ptep + 1;
> >
> > VM_WARN_ON(max_nr < 1);
> > VM_WARN_ON(!is_swap_pte(pte));
> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >
> > while (ptep < end_ptep) {
> > pte = ptep_get(ptep);
> >
> > if (!pte_same(pte, expected_pte))
> > break;
> >
> > expected_pte = pte_next_swp_offset(expected_pte);
> > ptep++;
> > }
> >
> > return ptep - start_ptep;
> > }
>
> Yes. You are right.
>
> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> folio_page_idx()" (fake code).

right, that is why we are reading and checking PTE0 before calling
swap_pte_batch() right now.

	folio_ptep = vmf->pte - idx;
	folio_pte = ptep_get(folio_ptep);
	if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
	    swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
		goto check_pte;

So, if I understand correctly, you're proposing that we should directly check
PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
However, I'd also like to hear the feedback from Ryan and David :-)

>
> You need to check the pte of page 0 anyway.
>
> >>
> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> > + goto check_pte;
> >> > +
> >> > + start_address = folio_start;
> >> > + start_pte = folio_ptep;
> >> > + nr_pages = nr;
> >> > + entry = folio->swap;
> >> > + page = &folio->page;
> >> > + }
> >> > +
> >> > +check_pte:
> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> > goto out_nomap;
> >> >
> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > */
> >> > exclusive = false;
> >> > }
> >> > +
> >> > + /* Reuse the whole large folio iff all entries are exclusive */
> >> > + if (nr_pages > 1 && any_swap_shared)
> >> > + exclusive = false;
> >> > }
> >> >
> >> > /*
> >> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > * We're already holding a reference on the page but haven't mapped it
> >> > * yet.
> >> > */
> >> > - swap_free(entry);
> >> > + swap_free_nr(entry, nr_pages);
> >> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> >> > folio_free_swap(folio);
> >> >
> >> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >> > + folio_ref_add(folio, nr_pages - 1);
> >> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >> > +
> >> > pte = mk_pte(page, vma->vm_page_prot);
> >> >
> >> > /*
> >> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > * exclusivity.
> >> > */
> >> > if (!folio_test_ksm(folio) &&
> >> > - (exclusive || folio_ref_count(folio) == 1)) {
> >> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> >> > + folio_nr_pages(folio) == nr_pages))) {
> >> > if (vmf->flags & FAULT_FLAG_WRITE) {
> >> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >> > vmf->flags &= ~FAULT_FLAG_WRITE;
> >> > }
> >> > rmap_flags |= RMAP_EXCLUSIVE;
> >> > }
> >> > - flush_icache_page(vma, page);
> >> > + flush_icache_pages(vma, page, nr_pages);
> >> > if (pte_swp_soft_dirty(vmf->orig_pte))
> >> > pte = pte_mksoft_dirty(pte);
> >> > if (pte_swp_uffd_wp(vmf->orig_pte))
> >> > pte = pte_mkuffd_wp(pte);
> >> > - vmf->orig_pte = pte;
> >> >
> >> > /* ksm created a completely new copy */
> >> > if (unlikely(folio != swapcache && swapcache)) {
> >> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> > + folio_add_new_anon_rmap(folio, vma, start_address);
> >> > folio_add_lru_vma(folio, vma);
> >> > } else {
> >> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> >> > - rmap_flags);
> >> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >> > + rmap_flags);
> >> > }
> >> >
> >> > VM_BUG_ON(!folio_test_anon(folio) ||
> >> > (pte_write(pte) && !PageAnonExclusive(page)));
> >> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> >> > + vmf->orig_pte = ptep_get(vmf->pte);
> >> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> >>
> >> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
> >> corresponding arch_unmap_one() will be called for each subpage.
> >
> > i actually thought about this very carefully, right now, the only one who
> > needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> > there is no proof doing restoration one by one won't really break sparc.
> > so i'd like to defer this to when sparc really needs THP_SWAPOUT.
>
> Let's ask SPARC developer (Cced) for this.
>
> IMHO, even if we cannot get help, we need to change code with our
> understanding instead of deferring it.

ok. Thanks for Ccing sparc developers.

>
> > on the other hand, it seems really bad we have both
> > arch_swap_restore - for this, arm64 has moved to using folio
> > and
> > arch_do_swap_page
> >
> > we should somehow unify them later if sparc wants THP_SWPOUT.
> >
> >>
> >> > folio_unlock(folio);
> >> > if (folio != swapcache && swapcache) {
> >> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > }
> >> >
> >> > /* No need to invalidate - it was non-present before */
> >> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >> > unlock:
> >> > if (vmf->pte)
> >> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-04-16 02:52:01

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

Barry Song <[email protected]> writes:

> On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
>>
>>
>> Added Khalid for arch_do_swap_page().
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Barry Song <[email protected]> writes:
>>
>> [snip]
>>
>> >>
>> >> > + bool any_swap_shared = false;
>> >> >
>> >> > if (!pte_unmap_same(vmf))
>> >> > goto out;
>> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> > */
>> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> >> > &vmf->ptl);
>> >>
>> >> We should move pte check here. That is,
>> >>
>> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> goto out_nomap;
>> >>
>> >> This will simplify the situation for large folio.
>> >
>> > the plan is moving the whole code block
>> >
>> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>> >
>> > after
>> > if (unlikely(!folio_test_uptodate(folio))) {
>> > ret = VM_FAULT_SIGBUS;
>> > goto out_nomap;
>> > }
>> >
>> > though we couldn't be !folio_test_uptodate(folio)) for hitting
>> > swapcache but it seems
>> > logically better for future use.
>>
>> LGTM, Thanks!
>>
>> >>
>> >> > +
>> >> > + /* We hit large folios in swapcache */
>> >>
>> >> The comments seems unnecessary because the code tells that already.
>> >>
>> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> >> > + int nr = folio_nr_pages(folio);
>> >> > + int idx = folio_page_idx(folio, page);
>> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> >> > + pte_t *folio_ptep;
>> >> > + pte_t folio_pte;
>> >> > +
>> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> >> > + goto check_pte;
>> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> >> > + goto check_pte;
>> >> > +
>> >> > + folio_ptep = vmf->pte - idx;
>> >> > + folio_pte = ptep_get(folio_ptep);
>> >>
>> >> It's better to construct pte based on fault PTE via generalizing
>> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
>> >> inconsistent PTEs quicker.
>> >
>> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
>> > unfortunately pte_next_swp_offset can't go back. on the other hand,
>> > we have to check the real pte value of the 0nd entry right now because
>> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
>> > pte argument is the real value for the 0nd pte entry.
>> >
>> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>> > {
>> > pte_t expected_pte = pte_next_swp_offset(pte);
>> > const pte_t *end_ptep = start_ptep + max_nr;
>> > pte_t *ptep = start_ptep + 1;
>> >
>> > VM_WARN_ON(max_nr < 1);
>> > VM_WARN_ON(!is_swap_pte(pte));
>> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>> >
>> > while (ptep < end_ptep) {
>> > pte = ptep_get(ptep);
>> >
>> > if (!pte_same(pte, expected_pte))
>> > break;
>> >
>> > expected_pte = pte_next_swp_offset(expected_pte);
>> > ptep++;
>> > }
>> >
>> > return ptep - start_ptep;
>> > }
>>
>> Yes. You are right.
>>
>> But we may check whether the pte of page0 is same as "vmf->orig_pte -
>> folio_page_idx()" (fake code).
>
> right, that is why we are reading and checking PTE0 before calling
> swap_pte_batch()
> right now.
>
> folio_ptep = vmf->pte - idx;
> folio_pte = ptep_get(folio_ptep);
> if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> goto check_pte;
>
> So, if I understand correctly, you're proposing that we should directly check
> PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> However, I'd also like to hear the feedback from Ryan and David :-)

I mean that we can replace

!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))

in the above code with a pte_same() check against a constructed expected
first pte.
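
For example, with a generalized helper such as pte_move_swp_offset()
(suggested earlier; illustrative only), the PTE0 check could be sketched as:

	folio_ptep = vmf->pte - idx;
	folio_pte = ptep_get(folio_ptep);
	/* one pte_same() against the constructed PTE0 replaces the two checks */
	if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
	    swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
		goto check_pte;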

>>
>> You need to check the pte of page 0 anyway.
>>
>> >>
>> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> >> > + goto check_pte;
>> >> > +
>> >> > + start_address = folio_start;
>> >> > + start_pte = folio_ptep;
>> >> > + nr_pages = nr;
>> >> > + entry = folio->swap;
>> >> > + page = &folio->page;
>> >> > + }
>> >> > +
>> >> > +check_pte:
>> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> > goto out_nomap;
>> >> >
>> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> > */
>> >> > exclusive = false;
>> >> > }
>> >> > +
>> >> > + /* Reuse the whole large folio iff all entries are exclusive */
>> >> > + if (nr_pages > 1 && any_swap_shared)
>> >> > + exclusive = false;
>> >> > }
>> >> >

[snip]

--
Best Regards,
Huang, Ying

2024-04-16 02:52:45

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
> >>
> >>
> >> Added Khalid for arch_do_swap_page().
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >> Barry Song <[email protected]> writes:
> >>
> >> [snip]
> >>
> >> >>
> >> >> > + bool any_swap_shared = false;
> >> >> >
> >> >> > if (!pte_unmap_same(vmf))
> >> >> > goto out;
> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> > */
> >> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> >> > &vmf->ptl);
> >> >>
> >> >> We should move pte check here. That is,
> >> >>
> >> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> goto out_nomap;
> >> >>
> >> >> This will simplify the situation for large folio.
> >> >
> >> > the plan is moving the whole code block
> >> >
> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >> >
> >> > after
> >> > if (unlikely(!folio_test_uptodate(folio))) {
> >> > ret = VM_FAULT_SIGBUS;
> >> > goto out_nomap;
> >> > }
> >> >
> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> >> > swapcache but it seems
> >> > logically better for future use.
> >>
> >> LGTM, Thanks!
> >>
> >> >>
> >> >> > +
> >> >> > + /* We hit large folios in swapcache */
> >> >>
> >> >> The comments seems unnecessary because the code tells that already.
> >> >>
> >> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> >> > + int nr = folio_nr_pages(folio);
> >> >> > + int idx = folio_page_idx(folio, page);
> >> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> >> > + pte_t *folio_ptep;
> >> >> > + pte_t folio_pte;
> >> >> > +
> >> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> >> > + goto check_pte;
> >> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> >> > + goto check_pte;
> >> >> > +
> >> >> > + folio_ptep = vmf->pte - idx;
> >> >> > + folio_pte = ptep_get(folio_ptep);
> >> >>
> >> >> It's better to construct pte based on fault PTE via generalizing
> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> >> >> inconsistent PTEs quicker.
> >> >
> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> >> > we have to check the real pte value of the 0nd entry right now because
> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> >> > pte argument is the real value for the 0nd pte entry.
> >> >
> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> >> > {
> >> > pte_t expected_pte = pte_next_swp_offset(pte);
> >> > const pte_t *end_ptep = start_ptep + max_nr;
> >> > pte_t *ptep = start_ptep + 1;
> >> >
> >> > VM_WARN_ON(max_nr < 1);
> >> > VM_WARN_ON(!is_swap_pte(pte));
> >> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >> >
> >> > while (ptep < end_ptep) {
> >> > pte = ptep_get(ptep);
> >> >
> >> > if (!pte_same(pte, expected_pte))
> >> > break;
> >> >
> >> > expected_pte = pte_next_swp_offset(expected_pte);
> >> > ptep++;
> >> > }
> >> >
> >> > return ptep - start_ptep;
> >> > }
> >>
> >> Yes. You are right.
> >>
> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> >> folio_page_idx()" (fake code).
> >
> > right, that is why we are reading and checking PTE0 before calling
> > swap_pte_batch()
> > right now.
> >
> > folio_ptep = vmf->pte - idx;
> > folio_pte = ptep_get(folio_ptep);
> > if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > goto check_pte;
> >
> > So, if I understand correctly, you're proposing that we should directly check
> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> > However, I'd also like to hear the feedback from Ryan and David :-)
>
> I mean that we can replace
>
> !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
>
> in above code with pte_same() with constructed expected first pte.

Got it. It could be quite tricky, especially with considerations like
pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
require a helper function similar to pte_next_swp_offset() but capable of
moving both forward and backward. For instance:

pte_move_swp_offset(pte_t pte, long delta)

pte_next_swp_offset() could then simply call it as:
pte_move_swp_offset(pte, 1);
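
A rough sketch of what I have in mind (illustrative only, mirroring the swap
pte bit handling that pte_next_swp_offset() already does):

static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
						   (swp_offset(entry) + delta)));

	/* preserve the software swap pte bits */
	if (pte_swp_soft_dirty(pte))
		new = pte_swp_mksoft_dirty(new);
	if (pte_swp_exclusive(pte))
		new = pte_swp_mkexclusive(new);
	if (pte_swp_uffd_wp(pte))
		new = pte_swp_mkuffd_wp(new);

	return new;
}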

Is that what you are proposing?

>
> >>
> >> You need to check the pte of page 0 anyway.
> >>
> >> >>
> >> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> >> > + goto check_pte;
> >> >> > +
> >> >> > + start_address = folio_start;
> >> >> > + start_pte = folio_ptep;
> >> >> > + nr_pages = nr;
> >> >> > + entry = folio->swap;
> >> >> > + page = &folio->page;
> >> >> > + }
> >> >> > +
> >> >> > +check_pte:
> >> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> > goto out_nomap;
> >> >> >
> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> > */
> >> >> > exclusive = false;
> >> >> > }
> >> >> > +
> >> >> > + /* Reuse the whole large folio iff all entries are exclusive */
> >> >> > + if (nr_pages > 1 && any_swap_shared)
> >> >> > + exclusive = false;
> >> >> > }
> >> >> >
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

2024-04-16 03:19:57

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

Barry Song <[email protected]> writes:

> On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >>
>> >> Added Khalid for arch_do_swap_page().
>> >>
>> >> Barry Song <[email protected]> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>> >> >>
>> >> >> Barry Song <[email protected]> writes:
>> >>
>> >> [snip]
>> >>
>> >> >>
>> >> >> > + bool any_swap_shared = false;
>> >> >> >
>> >> >> > if (!pte_unmap_same(vmf))
>> >> >> > goto out;
>> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >> > */
>> >> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> >> >> > &vmf->ptl);
>> >> >>
>> >> >> We should move pte check here. That is,
>> >> >>
>> >> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> >> goto out_nomap;
>> >> >>
>> >> >> This will simplify the situation for large folio.
>> >> >
>> >> > the plan is moving the whole code block
>> >> >
>> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>> >> >
>> >> > after
>> >> > if (unlikely(!folio_test_uptodate(folio))) {
>> >> > ret = VM_FAULT_SIGBUS;
>> >> > goto out_nomap;
>> >> > }
>> >> >
>> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
>> >> > swapcache but it seems
>> >> > logically better for future use.
>> >>
>> >> LGTM, Thanks!
>> >>
>> >> >>
>> >> >> > +
>> >> >> > + /* We hit large folios in swapcache */
>> >> >>
>> >> >> The comments seems unnecessary because the code tells that already.
>> >> >>
>> >> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> >> >> > + int nr = folio_nr_pages(folio);
>> >> >> > + int idx = folio_page_idx(folio, page);
>> >> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> >> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> >> >> > + pte_t *folio_ptep;
>> >> >> > + pte_t folio_pte;
>> >> >> > +
>> >> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> >> >> > + goto check_pte;
>> >> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> >> >> > + goto check_pte;
>> >> >> > +
>> >> >> > + folio_ptep = vmf->pte - idx;
>> >> >> > + folio_pte = ptep_get(folio_ptep);
>> >> >>
>> >> >> It's better to construct pte based on fault PTE via generalizing
>> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
>> >> >> inconsistent PTEs quicker.
>> >> >
>> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
>> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
>> >> > we have to check the real pte value of the 0nd entry right now because
>> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
>> >> > pte argument is the real value for the 0nd pte entry.
>> >> >
>> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>> >> > {
>> >> > pte_t expected_pte = pte_next_swp_offset(pte);
>> >> > const pte_t *end_ptep = start_ptep + max_nr;
>> >> > pte_t *ptep = start_ptep + 1;
>> >> >
>> >> > VM_WARN_ON(max_nr < 1);
>> >> > VM_WARN_ON(!is_swap_pte(pte));
>> >> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>> >> >
>> >> > while (ptep < end_ptep) {
>> >> > pte = ptep_get(ptep);
>> >> >
>> >> > if (!pte_same(pte, expected_pte))
>> >> > break;
>> >> >
>> >> > expected_pte = pte_next_swp_offset(expected_pte);
>> >> > ptep++;
>> >> > }
>> >> >
>> >> > return ptep - start_ptep;
>> >> > }
>> >>
>> >> Yes. You are right.
>> >>
>> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
>> >> folio_page_idx()" (fake code).
>> >
>> > right, that is why we are reading and checking PTE0 before calling
>> > swap_pte_batch()
>> > right now.
>> >
>> > folio_ptep = vmf->pte - idx;
>> > folio_pte = ptep_get(folio_ptep);
>> > if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> > swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> > goto check_pte;
>> >
>> > So, if I understand correctly, you're proposing that we should directly check
>> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
>> > However, I'd also like to hear the feedback from Ryan and David :-)
>>
>> I mean that we can replace
>>
>> !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
>>
>> in above code with pte_same() with constructed expected first pte.
>
> Got it. It could be quite tricky, especially with considerations like
> pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
> require a helper function similar to pte_next_swp_offset() but capable of
> moving both forward and backward. For instance:
>
> pte_move_swp_offset(pte_t pte, long delta)
>
> pte_next_swp_offset can insteadly call it by:
> pte_move_swp_offset(pte, 1);
>
> Is it what you are proposing?

Yes. Exactly.

--
Best Regards,
Huang, Ying

>>
>> >>
>> >> You need to check the pte of page 0 anyway.
>> >>
>> >> >>
>> >> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> >> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> >> >> > + goto check_pte;
>> >> >> > +
>> >> >> > + start_address = folio_start;
>> >> >> > + start_pte = folio_ptep;
>> >> >> > + nr_pages = nr;
>> >> >> > + entry = folio->swap;
>> >> >> > + page = &folio->page;
>> >> >> > + }
>> >> >> > +
>> >> >> > +check_pte:
>> >> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> >> > goto out_nomap;
>> >> >> >
>> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >> > */
>> >> >> > exclusive = false;
>> >> >> > }
>> >> >> > +
>> >> >> > + /* Reuse the whole large folio iff all entries are exclusive */
>> >> >> > + if (nr_pages > 1 && any_swap_shared)
>> >> >> > + exclusive = false;
>> >> >> > }
>> >> >> >
>>
>> [snip]
>>
>> --
>> Best Regards,
>> Huang, Ying

2024-04-16 04:40:30

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Tue, Apr 16, 2024 at 3:19 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >>
> >> >> Added Khalid for arch_do_swap_page().
> >> >>
> >> >> Barry Song <[email protected]> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
> >> >> >>
> >> >> >> Barry Song <[email protected]> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> >>
> >> >> >> > + bool any_swap_shared = false;
> >> >> >> >
> >> >> >> > if (!pte_unmap_same(vmf))
> >> >> >> > goto out;
> >> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >> > */
> >> >> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> >> >> > &vmf->ptl);
> >> >> >>
> >> >> >> We should move pte check here. That is,
> >> >> >>
> >> >> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> >> goto out_nomap;
> >> >> >>
> >> >> >> This will simplify the situation for large folio.
> >> >> >
> >> >> > the plan is moving the whole code block
> >> >> >
> >> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >> >> >
> >> >> > after
> >> >> > if (unlikely(!folio_test_uptodate(folio))) {
> >> >> > ret = VM_FAULT_SIGBUS;
> >> >> > goto out_nomap;
> >> >> > }
> >> >> >
> >> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> >> >> > swapcache but it seems
> >> >> > logically better for future use.
> >> >>
> >> >> LGTM, Thanks!
> >> >>
> >> >> >>
> >> >> >> > +
> >> >> >> > + /* We hit large folios in swapcache */
> >> >> >>
> >> >> >> The comments seems unnecessary because the code tells that already.
> >> >> >>
> >> >> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> >> >> > + int nr = folio_nr_pages(folio);
> >> >> >> > + int idx = folio_page_idx(folio, page);
> >> >> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> >> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> >> >> > + pte_t *folio_ptep;
> >> >> >> > + pte_t folio_pte;
> >> >> >> > +
> >> >> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> >> >> > + goto check_pte;
> >> >> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> >> >> > + goto check_pte;
> >> >> >> > +
> >> >> >> > + folio_ptep = vmf->pte - idx;
> >> >> >> > + folio_pte = ptep_get(folio_ptep);
> >> >> >>
> >> >> >> It's better to construct pte based on fault PTE via generalizing
> >> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> >> >> >> inconsistent PTEs quicker.
> >> >> >
> >> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> >> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> >> >> > we have to check the real pte value of the 0nd entry right now because
> >> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> >> >> > pte argument is the real value for the 0nd pte entry.
> >> >> >
> >> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> >> >> > {
> >> >> > pte_t expected_pte = pte_next_swp_offset(pte);
> >> >> > const pte_t *end_ptep = start_ptep + max_nr;
> >> >> > pte_t *ptep = start_ptep + 1;
> >> >> >
> >> >> > VM_WARN_ON(max_nr < 1);
> >> >> > VM_WARN_ON(!is_swap_pte(pte));
> >> >> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >> >> >
> >> >> > while (ptep < end_ptep) {
> >> >> > pte = ptep_get(ptep);
> >> >> >
> >> >> > if (!pte_same(pte, expected_pte))
> >> >> > break;
> >> >> >
> >> >> > expected_pte = pte_next_swp_offset(expected_pte);
> >> >> > ptep++;
> >> >> > }
> >> >> >
> >> >> > return ptep - start_ptep;
> >> >> > }
> >> >>
> >> >> Yes. You are right.
> >> >>
> >> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> >> >> folio_page_idx()" (fake code).
> >> >
> >> > right, that is why we are reading and checking PTE0 before calling
> >> > swap_pte_batch()
> >> > right now.
> >> >
> >> > folio_ptep = vmf->pte - idx;
> >> > folio_pte = ptep_get(folio_ptep);
> >> > if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> > swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> > goto check_pte;
> >> >
> >> > So, if I understand correctly, you're proposing that we should directly check
> >> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> >> > However, I'd also like to hear the feedback from Ryan and David :-)
> >>
> >> I mean that we can replace
> >>
> >> !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
> >>
> >> in above code with pte_same() with constructed expected first pte.
> >
> > Got it. It could be quite tricky, especially with considerations like
> > pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
> > require a helper function similar to pte_next_swp_offset() but capable of
> > moving both forward and backward. For instance:
> >
> > pte_move_swp_offset(pte_t pte, long delta)
> >
> > pte_next_swp_offset can insteadly call it by:
> > pte_move_swp_offset(pte, 1);
> >
> > Is it what you are proposing?
>
> Yes. Exactly.

Great. I agree that this appears to be much cleaner than the current code.

>
> --
> Best Regards,
> Huang, Ying
>
> >>
> >> >>
> >> >> You need to check the pte of page 0 anyway.
> >> >>
> >> >> >>
> >> >> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> >> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> >> >> > + goto check_pte;
> >> >> >> > +
> >> >> >> > + start_address = folio_start;
> >> >> >> > + start_pte = folio_ptep;
> >> >> >> > + nr_pages = nr;
> >> >> >> > + entry = folio->swap;
> >> >> >> > + page = &folio->page;
> >> >> >> > + }
> >> >> >> > +
> >> >> >> > +check_pte:
> >> >> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> >> > goto out_nomap;
> >> >> >> >
> >> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >> > */
> >> >> >> > exclusive = false;
> >> >> >> > }
> >> >> >> > +
> >> >> >> > + /* Reuse the whole large folio iff all entries are exclusive */
> >> >> >> > + if (nr_pages > 1 && any_swap_shared)
> >> >> >> > + exclusive = false;
> >> >> >> > }
> >> >> >> >
> >>
> >> [snip]
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying

2024-04-17 00:47:15

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Barry Song <[email protected]> writes:

> From: Barry Song <[email protected]>
>
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/huge_mm.h | 1 +
> mm/huge_memory.c | 2 ++
> mm/memory.c | 1 +
> 3 files changed, 4 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_ALLOC_FALLBACK,
> MTHP_STAT_ANON_SWPOUT,
> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> + MTHP_STAT_ANON_SWPIN_REFAULT,

This is different from the refault concept used in other places in the mm
subsystem. Please check the following code

	if (shadow)
		workingset_refault(folio, shadow);

in __read_swap_cache_async().

> __MTHP_STAT_COUNT
> };

--
Best Regards,
Huang, Ying

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>
> static struct attribute *stats_attrs[] = {
> &anon_alloc_attr.attr,
> &anon_alloc_fallback_attr.attr,
> &anon_swpout_attr.attr,
> &anon_swpout_fallback_attr.attr,
> + &anon_swpin_refault_attr.attr,
> NULL,
> };
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> nr_pages = nr;
> entry = folio->swap;
> page = &folio->page;
> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> }
>
> check_pte:

2024-04-17 01:16:26

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > From: Barry Song <[email protected]>
> >
> > Currently, we are handling the scenario where we've hit a
> > large folio in the swapcache, and the reclaiming process
> > for this large folio is still ongoing.
> >
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/huge_mm.h | 1 +
> > mm/huge_memory.c | 2 ++
> > mm/memory.c | 1 +
> > 3 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index c8256af83e33..b67294d5814f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> > MTHP_STAT_ANON_ALLOC_FALLBACK,
> > MTHP_STAT_ANON_SWPOUT,
> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
> > + MTHP_STAT_ANON_SWPIN_REFAULT,
>
> This is different from the refault concept used in other place in mm
> subystem. Please check the following code
>
> if (shadow)
> workingset_refault(folio, shadow);
>
> in __read_swap_cache_async().

right. it is slightly different, as refault also covers the case where folios
have been entirely released and a new page fault happens soon after.
Do you have a better name for this?
MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
or
MTHP_STAT_ANON_SWPIN_RECLAIMING ?

>
> > __MTHP_STAT_COUNT
> > };
>
> --
> Best Regards,
> Huang, Ying
>
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d8d2ed80b0bf..fb95345b0bde 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >
> > static struct attribute *stats_attrs[] = {
> > &anon_alloc_attr.attr,
> > &anon_alloc_fallback_attr.attr,
> > &anon_swpout_attr.attr,
> > &anon_swpout_fallback_attr.attr,
> > + &anon_swpin_refault_attr.attr,
> > NULL,
> > };
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9818dc1893c8..acc023795a4d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > nr_pages = nr;
> > entry = folio->swap;
> > page = &folio->page;
> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> > }
> >
> > check_pte:

2024-04-17 01:40:39

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Barry Song <[email protected]> writes:

> On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > From: Barry Song <[email protected]>
>> >
>> > Currently, we are handling the scenario where we've hit a
>> > large folio in the swapcache, and the reclaiming process
>> > for this large folio is still ongoing.
>> >
>> > Signed-off-by: Barry Song <[email protected]>
>> > ---
>> > include/linux/huge_mm.h | 1 +
>> > mm/huge_memory.c | 2 ++
>> > mm/memory.c | 1 +
>> > 3 files changed, 4 insertions(+)
>> >
>> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> > index c8256af83e33..b67294d5814f 100644
>> > --- a/include/linux/huge_mm.h
>> > +++ b/include/linux/huge_mm.h
>> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
>> > MTHP_STAT_ANON_ALLOC_FALLBACK,
>> > MTHP_STAT_ANON_SWPOUT,
>> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
>> > + MTHP_STAT_ANON_SWPIN_REFAULT,
>>
>> This is different from the refault concept used in other place in mm
>> subystem. Please check the following code
>>
>> if (shadow)
>> workingset_refault(folio, shadow);
>>
>> in __read_swap_cache_async().
>
> right. it is slightly different as refault can also cover the case folios
> have been entirely released and a new page fault happens soon
> after it.
> Do you have a better name for this?
> MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
> or
> MTHP_STAT_ANON_SWPIN_RECLAIMING ?

TBH, I don't think we need this counter. It's important for you during
implementation. But I don't think it's important for end users.

--
Best Regards,
Huang, Ying

>>
>> > __MTHP_STAT_COUNT
>> > };
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index d8d2ed80b0bf..fb95345b0bde 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
>> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
>> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
>> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>> >
>> > static struct attribute *stats_attrs[] = {
>> > &anon_alloc_attr.attr,
>> > &anon_alloc_fallback_attr.attr,
>> > &anon_swpout_attr.attr,
>> > &anon_swpout_fallback_attr.attr,
>> > + &anon_swpin_refault_attr.attr,
>> > NULL,
>> > };
>> >
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 9818dc1893c8..acc023795a4d 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > nr_pages = nr;
>> > entry = folio->swap;
>> > page = &folio->page;
>> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>> > }
>> >
>> > check_pte:

2024-04-17 01:48:56

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On Wed, Apr 17, 2024 at 1:40 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > From: Barry Song <[email protected]>
> >> >
> >> > Currently, we are handling the scenario where we've hit a
> >> > large folio in the swapcache, and the reclaiming process
> >> > for this large folio is still ongoing.
> >> >
> >> > Signed-off-by: Barry Song <[email protected]>
> >> > ---
> >> > include/linux/huge_mm.h | 1 +
> >> > mm/huge_memory.c | 2 ++
> >> > mm/memory.c | 1 +
> >> > 3 files changed, 4 insertions(+)
> >> >
> >> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> > index c8256af83e33..b67294d5814f 100644
> >> > --- a/include/linux/huge_mm.h
> >> > +++ b/include/linux/huge_mm.h
> >> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> >> > MTHP_STAT_ANON_ALLOC_FALLBACK,
> >> > MTHP_STAT_ANON_SWPOUT,
> >> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
> >> > + MTHP_STAT_ANON_SWPIN_REFAULT,
> >>
> >> This is different from the refault concept used in other place in mm
> >> subystem. Please check the following code
> >>
> >> if (shadow)
> >> workingset_refault(folio, shadow);
> >>
> >> in __read_swap_cache_async().
> >
> > right. it is slightly different as refault can also cover the case folios
> > have been entirely released and a new page fault happens soon
> > after it.
> > Do you have a better name for this?
> > MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
> > or
> > MTHP_STAT_ANON_SWPIN_RECLAIMING ?
>
> TBH, I don't think we need this counter. It's important for you during
> implementation. But I don't think it's important for end users.

Okay. If we can't find a shared interest between the implementer and the
user, I'm perfectly fine with keeping it local, for debugging purposes only.

>
> --
> Best Regards,
> Huang, Ying
>
> >>
> >> > __MTHP_STAT_COUNT
> >> > };
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
> >>
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> > index d8d2ed80b0bf..fb95345b0bde 100644
> >> > --- a/mm/huge_memory.c
> >> > +++ b/mm/huge_memory.c
> >> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> >> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> >> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> >> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> >> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >> >
> >> > static struct attribute *stats_attrs[] = {
> >> > &anon_alloc_attr.attr,
> >> > &anon_alloc_fallback_attr.attr,
> >> > &anon_swpout_attr.attr,
> >> > &anon_swpout_fallback_attr.attr,
> >> > + &anon_swpin_refault_attr.attr,
> >> > NULL,
> >> > };
> >> >
> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > index 9818dc1893c8..acc023795a4d 100644
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > nr_pages = nr;
> >> > entry = folio->swap;
> >> > page = &folio->page;
> >> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >> > }
> >> >
> >> > check_pte:

2024-04-18 09:55:37

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

[snip]

> > >> >
> > >> > VM_BUG_ON(!folio_test_anon(folio) ||
> > >> > (pte_write(pte) && !PageAnonExclusive(page)));
> > >> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > >> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > >> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > >> > + vmf->orig_pte = ptep_get(vmf->pte);
> > >> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> > >>
> > >> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
> > >> corresponding arch_unmap_one() will be called for each subpage.
> > >
> > > i actually thought about this very carefully, right now, the only one who
> > > needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> > > there is no proof doing restoration one by one won't really break sparc.
> > > so i'd like to defer this to when sparc really needs THP_SWAPOUT.
> >
> > Let's ask SPARC developer (Cced) for this.
> >
> > IMHO, even if we cannot get help, we need to change code with our
> > understanding instead of deferring it.
>
> ok. Thanks for Ccing sparc developers.

Hi Khalid & Ying (also Cced sparc maillist),

SPARC is the only platform which needs arch_do_swap_page(), and right now
its THP_SWAPOUT is not enabled, so we will not really hit a large folio
in the swapcache there. Just in case you might need THP_SWAPOUT later, i am
changing the code as below,

@@ -4286,7 +4285,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	VM_BUG_ON(!folio_test_anon(folio) ||
 		  (pte_write(pte) && !PageAnonExclusive(page)));
 	set_ptes(vma->vm_mm, start_address, start_ptep, pte, nr_pages);
-	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
+	for (int i = 0; i < nr_pages; i++) {
+		arch_do_swap_page(vma->vm_mm, vma,
+				  start_address + i * PAGE_SIZE, pte, pte);
+		pte = pte_advance_pfn(pte, 1);
+	}
 
 	folio_unlock(folio);
 	if (folio != swapcache && swapcache) {

for sparc, nr_pages will always be 1 (THP_SWAPOUT is not enabled). for
arm64/x86/riscv, it seems redundant to run a "for (int i = 0; i < nr_pages; i++)"
loop.

so another option is adding a helper as below to avoid the idle loop
on arm64/x86/riscv etc.

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e2f45e22a6d1..ea314a5f9b5e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1085,6 +1085,28 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
 {
 
 }
+
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long addr,
+					pte_t pte, pte_t oldpte,
+					int nr)
+{
+
+}
+#else
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long addr,
+					pte_t pte, pte_t oldpte,
+					int nr)
+{
+	for (int i = 0; i < nr; i++) {
+		arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
+				  pte_advance_pfn(pte, i),
+				  pte_advance_pfn(oldpte, i));
+	}
+}
 #endif
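
With such a helper, the call site in do_swap_page() could then simply be
(sketch only, assuming the helper above):

	set_ptes(vma->vm_mm, start_address, start_ptep, pte, nr_pages);
	arch_do_swap_page_nr(vma->vm_mm, vma, start_address, pte, pte, nr_pages);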

Please tell me your preference.

BTW, i found that oldpte and pte are always the same in do_swap_page(); is
something wrong there? does arch_do_swap_page() really need two identical
arguments?


	vmf->orig_pte = pte;
	..
	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);


>
> >
> > > on the other hand, it seems really bad we have both
> > > arch_swap_restore - for this, arm64 has moved to using folio
> > > and
> > > arch_do_swap_page
> > >
> > > we should somehow unify them later if sparc wants THP_SWPOUT.

Thanks
Barry