2023-07-21 09:47:06

by Yin, Fengwei

Subject: [RFC PATCH v2 0/4] fix large folio for madvise_cold_or_pageout()

The current madvise_cold_or_pageout_pte_range() has two problems when
dealing with large folios:
- Using folio_mapcount() on a large folio prevents the large folio
from being picked up.
- It always tries to split a large folio into normal 4K pages.

Address these two problems by:
- Using folio_estimated_sharers() for large folios, on the assumption
that an estimate of whether the large folio is shared is good enough
here.

- Not splitting a large folio that lies entirely within the range.
Leave it to page reclaim, which may support swapping large folios out
as a whole in the future.

- Only splitting a large folio if it crosses the boundaries of the
range. If splitting fails, just skip the folio, since madvise allows
some pages in the range to be ignored.
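
A minimal sketch of the resulting per-folio handling in the pte loop
(helper names are taken from the later patches in this series; this is
illustrative only, not the exact patch code):

	if (folio_test_large(folio)) {
		if (folio_estimated_sharers(folio) != 1)
			continue;			/* likely shared: leave it alone */
		if (folio_in_range(folio, vma, start, end))
			goto pageout_cold_folio;	/* fully in range: do not split */
		if (!folio_test_anon(folio))
			continue;			/* page cache crossing the range: skip */
		if (split_folio(folio))
			continue;			/* split failed: skip */
	}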

Patch1 replaces folio_mapcount() with folio_estimated_sharers().
Patch2 uses the API pmdp_clear_flush_young_notify() to clear the A bit
in the page table. It also notifies mmu notifier subscribers about the
A bit clearing.
Patch3 introduces helper functions to check whether the folio crosses
the range boundary.
Patch4 avoids splitting a large folio if the folio lies within the range.

Changes from V1:
- Split patch1 out per Yu's suggestion
- Split patch2 out per Yu's suggestion
- Handle the cold case correctly (the cold operation was broken in the V1 patch)
- Rebase the patchset onto the latest mm-unstable

Testing done:
- mm selftests pass with no new regressions.

V1's link:
https://lore.kernel.org/linux-mm/[email protected]/

Yin Fengwei (4):
madvise: not use mapcount() against large folio for sharing check
madvise: Use notify-able API to clear and flush page table entries
mm: add functions folio_in_range() and folio_within_vma()
madvise: avoid trying to split large folio always in cold_pageout

mm/internal.h | 42 ++++++++++++++++
mm/madvise.c | 136 +++++++++++++++++++++++++++++++-------------------
2 files changed, 127 insertions(+), 51 deletions(-)

--
2.39.2



2023-07-21 09:47:57

by Yin, Fengwei

Subject: [RFC PATCH v2 1/4] madvise: not use mapcount() against large folio for sharing check

Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to
use folios") replaced page_mapcount() with folio_mapcount() to check
whether the folio is shared by another mapping.

But that is not correct for a large folio: folio_mapcount() returns the
total mapcount of the large folio, which is not suitable for detecting
whether the folio is shared.

Use folio_estimated_sharers(), which returns an estimated number of
sharers. That means it is not 100% accurate, but it should be good
enough for the madvise case here.
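
For reference, folio_estimated_sharers() (in mm/internal.h at the time of
this series) is roughly the following; it samples the mapcount of the
folio's first subpage, so it gives an estimate rather than an exact
sharer count:

static inline int folio_estimated_sharers(struct folio *folio)
{
	/* mapcount of the first subpage as a proxy for the whole folio */
	return page_mapcount(folio_page(folio, 0));
}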

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/madvise.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 38382a5d1e39..f12933ebcc24 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -383,7 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));

/* Do not interfere with other mappings of this folio */
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto huge_unlock;

if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -459,7 +459,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;

- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
@@ -682,7 +682,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;

- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (!folio_trylock(folio))
break;
--
2.39.2


2023-07-21 09:49:13

by Yin, Fengwei

Subject: [RFC PATCH v2 3/4] mm: add functions folio_in_range() and folio_within_vma()

folio_in_range() will be used to check whether a folio is mapped to a
specific VMA and whether the folio's mapping address falls within the
range.

Also add a helper function, folio_within_vma(), based on
folio_in_range(), to check whether a folio lies entirely within a VMA.

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/internal.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index 483add0bfb28..c7dd15d8de3e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -585,6 +585,38 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
bool write, int *locked);
extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
unsigned long bytes);
+
+static inline bool
+folio_in_range(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ pgoff_t pgoff, addr;
+ unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+ VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
+ if (start < vma->vm_start)
+ start = vma->vm_start;
+
+ if (end > vma->vm_end)
+ end = vma->vm_end;
+
+ pgoff = folio_pgoff(folio);
+
+ /* if folio start address is not in vma range */
+ if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
+ return false;
+
+ addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+
+ return ((addr >= start) && (addr + folio_size(folio) <= end));
+}
+
+static inline bool
+folio_within_vma(struct folio *folio, struct vm_area_struct *vma)
+{
+ return folio_in_range(folio, vma, vma->vm_start, vma->vm_end);
+}
+
/*
* mlock_vma_folio() and munlock_vma_folio():
* should be called with vma's mmap_lock held for read or write,
--
2.39.2


2023-07-21 10:07:44

by Yin, Fengwei

Subject: [RFC PATCH v2 4/4] madvise: avoid trying to split large folio always in cold_pageout

Currently, madvise_cold_or_pageout_pte_range() always tries to split
large folios.

Avoid always splitting large folios by:
- If the large folio is within the requested range, don't split it.
Leave it to page reclaim to decide whether the large folio needs to
be split.
- If the large folio crosses the boundaries of the requested range,
skip it if it is page cache. If it is an anonymous large folio, try
to split it; if splitting fails, just skip it.

Invoke folio_referenced() to clear the A bit for large folios. Since it
acquires the pte lock, do it only after the pte lock has been released.
Signed-off-by: Yin Fengwei <[email protected]>
---
mm/internal.h | 10 +++++
mm/madvise.c | 118 +++++++++++++++++++++++++++++++++++---------------
2 files changed, 93 insertions(+), 35 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c7dd15d8de3e..cd1ff348d690 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -586,6 +586,16 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
unsigned long bytes);

+static inline unsigned int
+folio_op_size(struct folio *folio, pte_t pte,
+ unsigned long addr, unsigned long end)
+{
+ unsigned int nr;
+
+ nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
+ return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
+}
+
static inline bool
folio_in_range(struct folio *folio, struct vm_area_struct *vma,
unsigned long start, unsigned long end)
diff --git a/mm/madvise.c b/mm/madvise.c
index b236e201a738..71af370c3251 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -339,6 +339,23 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
file_permission(vma->vm_file, MAY_WRITE) == 0;
}

+static inline bool skip_cur_entry(struct folio *folio, bool pageout_anon_only)
+{
+ if (!folio)
+ return true;
+
+ if (folio_is_zone_device(folio))
+ return true;
+
+ if (!folio_test_lru(folio))
+ return true;
+
+ if (pageout_anon_only && !folio_test_anon(folio))
+ return true;
+
+ return false;
+}
+
static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
@@ -352,7 +369,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
spinlock_t *ptl;
struct folio *folio = NULL;
LIST_HEAD(folio_list);
+ LIST_HEAD(reclaim_list);
bool pageout_anon_only_filter;
+ unsigned long start = addr;

if (fatal_signal_pending(current))
return -EINTR;
@@ -442,54 +461,90 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
continue;

folio = vm_normal_folio(vma, addr, ptent);
- if (!folio || folio_is_zone_device(folio))
+ if (skip_cur_entry(folio, pageout_anon_only_filter))
continue;

/*
- * Creating a THP page is expensive so split it only if we
- * are sure it's worth. Split it if we are only owner.
+ * Split large folio only if it's anonymous, crosses the
+ * boundaries of the request range and we are likely the
+ * only owner.
*/
if (folio_test_large(folio)) {
- int err;
+ int err, step;

if (folio_estimated_sharers(folio) != 1)
- break;
- if (pageout_anon_only_filter && !folio_test_anon(folio))
- break;
- if (!folio_trylock(folio))
- break;
+ continue;
+ if (folio_in_range(folio, vma, start, end))
+ goto pageout_cold_folio;
+ if (!folio_test_anon(folio) || !folio_trylock(folio))
+ continue;
+
folio_get(folio);
+ step = folio_op_size(folio, ptent, addr, end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(start_pte, ptl);
start_pte = NULL;
err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- if (err)
- break;
+
start_pte = pte =
pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!start_pte)
break;
arch_enter_lazy_mmu_mode();
- pte--;
- addr -= PAGE_SIZE;
- continue;
- }

- /*
- * Do not interfere with other mappings of this folio and
- * non-LRU folio.
- */
- if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
+ /* split success. retry the same entry */
+ if (!err)
+ step = 0;
+
+ /*
+ * Split failed; jump over the whole folio to avoid
+ * grabbing the same folio and failing to split it
+ * again and again.
+ */
+ pte += step - 1;
+ addr += (step - 1) << PAGE_SHIFT;
continue;
+ }

- if (pageout_anon_only_filter && !folio_test_anon(folio))
+ /* Do not interfere with other mappings of this folio */
+ if (folio_mapcount(folio) != 1)
continue;

VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
ptep_clear_flush_young_notify(vma, addr, pte);
+
+pageout_cold_folio:
+ if (folio_isolate_lru(folio)) {
+ if (folio_test_unevictable(folio))
+ folio_putback_lru(folio);
+ else
+ list_add(&folio->lru, &folio_list);
+ }
+ }
+
+ if (start_pte) {
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ }
+
+ while (!list_empty(&folio_list)) {
+ folio = lru_to_folio(&folio_list);
+ list_del(&folio->lru);
+
+ if (folio_test_large(folio)) {
+ int refs;
+ unsigned long flags;
+ struct mem_cgroup *memcg = folio_memcg(folio);
+
+ refs = folio_referenced(folio, 0, memcg, &flags);
+ if ((flags & VM_LOCKED) || (refs == -1)) {
+ folio_putback_lru(folio);
+ continue;
+ }
+ }
+
/*
* We are deactivating a folio for accelerating reclaiming.
* VM couldn't reclaim the folio unless we clear PG_young.
@@ -501,22 +556,15 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_active(folio))
folio_set_workingset(folio);
if (pageout) {
- if (folio_isolate_lru(folio)) {
- if (folio_test_unevictable(folio))
- folio_putback_lru(folio);
- else
- list_add(&folio->lru, &folio_list);
- }
- } else
- folio_deactivate(folio);
+ list_add(&folio->lru, &reclaim_list);
+ } else {
+ folio_clear_active(folio);
+ folio_putback_lru(folio);
+ }
}

- if (start_pte) {
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(start_pte, ptl);
- }
if (pageout)
- reclaim_pages(&folio_list);
+ reclaim_pages(&reclaim_list);
cond_resched();

return 0;
--
2.39.2


2023-07-21 11:13:42

by Yin, Fengwei

Subject: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries

Currently, in madvise_cold_or_pageout_pte_range(), the young bit of
the pte/pmd is cleared without notifying mmu notifier subscribers.

Use the notify-able API to make sure subscribers are signaled about
the young bit clearing.
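
For context, the _notify variants wrap the plain page table helpers and
additionally call into the mmu notifier. ptep_clear_flush_young_notify(vma,
address, ptep) effectively does roughly the following (simplified from
include/linux/mmu_notifier.h):

	young  = ptep_clear_flush_young(vma, address, ptep);
	young |= mmu_notifier_clear_flush_young(vma->vm_mm, address,
						address + PAGE_SIZE);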

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/madvise.c | 18 ++----------------
1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index f12933ebcc24..b236e201a738 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
return 0;
}

- if (pmd_young(orig_pmd)) {
- pmdp_invalidate(vma, addr, pmd);
- orig_pmd = pmd_mkold(orig_pmd);
-
- set_pmd_at(mm, addr, pmd, orig_pmd);
- tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- }
-
+ pmdp_clear_flush_young_notify(vma, addr, pmd);
folio_clear_referenced(folio);
folio_test_clear_young(folio);
if (folio_test_active(folio))
@@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,

VM_BUG_ON_FOLIO(folio_test_large(folio), folio);

- if (pte_young(ptent)) {
- ptent = ptep_get_and_clear_full(mm, addr, pte,
- tlb->fullmm);
- ptent = pte_mkold(ptent);
- set_pte_at(mm, addr, pte, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- }
-
+ ptep_clear_flush_young_notify(vma, addr, pte);
/*
* We are deactivating a folio for accelerating reclaiming.
* VM couldn't reclaim the folio unless we clear PG_young.
--
2.39.2


2023-07-21 20:06:25

by Yu Zhao

Subject: Re: [RFC PATCH v2 1/4] madvise: not use mapcount() against large folio for sharing check

On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>
> The commit
> 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to
> use folios") replaced the page_mapcount() with folio_mapcount() to
> check whether the folio is shared by other mapping.
>
> But it's not correct for large folio. folio_mapcount() returns the
> total mapcount of large folio which is not suitable to detect whether
> the folio is shared.
>
> Use folio_estimated_sharers() which returns a estimated number of
> shares. That means it's not 100% correct. But it should be OK for
> madvise case here.
>
> Signed-off-by: Yin Fengwei <[email protected]>

Fixes:
Cc: stable

> @@ -383,7 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> folio = pfn_folio(pmd_pfn(orig_pmd));
>
> /* Do not interfere with other mappings of this folio */
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)

Strictly speaking, this isn't a bug. But it may be ok to include it in
the same patch.

> goto huge_unlock;
>
> if (pageout_anon_only_filter && !folio_test_anon(folio))
> @@ -459,7 +459,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (folio_test_large(folio)) {
> int err;
>
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)
> break;
> if (pageout_anon_only_filter && !folio_test_anon(folio))
> break;
> @@ -682,7 +682,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,

What about madvise_free_huge_pmd()? Should it be changed as well so
that it's consistent with the first change? Either change both or neither.

> if (folio_test_large(folio)) {
> int err;
>
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)

This is another bug fix and should be in a separate patch.

> break;
> if (!folio_trylock(folio))
> break;

Please send two separate fixes, and then:

Reviewed-by: Yu Zhao <[email protected]>

2023-07-23 13:23:00

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 1/4] madvise: not use mapcount() against large folio for sharing check



On 7/22/2023 2:57 AM, Yu Zhao wrote:
> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>>
>> The commit
>> 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to
>> use folios") replaced the page_mapcount() with folio_mapcount() to
>> check whether the folio is shared by other mapping.
>>
>> But it's not correct for large folio. folio_mapcount() returns the
>> total mapcount of large folio which is not suitable to detect whether
>> the folio is shared.
>>
>> Use folio_estimated_sharers() which returns a estimated number of
>> shares. That means it's not 100% correct. But it should be OK for
>> madvise case here.
>>
>> Signed-off-by: Yin Fengwei <[email protected]>
>
> Fixes:
> Cc: stable
OK

>
>> @@ -383,7 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> folio = pfn_folio(pmd_pfn(orig_pmd));
>>
>> /* Do not interfere with other mappings of this folio */
>> - if (folio_mapcount(folio) != 1)
>> + if (folio_estimated_sharers(folio) != 1)
>
> Strictly speaking, this isn't a bug. But it may be ok to include it in
> the same patch.
OK. I will drop the change for pmd.

>
>> goto huge_unlock;
>>
>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>> @@ -459,7 +459,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> if (folio_test_large(folio)) {
>> int err;
>>
>> - if (folio_mapcount(folio) != 1)
>> + if (folio_estimated_sharers(folio) != 1)
>> break;
>> if (pageout_anon_only_filter && !folio_test_anon(folio))
>> break;
>> @@ -682,7 +682,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>
> What about madvise_free_huge_pmd()? Should it be changed as well so
> that it's consistent with the first change? Either change both or neither.
>
>> if (folio_test_large(folio)) {
>> int err;
>>
>> - if (folio_mapcount(folio) != 1)
>> + if (folio_estimated_sharers(folio) != 1)
>
> This is another bug fix and should be in a separate patch.
OK. Will split it into two patches.

>
>> break;
>> if (!folio_trylock(folio))
>> break;
>
> Please send two separate fixes, and then:
>
> Reviewed-by: Yu Zhao <[email protected]>
Thanks a lot. I will drop the mapcount() change for pmd and send two patches
for madvise_cold_or_pageout_pte_range() and madvise_free_pte_range().


Regards
Yin, Fengwei

2023-07-25 05:44:07

by Yu Zhao

Subject: Re: [RFC PATCH v2 1/4] madvise: not use mapcount() against large folio for sharing check

On Sun, Jul 23, 2023 at 6:27 AM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/22/2023 2:57 AM, Yu Zhao wrote:
> > On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
> >>
> >> The commit
> >> 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to
> >> use folios") replaced the page_mapcount() with folio_mapcount() to
> >> check whether the folio is shared by other mapping.
> >>
> >> But it's not correct for large folio. folio_mapcount() returns the
> >> total mapcount of large folio which is not suitable to detect whether
> >> the folio is shared.
> >>
> >> Use folio_estimated_sharers() which returns a estimated number of
> >> shares. That means it's not 100% correct. But it should be OK for
> >> madvise case here.
> >>
> >> Signed-off-by: Yin Fengwei <[email protected]>
> >
> > Fixes:
> > Cc: stable
> OK
>
> >
> >> @@ -383,7 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> folio = pfn_folio(pmd_pfn(orig_pmd));
> >>
> >> /* Do not interfere with other mappings of this folio */
> >> - if (folio_mapcount(folio) != 1)
> >> + if (folio_estimated_sharers(folio) != 1)
> >
> > Strictly speaking, this isn't a bug. But it may be ok to include it in
> > the same patch.
> OK. I will drop the change for pmd.
>
> >
> >> goto huge_unlock;
> >>
> >> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> @@ -459,7 +459,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> if (folio_test_large(folio)) {
> >> int err;
> >>
> >> - if (folio_mapcount(folio) != 1)
> >> + if (folio_estimated_sharers(folio) != 1)
> >> break;
> >> if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> break;
> >> @@ -682,7 +682,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >
> > What about madvise_free_huge_pmd()? Should it be changed as well so
> > that it's consistent with the first change? Either change both or neither.
> >
> >> if (folio_test_large(folio)) {
> >> int err;
> >>
> >> - if (folio_mapcount(folio) != 1)
> >> + if (folio_estimated_sharers(folio) != 1)
> >
> > This is another bug fix and should be in a separate patch.
> OK. Will split to two patches.
>
> >
> >> break;
> >> if (!folio_trylock(folio))
> >> break;
> >
> > Please send two separate fixes, and then:
> >
> > Reviewed-by: Yu Zhao <[email protected]>
> Thanks a lot. I will drop the mapcount() change for pmd and sent to patches
> for madvise_cold_or_pageout_pte_range() and madvise_free_pte_range().

I don't mind including the PMD changes. Either way works for me :)

2023-07-25 06:11:02

by Yu Zhao

Subject: Re: [RFC PATCH v2 4/4] madvise: avoid trying to split large folio always in cold_pageout

On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>
> Current madvise_cold_or_pageout_pte_range() always tries to split
> large folio.
>
> Avoid trying to split large folio always by:
> - if large folio is in the request range, don't split it. Leave
> to page reclaim to decide whether the large folio needs be
> split.
> - if large folio crosses boundaries of request range, skip it if
> it's page cache. Try to split it if it's anonymous large folio.
> If failed to split it, just skip it.
>
> Invoke folio_referenced() to clear the A bit for large folio. As it
> will acquire pte lock, just do it after release pte lock.
>
> Signed-off-by: Yin Fengwei <[email protected]>

I currently don't have the bandwidth to look at this one. Minchan, can
you please take a look? Thanks.

2023-07-25 06:34:18

by Yu Zhao

Subject: Re: [RFC PATCH v2 3/4] mm: add functions folio_in_range() and folio_within_vma()

On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>
> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
>
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> mm/internal.h | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 483add0bfb28..c7dd15d8de3e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -585,6 +585,38 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
> bool write, int *locked);
> extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
> unsigned long bytes);
> +
> +static inline bool
> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + pgoff_t pgoff, addr;
> + unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> + VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
> + if (start < vma->vm_start)
> + start = vma->vm_start;
> +
> + if (end > vma->vm_end)
> + end = vma->vm_end;
> +
> + pgoff = folio_pgoff(folio);
> +
> + /* if folio start address is not in vma range */
> + if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
> + return false;
> +
> + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +
> + return ((addr >= start) && (addr + folio_size(folio) <= end));

Not sure how much we care but better to avoid (addr + folio_size()),
since it might wrap to 0 on 32-bit systems.

Reusing some helpers from mm/internal.h, e.g., vma_pgoff_address(),
would be great, if that is possible (I'm not sure it is).
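
One possible way to avoid the wrap, as a sketch (not necessarily the final
form), is to compare the remaining length of the range instead of the
folio's end address:

	/* instead of: addr >= start && addr + folio_size(folio) <= end */
	if (addr < start || addr > end)
		return false;
	return end - addr >= folio_size(folio);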

2023-07-25 06:36:54

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries

On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>
> Currently, in function madvise_cold_or_pageout_pte_range(), the
> young bit of pte/pmd is cleared notify subscripter.
>
> Using notify-able API to make sure the subscripter is signaled about
> the young bit clearing.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> mm/madvise.c | 18 ++----------------
> 1 file changed, 2 insertions(+), 16 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f12933ebcc24..b236e201a738 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> return 0;
> }
>
> - if (pmd_young(orig_pmd)) {
> - pmdp_invalidate(vma, addr, pmd);
> - orig_pmd = pmd_mkold(orig_pmd);
> -
> - set_pmd_at(mm, addr, pmd, orig_pmd);
> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> - }
> -
> + pmdp_clear_flush_young_notify(vma, addr, pmd);
> folio_clear_referenced(folio);
> folio_test_clear_young(folio);
> if (folio_test_active(folio))
> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>
> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>
> - if (pte_young(ptent)) {
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> - ptent = pte_mkold(ptent);
> - set_pte_at(mm, addr, pte, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> - }
> -
> + ptep_clear_flush_young_notify(vma, addr, pte);

These two places are tricky.

I agree there is a problem here, i.e., we are not consulting the mmu
notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
known problem to me for a while (not a high priority one).

tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
not. But, on x86, we might see a performance improvement since
ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
be regressions though.

I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
prefers flush. So I'll let him chime in.

If we do end up with ptep_clear_young_notify(), please remove
mmu_gather -- it should have been done in this patch.

2023-07-26 03:40:33

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries



On 7/25/23 13:55, Yu Zhao wrote:
> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>>
>> Currently, in function madvise_cold_or_pageout_pte_range(), the
>> young bit of pte/pmd is cleared notify subscripter.
>>
>> Using notify-able API to make sure the subscripter is signaled about
>> the young bit clearing.
>>
>> Signed-off-by: Yin Fengwei <[email protected]>
>> ---
>> mm/madvise.c | 18 ++----------------
>> 1 file changed, 2 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index f12933ebcc24..b236e201a738 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> return 0;
>> }
>>
>> - if (pmd_young(orig_pmd)) {
>> - pmdp_invalidate(vma, addr, pmd);
>> - orig_pmd = pmd_mkold(orig_pmd);
>> -
>> - set_pmd_at(mm, addr, pmd, orig_pmd);
>> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>> - }
>> -
>> + pmdp_clear_flush_young_notify(vma, addr, pmd);
>> folio_clear_referenced(folio);
>> folio_test_clear_young(folio);
>> if (folio_test_active(folio))
>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>
>> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>
>> - if (pte_young(ptent)) {
>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>> - tlb->fullmm);
>> - ptent = pte_mkold(ptent);
>> - set_pte_at(mm, addr, pte, ptent);
>> - tlb_remove_tlb_entry(tlb, pte, addr);
>> - }
>> -
>> + ptep_clear_flush_young_notify(vma, addr, pte);
>
> These two places are tricky.
>
> I agree there is a problem here, i.e., we are not consulting the mmu
> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
> known problem to me for a while (not a high priority one).
>
> tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
> not. But, on x86, we might see a performance improvement since
> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
> be regressions though.
>
> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
> prefers flush. So I'll let him chime in.
I am OK with either way, even though the no-flush way here is more
efficient for arm64. Let's wait for Minchan's comment.

>
> If we do end up with ptep_clear_young_notify(), please remove
> mmu_gather -- it should have been done in this patch.

I suppose "remove mmu_gather" means triggering the TLB flush operation
in a batched way to make sure no stale data stays in the TLB for a long
time on the arm64 platform.


Regards
Yin, Fengwei

2023-07-26 03:59:06

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries

On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <[email protected]> wrote:
>
>
> On 7/25/23 13:55, Yu Zhao wrote:
> > On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
> >>
> >> Currently, in function madvise_cold_or_pageout_pte_range(), the
> >> young bit of pte/pmd is cleared notify subscripter.
> >>
> >> Using notify-able API to make sure the subscripter is signaled about
> >> the young bit clearing.
> >>
> >> Signed-off-by: Yin Fengwei <[email protected]>
> >> ---
> >> mm/madvise.c | 18 ++----------------
> >> 1 file changed, 2 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index f12933ebcc24..b236e201a738 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> return 0;
> >> }
> >>
> >> - if (pmd_young(orig_pmd)) {
> >> - pmdp_invalidate(vma, addr, pmd);
> >> - orig_pmd = pmd_mkold(orig_pmd);
> >> -
> >> - set_pmd_at(mm, addr, pmd, orig_pmd);
> >> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> >> - }
> >> -
> >> + pmdp_clear_flush_young_notify(vma, addr, pmd);
> >> folio_clear_referenced(folio);
> >> folio_test_clear_young(folio);
> >> if (folio_test_active(folio))
> >> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>
> >> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>
> >> - if (pte_young(ptent)) {
> >> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> - tlb->fullmm);
> >> - ptent = pte_mkold(ptent);
> >> - set_pte_at(mm, addr, pte, ptent);
> >> - tlb_remove_tlb_entry(tlb, pte, addr);
> >> - }
> >> -
> >> + ptep_clear_flush_young_notify(vma, addr, pte);
> >
> > These two places are tricky.
> >
> > I agree there is a problem here, i.e., we are not consulting the mmu
> > notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
> > known problem to me for a while (not a high priority one).
> >
> > tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
> > not. But, on x86, we might see a performance improvement since
> > ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
> > be regressions though.
> >
> > I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
> > prefers flush. So I'll let him chime in.
> I am OK with either way even no flush way here is more efficient for
> arm64. Let's wait for Minchan's comment.

Yes, and I don't think there would be any "negative" consequences
without tlb flushes when clearing the A-bit.

> > If we do end up with ptep_clear_young_notify(), please remove
> > mmu_gather -- it should have been done in this patch.
>
> I suppose "remove mmu_gather" means to trigger flush tlb operation in
> batched way to make sure no stale data in TLB for long time on arm64
> platform.

In madvise_cold_or_pageout_pte_range(), we only need struct
mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
the tlb after clearing the A-bit. There is no correctness issue, e.g.,
potential data corruption, involved there.

2023-07-26 05:18:28

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries



On 7/26/23 11:26, Yu Zhao wrote:
> On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>> On 7/25/23 13:55, Yu Zhao wrote:
>>> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>>>>
>>>> Currently, in function madvise_cold_or_pageout_pte_range(), the
>>>> young bit of pte/pmd is cleared notify subscripter.
>>>>
>>>> Using notify-able API to make sure the subscripter is signaled about
>>>> the young bit clearing.
>>>>
>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>> ---
>>>> mm/madvise.c | 18 ++----------------
>>>> 1 file changed, 2 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index f12933ebcc24..b236e201a738 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> return 0;
>>>> }
>>>>
>>>> - if (pmd_young(orig_pmd)) {
>>>> - pmdp_invalidate(vma, addr, pmd);
>>>> - orig_pmd = pmd_mkold(orig_pmd);
>>>> -
>>>> - set_pmd_at(mm, addr, pmd, orig_pmd);
>>>> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>>>> - }
>>>> -
>>>> + pmdp_clear_flush_young_notify(vma, addr, pmd);
>>>> folio_clear_referenced(folio);
>>>> folio_test_clear_young(folio);
>>>> if (folio_test_active(folio))
>>>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>
>>>> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>
>>>> - if (pte_young(ptent)) {
>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> - tlb->fullmm);
>>>> - ptent = pte_mkold(ptent);
>>>> - set_pte_at(mm, addr, pte, ptent);
>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>> - }
>>>> -
>>>> + ptep_clear_flush_young_notify(vma, addr, pte);
>>>
>>> These two places are tricky.
>>>
>>> I agree there is a problem here, i.e., we are not consulting the mmu
>>> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
>>> known problem to me for a while (not a high priority one).
>>>
>>> tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
>>> not. But, on x86, we might see a performance improvement since
>>> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
>>> be regressions though.
>>>
>>> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
>>> prefers flush. So I'll let him chime in.
>> I am OK with either way even no flush way here is more efficient for
>> arm64. Let's wait for Minchan's comment.
>
> Yes, and I don't think there would be any "negative" consequences
> without tlb flushes when clearing the A-bit.
>
>>> If we do end up with ptep_clear_young_notify(), please remove
>>> mmu_gather -- it should have been done in this patch.
>>
>> I suppose "remove mmu_gather" means to trigger flush tlb operation in
>> batched way to make sure no stale data in TLB for long time on arm64
>> platform.
>
> In madvise_cold_or_pageout_pte_range(), we only need struct
> mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
> tlb after clearing the A-bit. There is no correction, e.g., potential
> data corruption, involved there.

From https://lore.kernel.org/lkml/[email protected]/,
the reason arm64 didn't drop the whole TLB flush in ptep_clear_flush_young()
is to prevent stale data in the TLB. I suppose there is no correctness issue
there either.

So why is it fine to keep stale data in the TLB in madvise_cold_or_pageout_pte_range()?


Regards
Yin, Fengwei

2023-07-26 06:59:25

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries



On 7/26/23 13:40, Yu Zhao wrote:
> On Tue, Jul 25, 2023 at 10:44 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/26/23 11:26, Yu Zhao wrote:
>>> On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>> On 7/25/23 13:55, Yu Zhao wrote:
>>>>> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>>>>>>
>>>>>> Currently, in function madvise_cold_or_pageout_pte_range(), the
>>>>>> young bit of pte/pmd is cleared notify subscripter.
>>>>>>
>>>>>> Using notify-able API to make sure the subscripter is signaled about
>>>>>> the young bit clearing.
>>>>>>
>>>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>>>> ---
>>>>>> mm/madvise.c | 18 ++----------------
>>>>>> 1 file changed, 2 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>> index f12933ebcc24..b236e201a738 100644
>>>>>> --- a/mm/madvise.c
>>>>>> +++ b/mm/madvise.c
>>>>>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> - if (pmd_young(orig_pmd)) {
>>>>>> - pmdp_invalidate(vma, addr, pmd);
>>>>>> - orig_pmd = pmd_mkold(orig_pmd);
>>>>>> -
>>>>>> - set_pmd_at(mm, addr, pmd, orig_pmd);
>>>>>> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>>>>>> - }
>>>>>> -
>>>>>> + pmdp_clear_flush_young_notify(vma, addr, pmd);
>>>>>> folio_clear_referenced(folio);
>>>>>> folio_test_clear_young(folio);
>>>>>> if (folio_test_active(folio))
>>>>>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>
>>>>>> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>
>>>>>> - if (pte_young(ptent)) {
>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>> - tlb->fullmm);
>>>>>> - ptent = pte_mkold(ptent);
>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> - }
>>>>>> -
>>>>>> + ptep_clear_flush_young_notify(vma, addr, pte);
>>>>>
>>>>> These two places are tricky.
>>>>>
>>>>> I agree there is a problem here, i.e., we are not consulting the mmu
>>>>> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
>>>>> known problem to me for a while (not a high priority one).
>>>>>
>>>>> tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
>>>>> not. But, on x86, we might see a performance improvement since
>>>>> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
>>>>> be regressions though.
>>>>>
>>>>> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
>>>>> prefers flush. So I'll let him chime in.
>>>> I am OK with either way even no flush way here is more efficient for
>>>> arm64. Let's wait for Minchan's comment.
>>>
>>> Yes, and I don't think there would be any "negative" consequences
>>> without tlb flushes when clearing the A-bit.
>>>
>>>>> If we do end up with ptep_clear_young_notify(), please remove
>>>>> mmu_gather -- it should have been done in this patch.
>>>>
>>>> I suppose "remove mmu_gather" means to trigger flush tlb operation in
>>>> batched way to make sure no stale data in TLB for long time on arm64
>>>> platform.
>>>
>>> In madvise_cold_or_pageout_pte_range(), we only need struct
>>> mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
>>> tlb after clearing the A-bit. There is no correction, e.g., potential
>>> data corruption, involved there.
>>
>> From https://lore.kernel.org/lkml/[email protected]/,
>> the reason that arm64 didn't drop whole flush tlb in ptep_clear_flush_young()
>> is to prevent the stale data in TLB. I suppose there is no correction issue
>> there also.
>>
>> So why keep stale data in TLB in madvise_cold_or_pageout_pte_range() is fine?
>
> Sorry, I'm not sure I understand your question here.
Oh. Sorry for the confusion. I will explain my understanding and
question in detail.

>
> In this patch, you removed tlb_remove_tlb_entry(), so we don't need
> struct mmu_gather *tlb any more.
Yes. You are right.

>
> If you are asking why I prefer ptep_clear_young_notify() (no flush),
> which also doesn't need tlb_remove_tlb_entry(), then the answer is
> that the TLB size doesn't scale like DRAM does: the gap has been
> growing exponentially. So there is no way TLB can hold stale entries
> long enough to cause a measurable effect on the A-bit. This isn't a
> conjecture -- it's been proven conversely: we encountered bugs (almost
> every year) caused by missing TLB flushes and resulting in data
> corruption. They were never easy to reproduce, meaning stale entries
> never stayed long in TLB.

When I read https://lore.kernel.org/lkml/[email protected]/,
my understanding is that arm64 still keeps something in ptep_clear_flush_young():
the reason is to finish the TLB flush by the next context switch, to make sure
no stale entries in the TLB survive across the next context switch.

Should we keep the same behavior (no stale entries in the TLB across the next
context switch) for madvise_cold_or_pageout_pte_range()?

So two versions work for me (assuming we should keep the same behavior):
1. Use the xxx_flush_xxx() version, but with a possible regression on arm64.
2. Use the non-flush version, but do a batched TLB flush.

Hope this makes things clearer. Thanks.

Regards
Yin, Fengwei

2023-07-26 07:23:53

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries

On Tue, Jul 25, 2023 at 10:44 PM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/26/23 11:26, Yu Zhao wrote:
> > On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <[email protected]> wrote:
> >>
> >>
> >> On 7/25/23 13:55, Yu Zhao wrote:
> >>> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
> >>>>
> >>>> Currently, in function madvise_cold_or_pageout_pte_range(), the
> >>>> young bit of pte/pmd is cleared notify subscripter.
> >>>>
> >>>> Using notify-able API to make sure the subscripter is signaled about
> >>>> the young bit clearing.
> >>>>
> >>>> Signed-off-by: Yin Fengwei <[email protected]>
> >>>> ---
> >>>> mm/madvise.c | 18 ++----------------
> >>>> 1 file changed, 2 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index f12933ebcc24..b236e201a738 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> return 0;
> >>>> }
> >>>>
> >>>> - if (pmd_young(orig_pmd)) {
> >>>> - pmdp_invalidate(vma, addr, pmd);
> >>>> - orig_pmd = pmd_mkold(orig_pmd);
> >>>> -
> >>>> - set_pmd_at(mm, addr, pmd, orig_pmd);
> >>>> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> >>>> - }
> >>>> -
> >>>> + pmdp_clear_flush_young_notify(vma, addr, pmd);
> >>>> folio_clear_referenced(folio);
> >>>> folio_test_clear_young(folio);
> >>>> if (folio_test_active(folio))
> >>>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>
> >>>> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>
> >>>> - if (pte_young(ptent)) {
> >>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> - tlb->fullmm);
> >>>> - ptent = pte_mkold(ptent);
> >>>> - set_pte_at(mm, addr, pte, ptent);
> >>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> - }
> >>>> -
> >>>> + ptep_clear_flush_young_notify(vma, addr, pte);
> >>>
> >>> These two places are tricky.
> >>>
> >>> I agree there is a problem here, i.e., we are not consulting the mmu
> >>> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
> >>> known problem to me for a while (not a high priority one).
> >>>
> >>> tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
> >>> not. But, on x86, we might see a performance improvement since
> >>> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
> >>> be regressions though.
> >>>
> >>> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
> >>> prefers flush. So I'll let him chime in.
> >> I am OK with either way even no flush way here is more efficient for
> >> arm64. Let's wait for Minchan's comment.
> >
> > Yes, and I don't think there would be any "negative" consequences
> > without tlb flushes when clearing the A-bit.
> >
> >>> If we do end up with ptep_clear_young_notify(), please remove
> >>> mmu_gather -- it should have been done in this patch.
> >>
> >> I suppose "remove mmu_gather" means to trigger flush tlb operation in
> >> batched way to make sure no stale data in TLB for long time on arm64
> >> platform.
> >
> > In madvise_cold_or_pageout_pte_range(), we only need struct
> > mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
> > tlb after clearing the A-bit. There is no correction, e.g., potential
> > data corruption, involved there.
>
> From https://lore.kernel.org/lkml/[email protected]/,
> the reason that arm64 didn't drop whole flush tlb in ptep_clear_flush_young()
> is to prevent the stale data in TLB. I suppose there is no correction issue
> there also.
>
> So why keep stale data in TLB in madvise_cold_or_pageout_pte_range() is fine?

Sorry, I'm not sure I understand your question here.

In this patch, you removed tlb_remove_tlb_entry(), so we don't need
struct mmu_gather *tlb any more.

If you are asking why I prefer ptep_clear_young_notify() (no flush),
which also doesn't need tlb_remove_tlb_entry(), then the answer is
that the TLB size doesn't scale like DRAM does: the gap has been
growing exponentially. So there is no way TLB can hold stale entries
long enough to cause a measurable effect on the A-bit. This isn't a
conjecture -- it's been proven conversely: we encountered bugs (almost
every year) caused by missing TLB flushes and resulting in data
corruption. They were never easy to reproduce, meaning stale entries
never stayed long in TLB.

2023-07-27 04:36:02

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries

On Wed, Jul 26, 2023 at 12:21 AM Yin Fengwei <[email protected]> wrote:
>
>
> On 7/26/23 13:40, Yu Zhao wrote:
> > On Tue, Jul 25, 2023 at 10:44 PM Yin Fengwei <[email protected]> wrote:
> >>
> >>
> >> On 7/26/23 11:26, Yu Zhao wrote:
> >>> On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <[email protected]> wrote:
> >>>>
> >>>>
> >>>> On 7/25/23 13:55, Yu Zhao wrote:
> >>>>> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
> >>>>>>
> >>>>>> Currently, in function madvise_cold_or_pageout_pte_range(), the
> >>>>>> young bit of pte/pmd is cleared notify subscripter.
> >>>>>>
> >>>>>> Using notify-able API to make sure the subscripter is signaled about
> >>>>>> the young bit clearing.
> >>>>>>
> >>>>>> Signed-off-by: Yin Fengwei <[email protected]>
> >>>>>> ---
> >>>>>> mm/madvise.c | 18 ++----------------
> >>>>>> 1 file changed, 2 insertions(+), 16 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index f12933ebcc24..b236e201a738 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>> return 0;
> >>>>>> }
> >>>>>>
> >>>>>> - if (pmd_young(orig_pmd)) {
> >>>>>> - pmdp_invalidate(vma, addr, pmd);
> >>>>>> - orig_pmd = pmd_mkold(orig_pmd);
> >>>>>> -
> >>>>>> - set_pmd_at(mm, addr, pmd, orig_pmd);
> >>>>>> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> >>>>>> - }
> >>>>>> -
> >>>>>> + pmdp_clear_flush_young_notify(vma, addr, pmd);
> >>>>>> folio_clear_referenced(folio);
> >>>>>> folio_test_clear_young(folio);
> >>>>>> if (folio_test_active(folio))
> >>>>>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>
> >>>>>> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>>
> >>>>>> - if (pte_young(ptent)) {
> >>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>> - tlb->fullmm);
> >>>>>> - ptent = pte_mkold(ptent);
> >>>>>> - set_pte_at(mm, addr, pte, ptent);
> >>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> - }
> >>>>>> -
> >>>>>> + ptep_clear_flush_young_notify(vma, addr, pte);
> >>>>>
> >>>>> These two places are tricky.
> >>>>>
> >>>>> I agree there is a problem here, i.e., we are not consulting the mmu
> >>>>> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
> >>>>> known problem to me for a while (not a high priority one).
> >>>>>
> >>>>> tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
> >>>>> not. But, on x86, we might see a performance improvement since
> >>>>> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
> >>>>> be regressions though.
> >>>>>
> >>>>> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
> >>>>> prefers flush. So I'll let him chime in.
> >>>> I am OK with either way even no flush way here is more efficient for
> >>>> arm64. Let's wait for Minchan's comment.
> >>>
> >>> Yes, and I don't think there would be any "negative" consequences
> >>> without tlb flushes when clearing the A-bit.
> >>>
> >>>>> If we do end up with ptep_clear_young_notify(), please remove
> >>>>> mmu_gather -- it should have been done in this patch.
> >>>>
> >>>> I suppose "remove mmu_gather" means to trigger flush tlb operation in
> >>>> batched way to make sure no stale data in TLB for long time on arm64
> >>>> platform.
> >>>
> >>> In madvise_cold_or_pageout_pte_range(), we only need struct
> >>> mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
> >>> tlb after clearing the A-bit. There is no correction, e.g., potential
> >>> data corruption, involved there.
> >>
> >> From https://lore.kernel.org/lkml/[email protected]/,
> >> the reason that arm64 didn't drop whole flush tlb in ptep_clear_flush_young()
> >> is to prevent the stale data in TLB. I suppose there is no correction issue
> >> there also.
> >>
> >> So why keep stale data in TLB in madvise_cold_or_pageout_pte_range() is fine?
> >
> > Sorry, I'm not sure I understand your question here.
> Oh. Sorry for the confusion. I will explain my understanding and
> question in detail.
>
> >
> > In this patch, you removed tlb_remove_tlb_entry(), so we don't need
> > struct mmu_gather *tlb any more.
> Yes. You are right.
>
> >
> > If you are asking why I prefer ptep_clear_young_notify() (no flush),
> > which also doesn't need tlb_remove_tlb_entry(), then the answer is
> > that the TLB size doesn't scale like DRAM does: the gap has been
> > growing exponentially. So there is no way TLB can hold stale entries
> > long enough to cause a measurable effect on the A-bit. This isn't a
> > conjecture -- it's been proven conversely: we encountered bugs (almost
> > every year) caused by missing TLB flushes and resulting in data
> > corruption. They were never easy to reproduce, meaning stale entries
> > never stayed long in TLB.
>
> when I read https://lore.kernel.org/lkml/[email protected]/,
>
> my understanding is that arm64 still keep something in ptep_clear_flush_young.
> The reason is finishing TLB flush by next context-switch to make sure no
> stale entries in TLB cross next context switch.
>
> Should we still keep same behavior (no stable entries in TLB cross next
> context switch) for madvise_cold_or_pageout_pte_range()?
>
> So two versions work (I assume we should keep same behavior) for me:
> 1. using xxx_flush_xxx() version. but with possible regression on arm64.
> 2. using none flush version. But do batched TLB flush.

I see. I don't think we need to flush at all, i.e., the no flush
version *without* batched TLB flush. So far nobody can actually prove
that flushing TLB while clearing the A-bit is beneficial, not even in
theory :)

2023-07-28 17:47:15

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries



On 7/27/2023 11:28 AM, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 12:21 AM Yin Fengwei <[email protected]> wrote:
>>
>>
>> On 7/26/23 13:40, Yu Zhao wrote:
>>> On Tue, Jul 25, 2023 at 10:44 PM Yin Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>> On 7/26/23 11:26, Yu Zhao wrote:
>>>>> On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 7/25/23 13:55, Yu Zhao wrote:
>>>>>>> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Currently, in function madvise_cold_or_pageout_pte_range(), the
>>>>>>>> young bit of pte/pmd is cleared notify subscripter.
>>>>>>>>
>>>>>>>> Using notify-able API to make sure the subscripter is signaled about
>>>>>>>> the young bit clearing.
>>>>>>>>
>>>>>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>>>>>> ---
>>>>>>>> mm/madvise.c | 18 ++----------------
>>>>>>>> 1 file changed, 2 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>>>> index f12933ebcc24..b236e201a738 100644
>>>>>>>> --- a/mm/madvise.c
>>>>>>>> +++ b/mm/madvise.c
>>>>>>>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>> return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - if (pmd_young(orig_pmd)) {
>>>>>>>> - pmdp_invalidate(vma, addr, pmd);
>>>>>>>> - orig_pmd = pmd_mkold(orig_pmd);
>>>>>>>> -
>>>>>>>> - set_pmd_at(mm, addr, pmd, orig_pmd);
>>>>>>>> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> + pmdp_clear_flush_young_notify(vma, addr, pmd);
>>>>>>>> folio_clear_referenced(folio);
>>>>>>>> folio_test_clear_young(folio);
>>>>>>>> if (folio_test_active(folio))
>>>>>>>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>>
>>>>>>>> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>>
>>>>>>>> - if (pte_young(ptent)) {
>>>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>> - tlb->fullmm);
>>>>>>>> - ptent = pte_mkold(ptent);
>>>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> + ptep_clear_flush_young_notify(vma, addr, pte);
>>>>>>>
>>>>>>> These two places are tricky.
>>>>>>>
>>>>>>> I agree there is a problem here, i.e., we are not consulting the mmu
>>>>>>> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
>>>>>>> known problem to me for a while (not a high priority one).
>>>>>>>
>>>>>>> tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
>>>>>>> not. But, on x86, we might see a performance improvement since
>>>>>>> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
>>>>>>> be regressions though.
>>>>>>>
>>>>>>> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
>>>>>>> prefers flush. So I'll let him chime in.
>>>>>> I am OK with either way even no flush way here is more efficient for
>>>>>> arm64. Let's wait for Minchan's comment.
>>>>>
>>>>> Yes, and I don't think there would be any "negative" consequences
>>>>> without tlb flushes when clearing the A-bit.
>>>>>
>>>>>>> If we do end up with ptep_clear_young_notify(), please remove
>>>>>>> mmu_gather -- it should have been done in this patch.
>>>>>>
>>>>>> I suppose "remove mmu_gather" means to trigger flush tlb operation in
>>>>>> batched way to make sure no stale data in TLB for long time on arm64
>>>>>> platform.
>>>>>
>>>>> In madvise_cold_or_pageout_pte_range(), we only need struct
>>>>> mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
>>>>> tlb after clearing the A-bit. There is no correction, e.g., potential
>>>>> data corruption, involved there.
>>>>
>>>> From https://lore.kernel.org/lkml/[email protected]/,
>>>> the reason that arm64 didn't drop whole flush tlb in ptep_clear_flush_young()
>>>> is to prevent the stale data in TLB. I suppose there is no correction issue
>>>> there also.
>>>>
>>>> So why keep stale data in TLB in madvise_cold_or_pageout_pte_range() is fine?
>>>
>>> Sorry, I'm not sure I understand your question here.
>> Oh. Sorry for the confusion. I will explain my understanding and
>> question in detail.
>>
>>>
>>> In this patch, you removed tlb_remove_tlb_entry(), so we don't need
>>> struct mmu_gather *tlb any more.
>> Yes. You are right.
>>
>>>
>>> If you are asking why I prefer ptep_clear_young_notify() (no flush),
>>> which also doesn't need tlb_remove_tlb_entry(), then the answer is
>>> that the TLB size doesn't scale like DRAM does: the gap has been
>>> growing exponentially. So there is no way TLB can hold stale entries
>>> long enough to cause a measurable effect on the A-bit. This isn't a
>>> conjecture -- it's been proven conversely: we encountered bugs (almost
>>> every year) caused by missing TLB flushes and resulting in data
>>> corruption. They were never easy to reproduce, meaning stale entries
>>> never stayed long in TLB.
>>
>> when I read https://lore.kernel.org/lkml/[email protected]/,
>>
>> my understanding is that arm64 still keep something in ptep_clear_flush_young.
>> The reason is finishing TLB flush by next context-switch to make sure no
>> stale entries in TLB cross next context switch.
>>
>> Should we still keep same behavior (no stable entries in TLB cross next
>> context switch) for madvise_cold_or_pageout_pte_range()?
>>
>> So two versions work (I assume we should keep same behavior) for me:
>> 1. using xxx_flush_xxx() version. but with possible regression on arm64.
>> 2. using none flush version. But do batched TLB flush.
>
> I see. I don't think we need to flush at all, i.e., the no flush
> version *without* batched TLB flush. So far nobody can actually prove
> that flushing TLB while clearing the A-bit is beneficial, not even in
> theory :)

I will just send the fix for folio_mapcount() (with your Reviewed-by) as
it is a bug fix and is better merged standalone.

The other three patches need more time for discussion.

Regards
Yin, Fengwei