2023-07-13 15:17:12

by Yin, Fengwei

Subject: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

Current madvise_cold_or_pageout_pte_range() has two problems for
large folio support:
- Using folio_mapcount() with a large folio prevents the large folio
from being picked up.
- If a large folio is fully within the requested range, it shouldn't
be split in madvise_cold_or_pageout_pte_range().

Fix them by:
- Use folio_estimated_sharers() with large folios.
- If a large folio is fully within the requested range, don't split
it. Leave it to the page reclaim phase.

For a large folio that crosses the boundaries of the requested range,
skip it if it's page cache. Try to split it if it's an anonymous
folio. If splitting fails, skip it.

The main reason to call folio_referenced() is to clear the young bit
of the corresponding PTEs. So in the page reclaim phase, there is a
good chance the folio can be reclaimed.

Signed-off-by: Yin Fengwei <[email protected]>
---
This patch is based on mlock large folio support rfc2 as it depends
on the folio_in_range() added by that patchset.

Also folio_op_size() can be unified with get_folio_mlock_step().

Testing done:
- kselftest: No new regression introduced.

mm/madvise.c | 133 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 84 insertions(+), 49 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 38382a5d1e393..5748cf098235d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -31,6 +31,7 @@
#include <linux/swapops.h>
#include <linux/shmem_fs.h>
#include <linux/mmu_notifier.h>
+#include <linux/kernel.h>

#include <asm/tlb.h>

@@ -339,6 +340,35 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
file_permission(vma->vm_file, MAY_WRITE) == 0;
}

+static inline bool skip_current_entry(struct folio *folio, bool pageout_anon)
+{
+ if (!folio)
+ return true;
+
+ if (folio_is_zone_device(folio))
+ return true;
+
+ if (!folio_test_lru(folio))
+ return true;
+
+ if (pageout_anon && !folio_test_anon(folio))
+ return true;
+
+ if (folio_test_unevictable(folio))
+ return true;
+
+ return false;
+}
+
+static inline unsigned int folio_op_size(struct folio *folio, pte_t pte,
+ unsigned long addr, unsigned long end)
+{
+ unsigned int nr;
+
+ nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
+ return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
+}
+
static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
@@ -353,6 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
struct folio *folio = NULL;
LIST_HEAD(folio_list);
bool pageout_anon_only_filter;
+ unsigned long start = addr;

if (fatal_signal_pending(current))
return -EINTR;
@@ -383,7 +414,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));

/* Do not interfere with other mappings of this folio */
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto huge_unlock;

if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -442,78 +473,60 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
for (; addr < end; pte++, addr += PAGE_SIZE) {
ptent = ptep_get(pte);

- if (pte_none(ptent))
- continue;
-
- if (!pte_present(ptent))
+ if (pte_none(ptent) || !pte_present(ptent))
continue;

folio = vm_normal_folio(vma, addr, ptent);
- if (!folio || folio_is_zone_device(folio))
+ if (skip_current_entry(folio, pageout_anon_only_filter))
continue;

/*
- * Creating a THP page is expensive so split it only if we
- * are sure it's worth. Split it if we are only owner.
+ * Split large folio if it's anonymous and cross the
+ * boundaries of request range.
*/
if (folio_test_large(folio)) {
- int err;
+ int err, step;
+
+ if (folio_estimated_sharers(folio) != 1)
+ continue;
+
+ if (folio_in_range(folio, vma, start, end))
+ goto pageout_cold_folio;

- if (folio_mapcount(folio) != 1)
- break;
- if (pageout_anon_only_filter && !folio_test_anon(folio))
- break;
- if (!folio_trylock(folio))
- break;
folio_get(folio);
+ step = folio_op_size(folio, ptent, addr, end);
+ if (!folio_test_anon(folio) || !folio_trylock(folio)) {
+ folio_put(folio);
+ goto next_folio;
+ }
+
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(start_pte, ptl);
start_pte = NULL;
err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- if (err)
- break;
+
start_pte = pte =
pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!start_pte)
break;
arch_enter_lazy_mmu_mode();
- pte--;
- addr -= PAGE_SIZE;
- continue;
- }

- /*
- * Do not interfere with other mappings of this folio and
- * non-LRU folio.
- */
- if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
+ /* Skip the folio if split fails */
+ if (!err)
+ step = 0;
+next_folio:
+ pte += step - 1;
+ addr += (step - 1) << PAGE_SHIFT;
continue;
+ }

- if (pageout_anon_only_filter && !folio_test_anon(folio))
+ /* Do not interfere with other mappings of this folio */
+ if (folio_mapcount(folio) != 1)
continue;

- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
- if (pte_young(ptent)) {
- ptent = ptep_get_and_clear_full(mm, addr, pte,
- tlb->fullmm);
- ptent = pte_mkold(ptent);
- set_pte_at(mm, addr, pte, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- }
-
- /*
- * We are deactivating a folio for accelerating reclaiming.
- * VM couldn't reclaim the folio unless we clear PG_young.
- * As a side effect, it makes confuse idle-page tracking
- * because they will miss recent referenced history.
- */
- folio_clear_referenced(folio);
- folio_test_clear_young(folio);
- if (folio_test_active(folio))
- folio_set_workingset(folio);
+pageout_cold_folio:
if (pageout) {
if (folio_isolate_lru(folio)) {
if (folio_test_unevictable(folio))
@@ -529,8 +542,30 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(start_pte, ptl);
}
- if (pageout)
- reclaim_pages(&folio_list);
+
+ if (pageout) {
+ LIST_HEAD(reclaim_list);
+
+ while (!list_empty(&folio_list)) {
+ int refs;
+ unsigned long flags;
+ struct mem_cgroup *memcg = folio_memcg(folio);
+
+ folio = lru_to_folio(&folio_list);
+ list_del(&folio->lru);
+
+ refs = folio_referenced(folio, 0, memcg, &flags);
+
+ if ((flags & VM_LOCKED) || (refs == -1)) {
+ folio_putback_lru(folio);
+ continue;
+ }
+
+ folio_test_clear_referenced(folio);
+ list_add(&folio->lru, &reclaim_list);
+ }
+ reclaim_pages(&reclaim_list);
+ }
cond_resched();

return 0;
--
2.39.2



2023-07-14 02:22:23

by Yu Zhao

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>
> Current madvise_cold_or_pageout_pte_range() has two problems for
> large folio support:
> - Using folio_mapcount() with large folio prevent large folio from
> picking up.
> - If large folio is in the range requested, shouldn't split it
> in madvise_cold_or_pageout_pte_range().
>
> Fix them by:
> - Use folio_estimated_sharers() with large folio
> - If large folio is in the range requested, don't split it. Leave
> to page reclaim phase.
>
> For large folio cross boundaries of requested range, skip it if it's
> page cache. Try to split it if it's anonymous folio. If splitting
> fails, skip it.

For now, we may not want to change the existing semantic (heuristic).
IOW, we may want to stick to the "only owner" condition:

- if (folio_mapcount(folio) != 1)
+ if (folio_entire_mapcount(folio) ||
+ (any_page_within_range_has_mapcount > 1))

+Minchan Kim

Also there is an existing bug here: the later commit 07e8c82b5eff8
("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
is incorrect for sure; the original commit 9c276cc65a58f ("mm:
introduce MADV_COLD") seems incorrect too.

+Vishal Moola (Oracle)

The "any_page_within_range_has_mapcount" test above seems to be the
only correct to meet condition claimed by the comments, before or
after the folio conversion, assuming here a THP page means the
compound page without PMD mappings (PMD-split). Otherwise the test is
always false (if it's also PMD mapped somewhere else).

/*
* Creating a THP page is expensive so split it only if we
* are sure it's worth. Split it if we are only owner.
*/
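
As a rough sketch (the helper and the way the range is passed are made
up here purely for illustration, not an existing kernel API), the
per-subpage part of that test could look like:

static bool any_page_within_range_has_mapcount(struct folio *folio,
					       unsigned long start_idx,
					       unsigned long nr_pages)
{
	unsigned long i;

	/* True if any subpage in the advised range is mapped elsewhere. */
	for (i = start_idx; i < start_idx + nr_pages; i++) {
		if (page_mapcount(folio_page(folio, i)) > 1)
			return true;
	}
	return false;
}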

> The main reason to call folio_referenced() is to clear the yong of
> conresponding PTEs. So in page reclaim phase, there is good chance
> the folio can be reclaimed.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> This patch is based on mlock large folio support rfc2 as it depends
> on the folio_in_range() added by that patchset
>
> Also folio_op_size() can be unitfied with get_folio_mlock_step().
>
> Testing done:
> - kselftest: No new regression introduced.
>
> mm/madvise.c | 133 ++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 84 insertions(+), 49 deletions(-)

Also the refactor looks fine to me but it'd be better if it's a separate patch.

2023-07-14 03:17:49

by Yin, Fengwei

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio



On 7/14/2023 10:08 AM, Yu Zhao wrote:
> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>
>> Current madvise_cold_or_pageout_pte_range() has two problems for
>> large folio support:
>> - Using folio_mapcount() with large folio prevent large folio from
>> picking up.
>> - If large folio is in the range requested, shouldn't split it
>> in madvise_cold_or_pageout_pte_range().
>>
>> Fix them by:
>> - Use folio_estimated_sharers() with large folio
>> - If large folio is in the range requested, don't split it. Leave
>> to page reclaim phase.
>>
>> For large folio cross boundaries of requested range, skip it if it's
>> page cache. Try to split it if it's anonymous folio. If splitting
>> fails, skip it.
>
> For now, we may not want to change the existing semantic (heuristic).
> IOW, we may want to stick to the "only owner" condition:
>
> - if (folio_mapcount(folio) != 1)
> + if (folio_entire_mapcount(folio) ||
> + (any_page_within_range_has_mapcount > 1))
>
> +Minchan Kim
The folio_estimated_sharers() was discussed here:
https://lore.kernel.org/linux-mm/[email protected]/
https://lore.kernel.org/linux-mm/[email protected]/

Yes. It's accurate to check each page of a large folio. But it may be
overkill in some cases (and I think madvise is one of the cases where
it's not necessary to be accurate, so folio_estimated_sharers() is
enough. Correct me if I am wrong).

>
> Also there is an existing bug here: the later commit 07e8c82b5eff8
> ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
> is incorrect for sure; the original commit 9c276cc65a58f ("mm:
> introduce MADV_COLD") seems incorrect too.
>
> +Vishal Moola (Oracle)
>
> The "any_page_within_range_has_mapcount" test above seems to be the
> only correct to meet condition claimed by the comments, before or
> after the folio conversion, assuming here a THP page means the
> compound page without PMD mappings (PMD-split). Otherwise the test is
> always false (if it's also PMD mapped somewhere else).
>
> /*
> * Creating a THP page is expensive so split it only if we
> * are sure it's worth. Split it if we are only owner.
> */
>
>> The main reason to call folio_referenced() is to clear the yong of
>> conresponding PTEs. So in page reclaim phase, there is good chance
>> the folio can be reclaimed.
>>
>> Signed-off-by: Yin Fengwei <[email protected]>
>> ---
>> This patch is based on mlock large folio support rfc2 as it depends
>> on the folio_in_range() added by that patchset
>>
>> Also folio_op_size() can be unitfied with get_folio_mlock_step().
>>
>> Testing done:
>> - kselftest: No new regression introduced.
>>
>> mm/madvise.c | 133 ++++++++++++++++++++++++++++++++-------------------
>> 1 file changed, 84 insertions(+), 49 deletions(-)
>
> Also the refactor looks fine to me but it'd be better if it's a separate patch.
OK. I will split the bug fix and the refactor into two different patches. Thanks.


Regards
Yin, Fengwei

2023-07-14 03:48:09

by Yu Zhao

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/14/2023 10:08 AM, Yu Zhao wrote:
> > On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
> >>
> >> Current madvise_cold_or_pageout_pte_range() has two problems for
> >> large folio support:
> >> - Using folio_mapcount() with large folio prevent large folio from
> >> picking up.
> >> - If large folio is in the range requested, shouldn't split it
> >> in madvise_cold_or_pageout_pte_range().
> >>
> >> Fix them by:
> >> - Use folio_estimated_sharers() with large folio
> >> - If large folio is in the range requested, don't split it. Leave
> >> to page reclaim phase.
> >>
> >> For large folio cross boundaries of requested range, skip it if it's
> >> page cache. Try to split it if it's anonymous folio. If splitting
> >> fails, skip it.
> >
> > For now, we may not want to change the existing semantic (heuristic).
> > IOW, we may want to stick to the "only owner" condition:
> >
> > - if (folio_mapcount(folio) != 1)
> > + if (folio_entire_mapcount(folio) ||
> > + (any_page_within_range_has_mapcount > 1))
> >
> > +Minchan Kim
> The folio_estimated_sharers() was discussed here:
> https://lore.kernel.org/linux-mm/[email protected]/
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Yes. It's accurate to check each page of large folio. But it may be over killed in
> some cases (And I think madvise is one of the cases not necessary to be accurate.
> So folio_estimated_sharers() is enough. Correct me if I am wrong).

I see. Then it's possible this is also what the original commit wants
to do -- Minchan, could you clarify?

Regardless, I think we can have the following fix, potentially cc'ing stable:

- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)

Sounds good?

> > Also there is an existing bug here: the later commit 07e8c82b5eff8
> > ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
> > is incorrect for sure; the original commit 9c276cc65a58f ("mm:
> > introduce MADV_COLD") seems incorrect too.
> >
> > +Vishal Moola (Oracle)
> >
> > The "any_page_within_range_has_mapcount" test above seems to be the
> > only correct to meet condition claimed by the comments, before or
> > after the folio conversion, assuming here a THP page means the
> > compound page without PMD mappings (PMD-split). Otherwise the test is
> > always false (if it's also PMD mapped somewhere else).
> >
> > /*
> > * Creating a THP page is expensive so split it only if we
> > * are sure it's worth. Split it if we are only owner.
> > */

2023-07-14 04:45:14

by Yu Zhao

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>
> Current madvise_cold_or_pageout_pte_range() has two problems for
> large folio support:
> - Using folio_mapcount() with large folio prevent large folio from
> picking up.
> - If large folio is in the range requested, shouldn't split it
> in madvise_cold_or_pageout_pte_range().
>
> Fix them by:
> - Use folio_estimated_sharers() with large folio
> - If large folio is in the range requested, don't split it. Leave
> to page reclaim phase.
>
> For large folio cross boundaries of requested range, skip it if it's
> page cache. Try to split it if it's anonymous folio. If splitting
> fails, skip it.
>
> The main reason to call folio_referenced() is to clear the yong of
> conresponding PTEs. So in page reclaim phase, there is good chance
> the folio can be reclaimed.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> This patch is based on mlock large folio support rfc2 as it depends
> on the folio_in_range() added by that patchset
>
> Also folio_op_size() can be unitfied with get_folio_mlock_step().
>
> Testing done:
> - kselftest: No new regression introduced.
>
> mm/madvise.c | 133 ++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 84 insertions(+), 49 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 38382a5d1e393..5748cf098235d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -31,6 +31,7 @@
> #include <linux/swapops.h>
> #include <linux/shmem_fs.h>
> #include <linux/mmu_notifier.h>
> +#include <linux/kernel.h>
>
> #include <asm/tlb.h>
>
> @@ -339,6 +340,35 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
> file_permission(vma->vm_file, MAY_WRITE) == 0;
> }
>
> +static inline bool skip_current_entry(struct folio *folio, bool pageout_anon)
> +{
> + if (!folio)
> + return true;
> +
> + if (folio_is_zone_device(folio))
> + return true;
> +
> + if (!folio_test_lru(folio))
> + return true;
> +
> + if (pageout_anon && !folio_test_anon(folio))
> + return true;
> +
> + if (folio_test_unevictable(folio))
> + return true;
> +
> + return false;
> +}
> +
> +static inline unsigned int folio_op_size(struct folio *folio, pte_t pte,
> + unsigned long addr, unsigned long end)
> +{
> + unsigned int nr;
> +
> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> +}
> +
> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct mm_walk *walk)
> @@ -353,6 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> struct folio *folio = NULL;
> LIST_HEAD(folio_list);
> bool pageout_anon_only_filter;
> + unsigned long start = addr;
>
> if (fatal_signal_pending(current))
> return -EINTR;
> @@ -383,7 +414,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> folio = pfn_folio(pmd_pfn(orig_pmd));
>
> /* Do not interfere with other mappings of this folio */
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)
> goto huge_unlock;
>
> if (pageout_anon_only_filter && !folio_test_anon(folio))
> @@ -442,78 +473,60 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> for (; addr < end; pte++, addr += PAGE_SIZE) {
> ptent = ptep_get(pte);
>
> - if (pte_none(ptent))
> - continue;
> -
> - if (!pte_present(ptent))
> + if (pte_none(ptent) || !pte_present(ptent))
> continue;
>
> folio = vm_normal_folio(vma, addr, ptent);
> - if (!folio || folio_is_zone_device(folio))
> + if (skip_current_entry(folio, pageout_anon_only_filter))
> continue;
>
> /*
> - * Creating a THP page is expensive so split it only if we
> - * are sure it's worth. Split it if we are only owner.
> + * Split large folio if it's anonymous and cross the
> + * boundaries of request range.
> */
> if (folio_test_large(folio)) {
> - int err;
> + int err, step;
> +
> + if (folio_estimated_sharers(folio) != 1)
> + continue;
> +
> + if (folio_in_range(folio, vma, start, end))
> + goto pageout_cold_folio;
>
> - if (folio_mapcount(folio) != 1)
> - break;
> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> - break;
> - if (!folio_trylock(folio))
> - break;
> folio_get(folio);
> + step = folio_op_size(folio, ptent, addr, end);
> + if (!folio_test_anon(folio) || !folio_trylock(folio)) {
> + folio_put(folio);
> + goto next_folio;
> + }
> +
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(start_pte, ptl);
> start_pte = NULL;
> err = split_folio(folio);
> folio_unlock(folio);
> folio_put(folio);
> - if (err)
> - break;
> +
> start_pte = pte =
> pte_offset_map_lock(mm, pmd, addr, &ptl);
> if (!start_pte)
> break;
> arch_enter_lazy_mmu_mode();
> - pte--;
> - addr -= PAGE_SIZE;
> - continue;
> - }
>
> - /*
> - * Do not interfere with other mappings of this folio and
> - * non-LRU folio.
> - */
> - if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> + /* Skip the folio if split fails */
> + if (!err)
> + step = 0;
> +next_folio:
> + pte += step - 1;
> + addr += (step - 1) << PAGE_SHIFT;
> continue;
> + }
>
> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> + /* Do not interfere with other mappings of this folio */
> + if (folio_mapcount(folio) != 1)
> continue;
>
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> - if (pte_young(ptent)) {
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> - ptent = pte_mkold(ptent);
> - set_pte_at(mm, addr, pte, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> - }
> -
> - /*
> - * We are deactivating a folio for accelerating reclaiming.
> - * VM couldn't reclaim the folio unless we clear PG_young.
> - * As a side effect, it makes confuse idle-page tracking
> - * because they will miss recent referenced history.
> - */
> - folio_clear_referenced(folio);
> - folio_test_clear_young(folio);
> - if (folio_test_active(folio))
> - folio_set_workingset(folio);
> +pageout_cold_folio:
> if (pageout) {
> if (folio_isolate_lru(folio)) {
> if (folio_test_unevictable(folio))
> @@ -529,8 +542,30 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(start_pte, ptl);
> }
> - if (pageout)
> - reclaim_pages(&folio_list);
> +
> + if (pageout) {
> + LIST_HEAD(reclaim_list);
> +
> + while (!list_empty(&folio_list)) {
> + int refs;
> + unsigned long flags;
> + struct mem_cgroup *memcg = folio_memcg(folio);
> +
> + folio = lru_to_folio(&folio_list);
> + list_del(&folio->lru);
> +
> + refs = folio_referenced(folio, 0, memcg, &flags);
> +
> + if ((flags & VM_LOCKED) || (refs == -1)) {
> + folio_putback_lru(folio);
> + continue;
> + }
> +
> + folio_test_clear_referenced(folio);
> + list_add(&folio->lru, &reclaim_list);
> + }
> + reclaim_pages(&reclaim_list);
> + }

I overlooked the chunk above -- it's unnecessary: after we split the
large folio (and splice the base folios onto the same LRU list), we
continue at the position of the first base folio because of:

pte--;
addr -= PAGE_SIZE;
continue;

And then we do pte_mkold(), which takes care of the A-bit.

2023-07-14 06:13:57

by Yin, Fengwei

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio


>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>> + /* Do not interfere with other mappings of this folio */
>> + if (folio_mapcount(folio) != 1)
>> continue;
>>
>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>> - if (pte_young(ptent)) {
>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>> - tlb->fullmm);
>> - ptent = pte_mkold(ptent);
>> - set_pte_at(mm, addr, pte, ptent);
>> - tlb_remove_tlb_entry(tlb, pte, addr);
>> - }
>> -
>> - /*
>> - * We are deactivating a folio for accelerating reclaiming.
>> - * VM couldn't reclaim the folio unless we clear PG_young.
>> - * As a side effect, it makes confuse idle-page tracking
>> - * because they will miss recent referenced history.
>> - */
>> - folio_clear_referenced(folio);
>> - folio_test_clear_young(folio);
>> - if (folio_test_active(folio))
>> - folio_set_workingset(folio);
>> +pageout_cold_folio:
>> if (pageout) {
>> if (folio_isolate_lru(folio)) {
>> if (folio_test_unevictable(folio))
>> @@ -529,8 +542,30 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> arch_leave_lazy_mmu_mode();
>> pte_unmap_unlock(start_pte, ptl);
>> }
>> - if (pageout)
>> - reclaim_pages(&folio_list);
>> +
>> + if (pageout) {
>> + LIST_HEAD(reclaim_list);
>> +
>> + while (!list_empty(&folio_list)) {
>> + int refs;
>> + unsigned long flags;
>> + struct mem_cgroup *memcg = folio_memcg(folio);
>> +
>> + folio = lru_to_folio(&folio_list);
>> + list_del(&folio->lru);
>> +
>> + refs = folio_referenced(folio, 0, memcg, &flags);
>> +
>> + if ((flags & VM_LOCKED) || (refs == -1)) {
>> + folio_putback_lru(folio);
>> + continue;
>> + }
>> +
>> + folio_test_clear_referenced(folio);
>> + list_add(&folio->lru, &reclaim_list);
>> + }
>> + reclaim_pages(&reclaim_list);
>> + }
>
> i overlooked the chunk above -- it's unnecessary: after we split the
> large folio (and splice the base folios onto the same LRU list), we
> continue at the position of the first base folio because of:
>
> pte--;
> addr -= PAGE_SIZE;
> continue;
>
> And then we do pte_mkold(), which takes care of the A-bit.
This patch moves the A-bit clearing out of the folio isolation loop. So
even if the folio is split and the loop restarts from the first base
folio, the A-bit is not cleared. The A-bit is only cleared in the
reclaim loop.

There is one option for A-bit clearing:
- clear the A-bit of base 4K pages in the isolation loop and leave
large folio A-bit clearing to the reclaim loop (a rough sketch of this
follows below).
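
As an illustration only (this mirrors the pte_mkold() sequence the
patch removes, guarded by a large-folio check; it is not something the
patch actually does):

	if (!folio_test_large(folio) && pte_young(ptent)) {
		ptent = ptep_get_and_clear_full(mm, addr, pte,
						tlb->fullmm);
		ptent = pte_mkold(ptent);
		set_pte_at(mm, addr, pte, ptent);
		tlb_remove_tlb_entry(tlb, pte, addr);
	}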

This patch didn't use it because I don't want to introduce A-bit
clearing in two places. But I am open to clearing the A-bit of base 4K
pages in the isolation loop. Thanks.


Regards
Yin, Fengwei


2023-07-14 07:53:19

by David Hildenbrand

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On 14.07.23 05:23, Yu Zhao wrote:
> On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/14/2023 10:08 AM, Yu Zhao wrote:
>>> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>>>
>>>> Current madvise_cold_or_pageout_pte_range() has two problems for
>>>> large folio support:
>>>> - Using folio_mapcount() with large folio prevent large folio from
>>>> picking up.
>>>> - If large folio is in the range requested, shouldn't split it
>>>> in madvise_cold_or_pageout_pte_range().
>>>>
>>>> Fix them by:
>>>> - Use folio_estimated_sharers() with large folio
>>>> - If large folio is in the range requested, don't split it. Leave
>>>> to page reclaim phase.
>>>>
>>>> For large folio cross boundaries of requested range, skip it if it's
>>>> page cache. Try to split it if it's anonymous folio. If splitting
>>>> fails, skip it.
>>>
>>> For now, we may not want to change the existing semantic (heuristic).
>>> IOW, we may want to stick to the "only owner" condition:
>>>
>>> - if (folio_mapcount(folio) != 1)
>>> + if (folio_entire_mapcount(folio) ||
>>> + (any_page_within_range_has_mapcount > 1))
>>>
>>> +Minchan Kim
>> The folio_estimated_sharers() was discussed here:
>> https://lore.kernel.org/linux-mm/[email protected]/
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Yes. It's accurate to check each page of large folio. But it may be over killed in
>> some cases (And I think madvise is one of the cases not necessary to be accurate.
>> So folio_estimated_sharers() is enough. Correct me if I am wrong).
>
> I see. Then it's possible this is also what the original commit wants
> to do -- Minchan, could you clarify?
>
> Regardless, I think we can have the following fix, potentially cc'ing stable:
>
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)
>
> Sounds good?

Adding to the discussion, currently the COW selftest always skips a
PTE-mapped THP.


For example:

# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
ok 2 No leak from parent into child
# [RUN] Basic COW after fork() ... with THP
ok 3 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out THP
ok 4 No leak from parent into child
# [RUN] Basic COW after fork() ... with PTE-mapped THP
ok 5 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
...


The commit that introduced that change is:

commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
Author: Vishal Moola (Oracle) <[email protected]>
Date: Wed Dec 21 10:08:46 2022 -0800

madvise: convert madvise_cold_or_pageout_pte_range() to use folios

This change removes a number of calls to compound_head(), and saves
1729 bytes of kernel text.



folio_mapcount(folio) is wrong, because that never works on a PTE-mapped
THP (well, unless only a single subpage is still mapped ...).

page_mapcount(folio) was wrong, because it ignored all other subpages,
but at least it worked in some cases.

folio_estimated_sharers(folio) is similarly wrong, like page_mapcount(),
as it's essentially a page_mapcount() of the first subpage.

(ignoring that a lockless mapcount-based check is always kind-of
unreliable, but that's mostly acceptable for these kinds of things)
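
For reference, folio_estimated_sharers() is currently roughly the
following (paraphrased from include/linux/mm.h, just to make the
"first subpage" point concrete):

static inline int folio_estimated_sharers(struct folio *folio)
{
	/* Only the mapcount of the first subpage is consulted. */
	return page_mapcount(folio_page(folio, 0));
}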


So, unfortunately, page_mapcount() / folio_estimated_sharers() is the
best we can do for now, but they miss some cases of sharing of the
folio -- false negatives when detecting sharing.


Ideally we want something like folio_maybe_mapped_shared(), and get rid
of folio_estimated_sharers(): rather than trying to guess the exact
number, it simply works towards an answer that tells us "yep, this may
be mapped by multiple sharers" vs. "no, this is definitely not mapped
by multiple sharers".

The "mapped" part of it indicates that this does not catch all cases of
sharing. But it should handle most of the cases we care about.


There, we can then implement something better than what
folio_estimated_sharers() currently does:

static inline bool folio_maybe_mapped_shared(struct folio *folio)
{
	if (likely(!folio_test_large(folio)))
		return atomic_read(&folio->_mapcount) > 0;

	/* Mapped multiple times via PMD? */
	if (folio_test_pmd_mappable(folio))
		return folio_entire_mapcount(folio) > 1;

	/*
	 * First subpage is mapped multiple times (especially also via
	 * PMDs)?
	 */
	if (page_mapcount(folio_page(folio, 0)) > 1)
		return true;

	/* TODO: also test last subpage? */

	/* Definitely shared if we're mapping a page multiple times. */
	return folio_total_mapcount(folio) > folio_nr_pages(folio);
}

There are some more things we could optimize for.



While it's better than what we have right now:

(a) It's racy. Well, it's always been racy with concurrent (un)mapping
and splitting. But maybe we can do better.

(b) folio_total_mapcount() is currently expensive.

(c) there are still false negatives even without races.


For anon pages, we could scan all subpages and test if they are
PageAnonExclusive, but it's also not really what we want to do here.


I have some ideas to handle anon pages better to avoid any page table
walk or subpage scan, improving (a), (b) and (c). It might work for
pagecache pages with some more work, but it's a bit more complicated
with the scheme I have in mind.


First step would be replacing folio->_nr_pages_mapped by
folio->_total_mapcount. While we could eventually have
folio->_total_mapcount in addition to folio->_nr_pages_mapped, I'm not
sure if we want to go down that path.


That would make folio_total_mapcount() extremely fast (I'm working on a
prototype). The downsides are that

(a) We have to make NR_ANON_MAPPED/NR_FILE_MAPPED accounting less
precise. Easiest way to handle it: as soon as a single subpage is
mapped, account the whole folio as mapped. After all, it's consuming
memory, so who cares?

(b) We have to find a different way to decide when to put an anonymous
folio on the deferred split queue in page_remove_rmap(). Some cases are
nasty to handle: PTE-mapped THP that are shared between a parent and a
child.

I have to do some more thinking about (b), I'd be happy about thoughts
on that.

--
Cheers,

David / dhildenb


2023-07-14 09:02:19

by Yin, Fengwei

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

Hi David,

On 7/14/2023 3:31 PM, David Hildenbrand wrote:
> On 14.07.23 05:23, Yu Zhao wrote:
>> On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>>>
>>>
>>>
>>> On 7/14/2023 10:08 AM, Yu Zhao wrote:
>>>> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>> Current madvise_cold_or_pageout_pte_range() has two problems for
>>>>> large folio support:
>>>>>    - Using folio_mapcount() with large folio prevent large folio from
>>>>>      picking up.
>>>>>    - If large folio is in the range requested, shouldn't split it
>>>>>      in madvise_cold_or_pageout_pte_range().
>>>>>
>>>>> Fix them by:
>>>>>    - Use folio_estimated_sharers() with large folio
>>>>>    - If large folio is in the range requested, don't split it. Leave
>>>>>      to page reclaim phase.
>>>>>
>>>>> For large folio cross boundaries of requested range, skip it if it's
>>>>> page cache. Try to split it if it's anonymous folio. If splitting
>>>>> fails, skip it.
>>>>
>>>> For now, we may not want to change the existing semantic (heuristic).
>>>> IOW, we may want to stick to the "only owner" condition:
>>>>
>>>>    - if (folio_mapcount(folio) != 1)
>>>>    + if (folio_entire_mapcount(folio) ||
>>>>    +     (any_page_within_range_has_mapcount > 1))
>>>>
>>>> +Minchan Kim
>>> The folio_estimated_sharers() was discussed here:
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> Yes. It's accurate to check each page of large folio. But it may be over killed in
>>> some cases (And I think madvise is one of the cases not necessary to be accurate.
>>> So folio_estimated_sharers() is enough. Correct me if I am wrong).
>>
>> I see. Then it's possible this is also what the original commit wants
>> to do -- Minchan, could you clarify?
>>
>> Regardless, I think we can have the following fix, potentially cc'ing stable:
>>
>> -  if (folio_mapcount(folio) != 1)
>> +  if (folio_estimated_sharers(folio) != 1)
>>
>> Sounds good?
>
> Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
You always have a very good summary of the situation. Thanks a lot for
adding the following information.

Adding Zi Yan as this is still about the mapcount of the folio.

>
>
> For example:
>
> # [INFO] Anonymous memory tests in private mappings
> # [RUN] Basic COW after fork() ... with base page
> ok 1 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped out base page
> ok 2 No leak from parent into child
> # [RUN] Basic COW after fork() ... with THP
> ok 3 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out THP
> ok 4 No leak from parent into child
> # [RUN] Basic COW after fork() ... with PTE-mapped THP
> ok 5 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
> ...
>
>
> The commit that introduced that change is:
>
> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
> Author: Vishal Moola (Oracle) <[email protected]>
> Date:   Wed Dec 21 10:08:46 2022 -0800
>
>     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>
>     This change removes a number of calls to compound_head(), and saves
>     1729 bytes of kernel text.
>
>
>
> folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
>
> page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
>
> folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
>
> (ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
>
>
> So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
>
>
> Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
>
So you want an accurate number. My understanding is that it's required for the COW case.

> The "mapped" part of it indicates that this does not catch all cases of sharing. But it should handle most of the cases we care about.
>
>
> There, we can then implement something better than what folio_estimated_sharers() currently does:
>
> static inline bool folio_maybe_mapped_shared(folio)
> {
>     if (likely(!folio_test_large(folio)))
>         return atomic_read(&folio->_mapcount) > 0;
>
>     /* Mapped multiple times via PMD? */
>     if (folio_test_pmd_mappable(folio)
>         return folio_entire_mapcount() > 1;
>
>     /*
>      * First subpage is mapped multiple times (especially also via
>      * PMDs)?
>          */
>     if (page_mapcount(folio_page(folio, 0) > 1)
>         return true;
>
>     /* TODO: also test last subpage? */
>     
>     /* Definitely shared if we're mapping a page multiple times. */
>     return folio_total_mapcount(folio) > folio_nr_pages(folio);
> }
>
> There are some more things we could optimize for.
>
>
>
> While it's better than what we have right now:
>
> (a) It's racy. Well, it's always been racy with concurrent (un)mapping
>     and splitting. But maybe we can do better.
>
> (b) folio_total_mapcount() is currently expensive.
>
> (c) there are still false negatives even without races.
>
>
> For anon pages, we could scan all subpages and test if they are PageAnonExclusive, but it's also not really what we want to do here.
I was wondering whether we could identify the cases as:
- A rough estimated mapcount is enough. In this case, we can use the
current folio_estimated_sharers() for now. For the long term, I
am with Zi Yan's proposal: maintain a total_mapcount and just use
total_mapcount > folio_nr_pages() as the estimated number.

The madvise/migration cases are identified as this type.

- Some level of accuracy is needed. Use the estimated mapcount to
filter out the obviously shared cases first, as the estimated mapcount
is correct for the shared case. Then use some heavier operations
(check whether the pages of an anon folio are PageAnonExclusive, or
use pvmw) to get a more accurate number.

COW is identified as this type.

>
>
> I have some idea to handle anon pages better to avoid any page table walk or subpage scan, improving (a), (b) and (c). It might work for pagecache pages with some more work, but it's a bit more complicated with the scheme I have in mind).
>
Great.

>
> First step would be replacing folio->_nr_pages_mapped by folio->_total_mapcount. While we could eventually have folio->_total_mapcount in addition to folio->_nr_pages_mapped, I'm, not sure if we want to go down that path
>
I saw Zi Yan shared same proposal.

>
> That would make folio_total_mapcount() extremely fast (I'm working on a prototype). The downsides are that
>
> (a) We have to make NR_ANON_MAPPED/NR_FILE_MAPPED accounting less precise. Easiest way to handle it: as soon as a single subpage is mapped, account the whole folio as mapped. After all, it's consuming memory, so who cares?
>
> (b) We have to find a different way to decide when to put an anonymous folio on the deferred split queue in page_remove_rmap(). Some cases are nasty to handle: PTE-mapped THP that are shared between a parent and a child.
>
It's nasty because it's partially mapped in the parent and partially mapped in the child? Thanks.


Regards
Yin, Fengwei

> I have to do some more thinking about (b), I'd be happy about thoughts on that.
>

2023-07-14 09:31:29

by David Hildenbrand

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On 14.07.23 10:34, Yin, Fengwei wrote:
[...]

>>> Sounds good?
>>
>> Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
> You always have very good summary of the situation. Thanks a lot for
> adding the following information.
>
> Add Zi Yan as this is still about mapcount of the folio.
>

Thanks, I thought he would have already been CCed!

>>
>>
>> For example:
>>
>> # [INFO] Anonymous memory tests in private mappings
>> # [RUN] Basic COW after fork() ... with base page
>> ok 1 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped out base page
>> ok 2 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with THP
>> ok 3 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out THP
>> ok 4 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with PTE-mapped THP
>> ok 5 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
>> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
>> ...
>>
>>
>> The commit that introduced that change is:
>>
>> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
>> Author: Vishal Moola (Oracle) <[email protected]>
>> Date:   Wed Dec 21 10:08:46 2022 -0800
>>
>>     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>>
>>     This change removes a number of calls to compound_head(), and saves
>>     1729 bytes of kernel text.
>>
>>
>>
>> folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
>>
>> page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
>>
>> folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
>>
>> (ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
>>
>>
>> So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
>>
>>
>> Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
>>
> So you want to accurate number. My understanding is that it's required for COW case.

For COW we have to take a look at the mapcount vs. the refcount:

With an order-0 page that's straightforward:
(a) Has that page never been shared (PageAnonExclusive?), then I am
the exclusive owner and can reuse.
(b) I am mapping the page and it cannot get unmapped concurrently due
to the PT lock (which implies mapcount > 0, refcount > 0). Is the
reference I am holding in fact the only reference to
the page (refcount == 1, implying mapcount == 1)? Then I am
the exclusive owner and can reuse.

Note that we don't have to perform any mapcount checks, because it's
implied by our page table mapping and the refcount.
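
Put as a hedged sketch (this is not the actual do_wp_page() logic and
the helper name is made up; it just restates (a)/(b) above, assuming
the PT lock is held):

static bool order0_anon_reuse_possible(struct page *page)
{
	/* (a) Never shared: we are the exclusive owner. */
	if (PageAnonExclusive(page))
		return true;

	/* (b) Our mapping holds the only remaining reference. */
	return page_count(page) == 1;
}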

What I want to achieve is the same for PTE-mapped THP, without scanning
page tables to detect if I am holding all references to the folio. That is:

(1) total_mapcount() == refcount AND
(2) I am responsible for all these mappings AND
(3) Subpages cannot get unmapped / shared concurrently

To make that work reliably, we might need some synchronization,
especially when multiple page tables are involved.

I previously raised tracking the "creator" of the anon page. I think we
can do better, but still have to prototype it.

[...]

>>
>> While it's better than what we have right now:
>>
>> (a) It's racy. Well, it's always been racy with concurrent (un)mapping
>>     and splitting. But maybe we can do better.
>>
>> (b) folio_total_mapcount() is currently expensive.
>>
>> (c) there are still false negatives even without races.
>>
>>
>> For anon pages, we could scan all subpages and test if they are PageAnonExclusive, but it's also not really what we want to do here.
> I was wondering whether we could identify the cases as:
> - bold estimated mapcount is enough. In this case, we can use
> current folio_estimated_sharers() for now. For long term, I
> am with Zi Yan's proposal: maintain total_mapcount and just use
> total_mapcount > folio_nr_pages() as estimated number.
>
> The madvise/migration cases are identified as this type.
>
> - Need some level accurate. Use estimated mapcount to filter out obvious
> shared case first as estimated mapcount is correct for shared case.
> Then use some heavy operations (check anon folio if pages are
> PageAnonExclusive or use pvmw) to get more accurate number.
>
> Cow is identified as this type.

I want to tackle both (at least for anon pages) using the same
mechanism. Especially, to cover the case "total_mapcount <=
folio_nr_pages()". For total_mapcount > folio_nr_pages(), it's easy.

In any case, we want an atomic total mapcount I think.

>
>>
>>
>> I have some idea to handle anon pages better to avoid any page table walk or subpage scan, improving (a), (b) and (c). It might work for pagecache pages with some more work, but it's a bit more complicated with the scheme I have in mind).
>>
> Great.
>
>>
>> First step would be replacing folio->_nr_pages_mapped by folio->_total_mapcount. While we could eventually have folio->_total_mapcount in addition to folio->_nr_pages_mapped, I'm, not sure if we want to go down that path
>>
> I saw Zi Yan shared same proposal.
>
>>
>> That would make folio_total_mapcount() extremely fast (I'm working on a prototype). The downsides are that
>>
>> (a) We have to make NR_ANON_MAPPED/NR_FILE_MAPPED accounting less precise. Easiest way to handle it: as soon as a single subpage is mapped, account the whole folio as mapped. After all, it's consuming memory, so who cares?
>>
>> (b) We have to find a different way to decide when to put an anonymous folio on the deferred split queue in page_remove_rmap(). Some cases are nasty to handle: PTE-mapped THP that are shared between a parent and a child.
>>
> It's nasty because partial mapped to parent and partial mapped to child? Thanks.

I thought about this a lot already, but let me dedicate some time here
to write it down. There are two scenarios to consider: do we still want
to use the subpage mapcount or not.

When still considering the subpage mapcount, it gets easier.


(1) We're unmapping a single subpage, the compound_mapcount == 0
and the total_mapcount > 0. If the subpage mapcount is now 0, add it
to the deferred split queue.

(2) We're unmapping a complete folio (PMD mapping / compound), the
compound_mapcount is 0 and the total_mapcount > 0.

(a) If total mapcount < folio_nr_pages, add it
to the deferred split queue.

(b) If total mapcount >= folio_nr_pages , we have to scan all subpage
mapcounts. If any subpage mapcount == 0, add it to the deferred
split queue.


(b) is a bit nasty. It would happen when we fork() with a PMD-mapped
THP, the parent splits the THP due to COW, and then our child unmaps or
splits the PMD-mapped THP (unmap easily happening during exec()).
Fortunately, we'd only scan once when unmapping the PMD.
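
As a hedged sketch of the above (folio_total_mapcount() stands in for
the proposed folio->_total_mapcount, and the helper itself is
hypothetical, not existing kernel code):

static bool should_defer_split(struct folio *folio, struct page *subpage,
			       bool unmapped_pmd)
{
	long nr = folio_nr_pages(folio);
	long i;

	if (folio_entire_mapcount(folio) || !folio_total_mapcount(folio))
		return false;

	/* Case (1): a single subpage was unmapped. */
	if (!unmapped_pmd)
		return page_mapcount(subpage) == 0;

	/* Case (2a): clearly not fully mapped anymore. */
	if (folio_total_mapcount(folio) < nr)
		return true;

	/* Case (2b): scan for any fully unmapped subpage. */
	for (i = 0; i < nr; i++)
		if (page_mapcount(folio_page(folio, i)) == 0)
			return true;

	return false;
}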


Getting rid of the subpage mapcount usage in (1) would mean that we have
to do exactly what we do in (2). But then we'd need to handle (2)(b)
differently as well.

So, for 2 (b) we would either need some other heuristic, or we add it to
the deferred split queue more frequently and let that one detect using
an rmap walk "well, every subpage is still mapped, let's abort the split".

--
Cheers,

David / dhildenb


2023-07-14 14:06:35

by Yin, Fengwei

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio



On 7/14/2023 5:25 PM, David Hildenbrand wrote:
>
> (1) We're unmapping a single subpage, the compound_mapcount == 0
>     and the total_mapcount > 0. If the subpage mapcount is now 0, add it
>     to the deferred split queue.
>
> (2) We're unmapping a complete folio (PMD mapping / compound), the
>     compound_mapcount is 0 and the total_mapcount > 0.
>
>  (a) If total mapcount < folio_nr_pages, add it
>      to the deferred split queue.
>
>  (b) If total mapcount >= folio_nr_pages , we have to scan all subpage
>      mapcounts. If any subpage mapcount == 0, add it to the deferred
>      split queue.
>
>
> (b) is a bit nasty. It would happen when we fork() with a PMD-mapped THP, the parent splits the THP due to COW, and then our child unmaps or splits the PMD-mapped THP (unmap easily happening during exec()). Fortunately, we'd only scan once when unmapping the PMD.
>
>
> Getting rid of the subpage mapcount usage in (1) would mean that we have to do exactly what we do in (2). But then we'd need to ha handle (2) (B) differently as well.
>
> So, for 2 (b) we would either need some other heuristic, or we add it to the deferred split queue more frequently and let that one detect using an rmap walk "well, every subpage is still mapped, let's abort the split".

Or another option for 2 (b): don't add it to the deferred split queue.
We know the folios in the deferred list are mainly scanned when the
system needs to reclaim memory.

Maybe it's better to let page reclaim choose the large folio here because
page reclaim will call folio_referenced(), which does rmap_walk()->pvmw().
We can reuse the rmap_walk() in folio_referenced() to detect whether there
are pages of the folio that are not mapped. If that's the case, we can
split it then.

Compared to the deferred list, the advantage is that folio_referenced()
doesn't unmap pages, while with the deferred list we would need an extra
rmap_walk() to check whether there is a page that is not mapped.

Just a thought. I could miss something here. Thanks.


Regards
Yin, Fengwei

2023-07-14 14:31:07

by David Hildenbrand

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On 14.07.23 15:58, Yin, Fengwei wrote:
>
>
> On 7/14/2023 5:25 PM, David Hildenbrand wrote:
>>
>> (1) We're unmapping a single subpage, the compound_mapcount == 0
>>     and the total_mapcount > 0. If the subpage mapcount is now 0, add it
>>     to the deferred split queue.
>>
>> (2) We're unmapping a complete folio (PMD mapping / compound), the
>>     compound_mapcount is 0 and the total_mapcount > 0.
>>
>>  (a) If total mapcount < folio_nr_pages, add it
>>      to the deferred split queue.
>>
>>  (b) If total mapcount >= folio_nr_pages , we have to scan all subpage
>>      mapcounts. If any subpage mapcount == 0, add it to the deferred
>>      split queue.
>>
>>
>> (b) is a bit nasty. It would happen when we fork() with a PMD-mapped THP, the parent splits the THP due to COW, and then our child unmaps or splits the PMD-mapped THP (unmap easily happening during exec()). Fortunately, we'd only scan once when unmapping the PMD.
>>
>>
>> Getting rid of the subpage mapcount usage in (1) would mean that we have to do exactly what we do in (2). But then we'd need to ha handle (2) (B) differently as well.
>>
>> So, for 2 (b) we would either need some other heuristic, or we add it to the deferred split queue more frequently and let that one detect using an rmap walk "well, every subpage is still mapped, let's abort the split".
>
> Or another option for 2 (b): don't add it to deferred split queue. We
> know the folio in deferred list is mainly scanned when system needs to
> reclaim memory.
>
> Maybe it's better to let page reclaim choose the large folio here because
> page reclaiming will call folio_referenced() which does rmap_walk()->pvmw().
> And we can reuse rmap_walk() in folio_referenced() to detect whether there
> are pages of folio are not mapped. If it's the case, we can split it then.
>
> Comparing to deferred list, the advantage is that folio_referenced() doesn't
> unmap page. While in deferred list, we need to add extra rmap_walk() to
> check whether there is page not mapped.

Right, I also came to the conclusion that the unmapping is undesirable.
However, one benefit of the unmap is that we know when to stop scanning
(due to page_mapped()). But maybe the temporary unmapping is actually
counter-productive.

>
> Just a thought. I could miss something here. Thanks.

Interesting idea.

I also had the thought that adding folios to the deferred split queue
when removing the rmap is semantically questionable.

Yes, we remove the rmap when zapping/unmapping a pte/pmd. But we also
(eventually only temporarily!) unmap when splitting a THP or when
installing migration entries.

Maybe we can flag folios when zapping PTEs as reasonable candidates, to
tell the page reclaim code "this one was partially zapped, maybe there
is some memory to reclaim there now". Maybe that involves the deferred
split queue.

--
Cheers,

David / dhildenb


2023-07-14 14:55:35

by Zi Yan

Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On 14 Jul 2023, at 3:31, David Hildenbrand wrote:

> On 14.07.23 05:23, Yu Zhao wrote:
>> On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>>>
>>>
>>>
>>> On 7/14/2023 10:08 AM, Yu Zhao wrote:
>>>> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>> Current madvise_cold_or_pageout_pte_range() has two problems for
>>>>> large folio support:
>>>>> - Using folio_mapcount() with large folio prevent large folio from
>>>>> picking up.
>>>>> - If large folio is in the range requested, shouldn't split it
>>>>> in madvise_cold_or_pageout_pte_range().
>>>>>
>>>>> Fix them by:
>>>>> - Use folio_estimated_sharers() with large folio
>>>>> - If large folio is in the range requested, don't split it. Leave
>>>>> to page reclaim phase.
>>>>>
>>>>> For large folio cross boundaries of requested range, skip it if it's
>>>>> page cache. Try to split it if it's anonymous folio. If splitting
>>>>> fails, skip it.
>>>>
>>>> For now, we may not want to change the existing semantic (heuristic).
>>>> IOW, we may want to stick to the "only owner" condition:
>>>>
>>>> - if (folio_mapcount(folio) != 1)
>>>> + if (folio_entire_mapcount(folio) ||
>>>> + (any_page_within_range_has_mapcount > 1))
>>>>
>>>> +Minchan Kim
>>> The folio_estimated_sharers() was discussed here:
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> Yes. It's accurate to check each page of large folio. But it may be over killed in
>>> some cases (And I think madvise is one of the cases not necessary to be accurate.
>>> So folio_estimated_sharers() is enough. Correct me if I am wrong).
>>
>> I see. Then it's possible this is also what the original commit wants
>> to do -- Minchan, could you clarify?
>>
>> Regardless, I think we can have the following fix, potentially cc'ing stable:
>>
>> - if (folio_mapcount(folio) != 1)
>> + if (folio_estimated_sharers(folio) != 1)
>>
>> Sounds good?
>
> Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
>
>
> For example:
>
> # [INFO] Anonymous memory tests in private mappings
> # [RUN] Basic COW after fork() ... with base page
> ok 1 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped out base page
> ok 2 No leak from parent into child
> # [RUN] Basic COW after fork() ... with THP
> ok 3 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out THP
> ok 4 No leak from parent into child
> # [RUN] Basic COW after fork() ... with PTE-mapped THP
> ok 5 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
> ...
>
>
> The commit that introduced that change is:
>
> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
> Author: Vishal Moola (Oracle) <[email protected]>
> Date: Wed Dec 21 10:08:46 2022 -0800
>
> madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>
> This change removes a number of calls to compound_head(), and saves
> 1729 bytes of kernel text.
>
>
>
> folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
>
> page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
>
> folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
>
> (ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
>
>
> So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
>
>
> Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
>
> The "mapped" part of it indicates that this does not catch all cases of sharing. But it should handle most of the cases we care about.
>
>
> There, we can then implement something better than what folio_estimated_sharers() currently does:
>
> static inline bool folio_maybe_mapped_shared(struct folio *folio)
> {
> 	if (likely(!folio_test_large(folio)))
> 		return atomic_read(&folio->_mapcount) > 0;
>
> 	/* Mapped multiple times via PMD? */
> 	if (folio_test_pmd_mappable(folio))
> 		return folio_entire_mapcount(folio) > 1;
>
> 	/*
> 	 * First subpage is mapped multiple times (especially also via
> 	 * PMDs)?
> 	 */
> 	if (page_mapcount(folio_page(folio, 0)) > 1)
> 		return true;
>
> 	/* TODO: also test last subpage? */
>
> 	/* Definitely shared if we're mapping a page multiple times. */
> 	return folio_total_mapcount(folio) > folio_nr_pages(folio);
> }
>
> There are some more things we could optimize for.

Before jumping into the mapcount, I would like to get some clarification
on "sharer". Does it mean a page is mapped/shared by more than one page
table entry, or mapped/shared by more than one process? Your function
indicates it is the former, but for madvise_cold_or_pageout_pte_range(),
I am not sure that is what we want. What if the user wants to page out a
page that is mapped by the same process twice? With the current method
or any of the existing proposals, it will fail undesirably. It will only
work as expected with your creator proposal[1].

Other places like auto NUMA migration also use the mapcount to check if
a page is mapped by multiple processes, with the assumption that a page
will only be mapped by each process once.


[1] https://lore.kernel.org/linux-mm/[email protected]/

--
Best Regards,
Yan, Zi



2023-07-14 15:57:24

by Yu Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On Thu, Jul 13, 2023 at 11:57 PM Yin, Fengwei <[email protected]> wrote:
>
>
> >> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> + /* Do not interfere with other mappings of this folio */
> >> + if (folio_mapcount(folio) != 1)
> >> continue;
> >>
> >> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >> - if (pte_young(ptent)) {
> >> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> - tlb->fullmm);
> >> - ptent = pte_mkold(ptent);
> >> - set_pte_at(mm, addr, pte, ptent);
> >> - tlb_remove_tlb_entry(tlb, pte, addr);
> >> - }
> >> -
> >> - /*
> >> - * We are deactivating a folio for accelerating reclaiming.
> >> - * VM couldn't reclaim the folio unless we clear PG_young.
> >> - * As a side effect, it makes confuse idle-page tracking
> >> - * because they will miss recent referenced history.
> >> - */
> >> - folio_clear_referenced(folio);
> >> - folio_test_clear_young(folio);
> >> - if (folio_test_active(folio))
> >> - folio_set_workingset(folio);
> >> +pageout_cold_folio:
> >> if (pageout) {
> >> if (folio_isolate_lru(folio)) {
> >> if (folio_test_unevictable(folio))
> >> @@ -529,8 +542,30 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >> arch_leave_lazy_mmu_mode();
> >> pte_unmap_unlock(start_pte, ptl);
> >> }
> >> - if (pageout)
> >> - reclaim_pages(&folio_list);
> >> +
> >> + if (pageout) {
> >> + LIST_HEAD(reclaim_list);
> >> +
> >> + while (!list_empty(&folio_list)) {
> >> + int refs;
> >> + unsigned long flags;
> >> + struct mem_cgroup *memcg = folio_memcg(folio);
> >> +
> >> + folio = lru_to_folio(&folio_list);
> >> + list_del(&folio->lru);
> >> +
> >> + refs = folio_referenced(folio, 0, memcg, &flags);
> >> +
> >> + if ((flags & VM_LOCKED) || (refs == -1)) {
> >> + folio_putback_lru(folio);
> >> + continue;
> >> + }
> >> +
> >> + folio_test_clear_referenced(folio);
> >> + list_add(&folio->lru, &reclaim_list);
> >> + }
> >> + reclaim_pages(&reclaim_list);
> >> + }
> >
> > i overlooked the chunk above -- it's unnecessary: after we split the
> > large folio (and splice the base folios onto the same LRU list), we
> > continue at the position of the first base folio because of:
> >
> > pte--;
> > addr -= PAGE_SIZE;
> > continue;
> >
> > And then we do pte_mkold(), which takes care of the A-bit.
> This patch moves the A-bit clearing out of the folio isolation loop. So
> even if the folio is split and the loop restarts from the first base folio,
> the A-bit is not cleared. The A-bit is only cleared in the reclaim loop.
>
> There is one option for A-bit clearing:
> - clear the A-bit of the base 4K pages in the isolation loop and leave the
> large folio A-bit clearing to the reclaim loop.
>
> This patch didn't use it because I don't want to introduce A-bit clearing
> in two places. But I am open to clearing the base 4K page A-bit in the
> isolation loop. Thanks.

Sorry but why are we trying to do multiple things in one patch that I
assumed is supposed to simply fix madvise() for large anon folios? And
none of those things seems to have a clear rationale behind it.

The only patch that makes sense at the moment (or the first patch of a
series) is what I said before:

- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)

And probably clarify (preferably in the comments above) that this is an
estimate because we think it's a better tradeoff if we do so (less
code/overhead than checking the mapcounts of the rest of the folio's pages
within the range).
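
For reference, folio_estimated_sharers() at this point is roughly the
following, i.e. a page_mapcount() of the first subpage:

static inline int folio_estimated_sharers(struct folio *folio)
{
	return page_mapcount(folio_page(folio, 0));
}

A minimal sketch of the suggested fix together with such a clarifying comment
could then look like this in madvise_cold_or_pageout_pte_range() (the comment
wording is only an illustration, not taken from any posted patch):

		/*
		 * Do not interfere with other mappings of this folio.
		 * Only the first subpage's mapcount is checked, so this is
		 * an estimate: we accept possible false negatives on sharing
		 * rather than pay for checking the mapcount of every page in
		 * the folio.
		 */
		if (folio_estimated_sharers(folio) != 1)
			continue;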

2023-07-14 16:18:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On 14.07.23 16:41, Zi Yan wrote:
> On 14 Jul 2023, at 3:31, David Hildenbrand wrote:
>
>> On 14.07.23 05:23, Yu Zhao wrote:
>>> On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 7/14/2023 10:08 AM, Yu Zhao wrote:
>>>>> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>>>>>
>>>>>> Current madvise_cold_or_pageout_pte_range() has two problems for
>>>>>> large folio support:
>>>>>> - Using folio_mapcount() with large folio prevent large folio from
>>>>>> picking up.
>>>>>> - If large folio is in the range requested, shouldn't split it
>>>>>> in madvise_cold_or_pageout_pte_range().
>>>>>>
>>>>>> Fix them by:
>>>>>> - Use folio_estimated_sharers() with large folio
>>>>>> - If large folio is in the range requested, don't split it. Leave
>>>>>> to page reclaim phase.
>>>>>>
>>>>>> For large folio cross boundaries of requested range, skip it if it's
>>>>>> page cache. Try to split it if it's anonymous folio. If splitting
>>>>>> fails, skip it.
>>>>>
>>>>> For now, we may not want to change the existing semantic (heuristic).
>>>>> IOW, we may want to stick to the "only owner" condition:
>>>>>
>>>>> - if (folio_mapcount(folio) != 1)
>>>>> + if (folio_entire_mapcount(folio) ||
>>>>> + (any_page_within_range_has_mapcount > 1))
>>>>>
>>>>> +Minchan Kim
>>>> The folio_estimated_sharers() was discussed here:
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> Yes. It's accurate to check each page of large folio. But it may be over killed in
>>>> some cases (And I think madvise is one of the cases not necessary to be accurate.
>>>> So folio_estimated_sharers() is enough. Correct me if I am wrong).
>>>
>>> I see. Then it's possible this is also what the original commit wants
>>> to do -- Minchan, could you clarify?
>>>
>>> Regardless, I think we can have the following fix, potentially cc'ing stable:
>>>
>>> - if (folio_mapcount(folio) != 1)
>>> + if (folio_estimated_sharers(folio) != 1)
>>>
>>> Sounds good?
>>
>> Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
>>
>>
>> For example:
>>
>> # [INFO] Anonymous memory tests in private mappings
>> # [RUN] Basic COW after fork() ... with base page
>> ok 1 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped out base page
>> ok 2 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with THP
>> ok 3 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out THP
>> ok 4 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with PTE-mapped THP
>> ok 5 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
>> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
>> ...
>>
>>
>> The commit that introduced that change is:
>>
>> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
>> Author: Vishal Moola (Oracle) <[email protected]>
>> Date: Wed Dec 21 10:08:46 2022 -0800
>>
>> madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>>
>> This change removes a number of calls to compound_head(), and saves
>> 1729 bytes of kernel text.
>>
>>
>>
>> folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
>>
>> page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
>>
>> folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
>>
>> (ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
>>
>>
>> So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
>>
>>
>> Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
>>
>> The "mapped" part of it indicates that this does not catch all cases of sharing. But it should handle most of the cases we care about.
>>
>>
>> There, we can then implement something better than what folio_estimated_sharers() currently does:
>>
>> static inline bool folio_maybe_mapped_shared(folio)
>> {
>> if (likely(!folio_test_large(folio)))
>> return atomic_read(&folio->_mapcount) > 0;
>>
>> /* Mapped multiple times via PMD? */
>> if (folio_test_pmd_mappable(folio)
>> return folio_entire_mapcount() > 1;
>>
>> /*
>> * First subpage is mapped multiple times (especially also via
>> * PMDs)?
>> */
>> if (page_mapcount(folio_page(folio, 0) > 1)
>> return true;
>>
>> /* TODO: also test last subpage? */
>>
>> /* Definitely shared if we're mapping a page multiple times. */
>> return folio_total_mapcount(folio) > folio_nr_pages(folio);
>> }
>>
>> There are some more things we could optimize for.
>
> Before jumping into the mapcount, I would like to get some clarification
> on "sharer". Does it mean a page is mapped/shared by more than one page
> table entry or is mapped/shared by more than one process? Your function

:) I think it depends. For an order-0 page it is "more than one page
table entry", which is what the order-0 mapcount expresses.


So let's focus on order > 0 (and keep KSM out of the picture). There, it
is different.

We're talking about "logical mapping" of the page. So a single logical
mapping == one sharer.


1) Anon pages

For the time being, it really is "mapped by one process" -> exclusively
mapped, "mapped by more than one process" -> mapped shared.

It's not "mapped into multiple page tables" or "mapped into multiple VMAs".

[That doesn't necessarily have to be that way forever -- imagine we'd
ever support a mremap(KEEP) with COW semantics -- but I think there would
be ways to handle that].


2) Pagecache pages

It really depends what we want. Either

(a) "mapped via a single logical mmap() operation"
mapped.

(b) "mapped from a single open() operation"

(c) mapped by a single process.

Of course, mremap() is weird.


Currently, with the order-0 mapcount (and we cannot really change these
semantics easily because we don't have any space to store additional
information), it would be (a).

For example, mremap(KEEP) or a second mmap() creates a new logical mapping.

But the pagecache is kind-of weird: anybody could immediately map a page
after we detected it as "exclusive" -- and eventually even concurrently.
So this detection of "shared" is inherently problematic.


My primary focus is anon pages for now (as so often :) ). Anon is hard
but eventually a bit easier to handle -- famous last words.

> indicates it is the former, but for madvise_cold_or_pageout_pte_range(),
> I am not sure that is what we want. What if user wants to page out a
> page that is mapped by the same process twice? With current method
> or any existing proposals, it will fail undesirably. It will only work
> as expect with your creator proposal[1].

For anon pages it's fairly easy IMHO. For pagecache pages, I guess we
should keep it the way it was for order-0 pages: individual logical
mmap() operations -- (a). It's sub-optimal, but the way it used to be
for order-0.


>
> Other places like auto NUMA migration also use mapcount to check if
> a page is mapped by multiple process with the assumption that a page will
> only be mapped by one process once.

Right, most of these cases might be able to use
folio_maybe_mapped_shared() to detect "currently the whole folio is
mapped exclusively by a single logical mapping". That makes a lot of
sense in most cases that want to mess with a folio (e.g., migration,
pageout, NUMA).
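
If such a helper existed, the call sites mentioned above could presumably be
switched over along these lines (just a sketch against the semantics
described above; folio_maybe_mapped_shared() is not in the tree at this
point):

		/* Only touch folios that this mapping exclusively owns. */
		if (folio_maybe_mapped_shared(folio))
			continue;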

--
Cheers,

David / dhildenb


2023-07-17 00:18:05

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio



On 7/14/2023 11:41 PM, Yu Zhao wrote:
> On Thu, Jul 13, 2023 at 11:57 PM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> + /* Do not interfere with other mappings of this folio */
>>>> + if (folio_mapcount(folio) != 1)
>>>> continue;
>>>>
>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> - if (pte_young(ptent)) {
>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> - tlb->fullmm);
>>>> - ptent = pte_mkold(ptent);
>>>> - set_pte_at(mm, addr, pte, ptent);
>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>> - }
>>>> -
>>>> - /*
>>>> - * We are deactivating a folio for accelerating reclaiming.
>>>> - * VM couldn't reclaim the folio unless we clear PG_young.
>>>> - * As a side effect, it makes confuse idle-page tracking
>>>> - * because they will miss recent referenced history.
>>>> - */
>>>> - folio_clear_referenced(folio);
>>>> - folio_test_clear_young(folio);
>>>> - if (folio_test_active(folio))
>>>> - folio_set_workingset(folio);
>>>> +pageout_cold_folio:
>>>> if (pageout) {
>>>> if (folio_isolate_lru(folio)) {
>>>> if (folio_test_unevictable(folio))
>>>> @@ -529,8 +542,30 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>> arch_leave_lazy_mmu_mode();
>>>> pte_unmap_unlock(start_pte, ptl);
>>>> }
>>>> - if (pageout)
>>>> - reclaim_pages(&folio_list);
>>>> +
>>>> + if (pageout) {
>>>> + LIST_HEAD(reclaim_list);
>>>> +
>>>> + while (!list_empty(&folio_list)) {
>>>> + int refs;
>>>> + unsigned long flags;
>>>> + struct mem_cgroup *memcg = folio_memcg(folio);
>>>> +
>>>> + folio = lru_to_folio(&folio_list);
>>>> + list_del(&folio->lru);
>>>> +
>>>> + refs = folio_referenced(folio, 0, memcg, &flags);
>>>> +
>>>> + if ((flags & VM_LOCKED) || (refs == -1)) {
>>>> + folio_putback_lru(folio);
>>>> + continue;
>>>> + }
>>>> +
>>>> + folio_test_clear_referenced(folio);
>>>> + list_add(&folio->lru, &reclaim_list);
>>>> + }
>>>> + reclaim_pages(&reclaim_list);
>>>> + }
>>>
>>> i overlooked the chunk above -- it's unnecessary: after we split the
>>> large folio (and splice the base folios onto the same LRU list), we
>>> continue at the position of the first base folio because of:
>>>
>>> pte--;
>>> addr -= PAGE_SIZE;
>>> continue;
>>>
>>> And then we do pte_mkold(), which takes care of the A-bit.
>> This patch moves the A-bit clear out of the folio isolation loop. So
>> even the folio is split and loop restarts from the first base folio,
>> the A-bit is not cleared. A-bit is only cleared in reclaim loop.
>>
>> There is one option for A-bit clearing:
>> - clear A-bit of base 4K page in isolation loop and leave large folio
>> A-bit clearing to reclaim loop.
>>
>> This patch didn't use it because don't want to introduce A-bit clearing
>> in two places. But I am open about clearing base 4K page A-bit cleared in
>> isolation loop. Thanks.
>
> Sorry but why are we trying to do multiple things in one patch that I
> assumed is supposed to simply fix madvise() for large anon folios? And
> none of those things seems to have a clear rationale behind it.
>
> The only patch that makes sense at the moment (or the first patch of a
> series) is what I said before:
>
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)
Definitely. As I replied to you, I will split the patch into two parts:
- just bug fixing, including the folio_mapcount() -> folio_estimated_sharers()
change, and using ptep_clear_flush_young_notify() to clear the young bit of
the PTEs (a rough sketch follows below).
- refactoring for large folios.
Let me know if this is OK. Thanks.
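
A rough sketch of the young-bit clearing part (editor's illustration of the
idea, not a posted patch; ptep_clear_flush_young_notify() calls the MMU
notifiers and may flush the TLB immediately on some architectures instead of
batching through the mmu_gather, which is part of what a separate patch would
need to justify):

		/*
		 * Clear the accessed bit through the notifier-aware helper
		 * instead of open-coding ptep_get_and_clear_full() +
		 * pte_mkold() + set_pte_at() + tlb_remove_tlb_entry().
		 */
		if (pte_young(ptent))
			ptep_clear_flush_young_notify(vma, addr, pte);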


Regards
Yin, Fengwei

>
> And probably clarify (preferrably in the comments above) this is an
> estimate because we think it's a better tradeoff if we do so (less
> code/overhead from checking the mapcounts of the rest of folios within
> the range).

2023-07-17 01:07:15

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio



On 7/14/2023 10:41 PM, Zi Yan wrote:
> On 14 Jul 2023, at 3:31, David Hildenbrand wrote:
>
>> On 14.07.23 05:23, Yu Zhao wrote:
>>> On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 7/14/2023 10:08 AM, Yu Zhao wrote:
>>>>> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>>>>>
>>>>>> Current madvise_cold_or_pageout_pte_range() has two problems for
>>>>>> large folio support:
>>>>>> - Using folio_mapcount() with large folio prevent large folio from
>>>>>> picking up.
>>>>>> - If large folio is in the range requested, shouldn't split it
>>>>>> in madvise_cold_or_pageout_pte_range().
>>>>>>
>>>>>> Fix them by:
>>>>>> - Use folio_estimated_sharers() with large folio
>>>>>> - If large folio is in the range requested, don't split it. Leave
>>>>>> to page reclaim phase.
>>>>>>
>>>>>> For large folio cross boundaries of requested range, skip it if it's
>>>>>> page cache. Try to split it if it's anonymous folio. If splitting
>>>>>> fails, skip it.
>>>>>
>>>>> For now, we may not want to change the existing semantic (heuristic).
>>>>> IOW, we may want to stick to the "only owner" condition:
>>>>>
>>>>> - if (folio_mapcount(folio) != 1)
>>>>> + if (folio_entire_mapcount(folio) ||
>>>>> + (any_page_within_range_has_mapcount > 1))
>>>>>
>>>>> +Minchan Kim
>>>> The folio_estimated_sharers() was discussed here:
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> Yes. It's accurate to check each page of large folio. But it may be over killed in
>>>> some cases (And I think madvise is one of the cases not necessary to be accurate.
>>>> So folio_estimated_sharers() is enough. Correct me if I am wrong).
>>>
>>> I see. Then it's possible this is also what the original commit wants
>>> to do -- Minchan, could you clarify?
>>>
>>> Regardless, I think we can have the following fix, potentially cc'ing stable:
>>>
>>> - if (folio_mapcount(folio) != 1)
>>> + if (folio_estimated_sharers(folio) != 1)
>>>
>>> Sounds good?
>>
>> Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
>>
>>
>> For example:
>>
>> # [INFO] Anonymous memory tests in private mappings
>> # [RUN] Basic COW after fork() ... with base page
>> ok 1 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped out base page
>> ok 2 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with THP
>> ok 3 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out THP
>> ok 4 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with PTE-mapped THP
>> ok 5 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
>> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
>> ...
>>
>>
>> The commit that introduced that change is:
>>
>> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
>> Author: Vishal Moola (Oracle) <[email protected]>
>> Date: Wed Dec 21 10:08:46 2022 -0800
>>
>> madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>>
>> This change removes a number of calls to compound_head(), and saves
>> 1729 bytes of kernel text.
>>
>>
>>
>> folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
>>
>> page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
>>
>> folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
>>
>> (ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
>>
>>
>> So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
>>
>>
>> Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
>>
>> The "mapped" part of it indicates that this does not catch all cases of sharing. But it should handle most of the cases we care about.
>>
>>
>> There, we can then implement something better than what folio_estimated_sharers() currently does:
>>
>> static inline bool folio_maybe_mapped_shared(folio)
>> {
>> if (likely(!folio_test_large(folio)))
>> return atomic_read(&folio->_mapcount) > 0;
>>
>> /* Mapped multiple times via PMD? */
>> if (folio_test_pmd_mappable(folio)
>> return folio_entire_mapcount() > 1;
>>
>> /*
>> * First subpage is mapped multiple times (especially also via
>> * PMDs)?
>> */
>> if (page_mapcount(folio_page(folio, 0) > 1)
>> return true;
>>
>> /* TODO: also test last subpage? */
>>
>> /* Definitely shared if we're mapping a page multiple times. */
>> return folio_total_mapcount(folio) > folio_nr_pages(folio);
>> }
>>
>> There are some more things we could optimize for.
>
> Before jumping into the mapcount, I would like to get some clarification
> on "sharer". Does it mean a page is mapped/shared by more than one page
> table entry or is mapped/shared by more than one process? Your function
> indicates it is the former, but for madvise_cold_or_pageout_pte_range(),
> I am not sure that is what we want. What if user wants to page out a
> page that is mapped by the same process twice? With current method
> or any existing proposals, it will fail undesirably. It will only work
> as expect with your creator proposal[1].
My understanding here is more than one page table entry, based on:
- The original code uses mapcount() for a normal 4K page, and that is a
per-page-table-entry count.
- If the user wants to page out a page mapped twice into the same process,
madvise_pageout() could simply be called twice, once for each mapping address.
- Reclaim has a higher chance of failing if a page is mapped by more than one
page table entry, and the page has a higher chance of being faulted in again.


Regards
Yin, Fengwei

>
> Other places like auto NUMA migration also use mapcount to check if
> a page is mapped by multiple process with the assumption that a page will
> only be mapped by one process once.
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> --
> Best Regards,
> Yan, Zi

2023-07-17 15:33:25

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On 16 Jul 2023, at 20:15, Yin, Fengwei wrote:

> On 7/14/2023 10:41 PM, Zi Yan wrote:
>> On 14 Jul 2023, at 3:31, David Hildenbrand wrote:
>>
>>> On 14.07.23 05:23, Yu Zhao wrote:
>>>> On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 7/14/2023 10:08 AM, Yu Zhao wrote:
>>>>>> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>>>>>>
>>>>>>> Current madvise_cold_or_pageout_pte_range() has two problems for
>>>>>>> large folio support:
>>>>>>> - Using folio_mapcount() with large folio prevent large folio from
>>>>>>> picking up.
>>>>>>> - If large folio is in the range requested, shouldn't split it
>>>>>>> in madvise_cold_or_pageout_pte_range().
>>>>>>>
>>>>>>> Fix them by:
>>>>>>> - Use folio_estimated_sharers() with large folio
>>>>>>> - If large folio is in the range requested, don't split it. Leave
>>>>>>> to page reclaim phase.
>>>>>>>
>>>>>>> For large folio cross boundaries of requested range, skip it if it's
>>>>>>> page cache. Try to split it if it's anonymous folio. If splitting
>>>>>>> fails, skip it.
>>>>>>
>>>>>> For now, we may not want to change the existing semantic (heuristic).
>>>>>> IOW, we may want to stick to the "only owner" condition:
>>>>>>
>>>>>> - if (folio_mapcount(folio) != 1)
>>>>>> + if (folio_entire_mapcount(folio) ||
>>>>>> + (any_page_within_range_has_mapcount > 1))
>>>>>>
>>>>>> +Minchan Kim
>>>>> The folio_estimated_sharers() was discussed here:
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> Yes. It's accurate to check each page of large folio. But it may be over killed in
>>>>> some cases (And I think madvise is one of the cases not necessary to be accurate.
>>>>> So folio_estimated_sharers() is enough. Correct me if I am wrong).
>>>>
>>>> I see. Then it's possible this is also what the original commit wants
>>>> to do -- Minchan, could you clarify?
>>>>
>>>> Regardless, I think we can have the following fix, potentially cc'ing stable:
>>>>
>>>> - if (folio_mapcount(folio) != 1)
>>>> + if (folio_estimated_sharers(folio) != 1)
>>>>
>>>> Sounds good?
>>>
>>> Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
>>>
>>>
>>> For example:
>>>
>>> # [INFO] Anonymous memory tests in private mappings
>>> # [RUN] Basic COW after fork() ... with base page
>>> ok 1 No leak from parent into child
>>> # [RUN] Basic COW after fork() ... with swapped out base page
>>> ok 2 No leak from parent into child
>>> # [RUN] Basic COW after fork() ... with THP
>>> ok 3 No leak from parent into child
>>> # [RUN] Basic COW after fork() ... with swapped-out THP
>>> ok 4 No leak from parent into child
>>> # [RUN] Basic COW after fork() ... with PTE-mapped THP
>>> ok 5 No leak from parent into child
>>> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
>>> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
>>> ...
>>>
>>>
>>> The commit that introduced that change is:
>>>
>>> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
>>> Author: Vishal Moola (Oracle) <[email protected]>
>>> Date: Wed Dec 21 10:08:46 2022 -0800
>>>
>>> madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>>>
>>> This change removes a number of calls to compound_head(), and saves
>>> 1729 bytes of kernel text.
>>>
>>>
>>>
>>> folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
>>>
>>> page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
>>>
>>> folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
>>>
>>> (ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
>>>
>>>
>>> So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
>>>
>>>
>>> Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
>>>
>>> The "mapped" part of it indicates that this does not catch all cases of sharing. But it should handle most of the cases we care about.
>>>
>>>
>>> There, we can then implement something better than what folio_estimated_sharers() currently does:
>>>
>>> static inline bool folio_maybe_mapped_shared(folio)
>>> {
>>> if (likely(!folio_test_large(folio)))
>>> return atomic_read(&folio->_mapcount) > 0;
>>>
>>> /* Mapped multiple times via PMD? */
>>> if (folio_test_pmd_mappable(folio)
>>> return folio_entire_mapcount() > 1;
>>>
>>> /*
>>> * First subpage is mapped multiple times (especially also via
>>> * PMDs)?
>>> */
>>> if (page_mapcount(folio_page(folio, 0) > 1)
>>> return true;
>>>
>>> /* TODO: also test last subpage? */
>>>
>>> /* Definitely shared if we're mapping a page multiple times. */
>>> return folio_total_mapcount(folio) > folio_nr_pages(folio);
>>> }
>>>
>>> There are some more things we could optimize for.
>>
>> Before jumping into the mapcount, I would like to get some clarification
>> on "sharer". Does it mean a page is mapped/shared by more than one page
>> table entry or is mapped/shared by more than one process? Your function
>> indicates it is the former, but for madvise_cold_or_pageout_pte_range(),
>> I am not sure that is what we want. What if user wants to page out a
>> page that is mapped by the same process twice? With current method
>> or any existing proposals, it will fail undesirably. It will only work
>> as expect with your creator proposal[1].
> My understanding here is more than one page table entry based on:
> - Original code use mapcount() for normal 4K page. It's for page table.
Yes.

> - If user wants to page out a page mapped to same process twice, maybe
> madvise_pageout() should be called twice for different mapping address.

That could be a reasonable way of handling it, but the current code bails out
once it finds a folio that is mapped more than once. This means a process cannot
page out any page mapped twice, no matter whether it is mapped by the same
process or not. Maybe it is worth some clarification in the manpage for this behavior.
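
As a concrete illustration of that behavior (editor's sketch, not from the
thread): a single process that maps the same file twice ends up with a
mapcount of 2 on those pages, so MADV_PAGEOUT silently skips them under the
current check even though no other process is involved:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;
	/* "testfile" is assumed to exist and to be at least one page long. */
	int fd = open("testfile", O_RDONLY);

	if (fd < 0)
		return 1;

	/* Two mappings of the same file page within one process. */
	char *a = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	char *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (a == MAP_FAILED || b == MAP_FAILED)
		return 1;

	/* Fault both mappings in so the page is mapped by two PTEs. */
	volatile char x = a[0] + b[0];
	(void)x;

	/*
	 * The page is skipped: its mapcount is 2, even though both mappings
	 * belong to this one process. madvise() still returns 0, so the
	 * caller cannot tell.
	 */
	if (madvise(a, len, MADV_PAGEOUT))
		perror("madvise");

	munmap(a, len);
	munmap(b, len);
	close(fd);
	return 0;
}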


> - It has more chance to fail reclaiming if pages mapped twice in page
> table entry. And more chance to be faulted in again.
Only if madvise_cold_or_pageout_pte_range() unmaps the page regardless of
its mapcount.


--
Best Regards,
Yan, Zi



2023-07-17 16:52:45

by Yu Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio

On Sun, Jul 16, 2023 at 5:52 PM Yin, Fengwei <[email protected]> wrote:
>
> On 7/14/2023 11:41 PM, Yu Zhao wrote:
> > On Thu, Jul 13, 2023 at 11:57 PM Yin, Fengwei <[email protected]> wrote:
> >>
> >>
> >>>> - if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> + /* Do not interfere with other mappings of this folio */
> >>>> + if (folio_mapcount(folio) != 1)
> >>>> continue;
> >>>>
> >>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> - if (pte_young(ptent)) {
> >>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> - tlb->fullmm);
> >>>> - ptent = pte_mkold(ptent);
> >>>> - set_pte_at(mm, addr, pte, ptent);
> >>>> - tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> - }
> >>>> -
> >>>> - /*
> >>>> - * We are deactivating a folio for accelerating reclaiming.
> >>>> - * VM couldn't reclaim the folio unless we clear PG_young.
> >>>> - * As a side effect, it makes confuse idle-page tracking
> >>>> - * because they will miss recent referenced history.
> >>>> - */
> >>>> - folio_clear_referenced(folio);
> >>>> - folio_test_clear_young(folio);
> >>>> - if (folio_test_active(folio))
> >>>> - folio_set_workingset(folio);
> >>>> +pageout_cold_folio:
> >>>> if (pageout) {
> >>>> if (folio_isolate_lru(folio)) {
> >>>> if (folio_test_unevictable(folio))
> >>>> @@ -529,8 +542,30 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>> arch_leave_lazy_mmu_mode();
> >>>> pte_unmap_unlock(start_pte, ptl);
> >>>> }
> >>>> - if (pageout)
> >>>> - reclaim_pages(&folio_list);
> >>>> +
> >>>> + if (pageout) {
> >>>> + LIST_HEAD(reclaim_list);
> >>>> +
> >>>> + while (!list_empty(&folio_list)) {
> >>>> + int refs;
> >>>> + unsigned long flags;
> >>>> + struct mem_cgroup *memcg = folio_memcg(folio);
> >>>> +
> >>>> + folio = lru_to_folio(&folio_list);
> >>>> + list_del(&folio->lru);
> >>>> +
> >>>> + refs = folio_referenced(folio, 0, memcg, &flags);
> >>>> +
> >>>> + if ((flags & VM_LOCKED) || (refs == -1)) {
> >>>> + folio_putback_lru(folio);
> >>>> + continue;
> >>>> + }
> >>>> +
> >>>> + folio_test_clear_referenced(folio);
> >>>> + list_add(&folio->lru, &reclaim_list);
> >>>> + }
> >>>> + reclaim_pages(&reclaim_list);
> >>>> + }
> >>>
> >>> i overlooked the chunk above -- it's unnecessary: after we split the
> >>> large folio (and splice the base folios onto the same LRU list), we
> >>> continue at the position of the first base folio because of:
> >>>
> >>> pte--;
> >>> addr -= PAGE_SIZE;
> >>> continue;
> >>>
> >>> And then we do pte_mkold(), which takes care of the A-bit.
> >> This patch moves the A-bit clear out of the folio isolation loop. So
> >> even the folio is split and loop restarts from the first base folio,
> >> the A-bit is not cleared. A-bit is only cleared in reclaim loop.
> >>
> >> There is one option for A-bit clearing:
> >> - clear A-bit of base 4K page in isolation loop and leave large folio
> >> A-bit clearing to reclaim loop.
> >>
> >> This patch didn't use it because don't want to introduce A-bit clearing
> >> in two places. But I am open about clearing base 4K page A-bit cleared in
> >> isolation loop. Thanks.
> >
> > Sorry but why are we trying to do multiple things in one patch that I
> > assumed is supposed to simply fix madvise() for large anon folios? And
> > none of those things seems to have a clear rationale behind it.
> >
> > The only patch that makes sense at the moment (or the first patch of a
> > series) is what I said before:
> >
> > - if (folio_mapcount(folio) != 1)
> > + if (folio_estimated_sharers(folio) != 1)
> Definitely. As I replied to you, I will split the patch into two parts:
> - just bug fixing, including the folio_mapcount() -> folio_estimated_sharers() change,

I'm onboard with this fix.

> and using ptep_clear_flush_young_notify() to clear the young bit of the PTEs.

This is another fix (if it's a real problem), hence a separate patch.

> - refactor for large folio.

Minchan will look at the last two.

> Let me know if this is OK. Thanks.

SGTM. Thanks.

2023-07-18 00:20:28

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio



On 7/17/23 22:38, Zi Yan wrote:
> On 16 Jul 2023, at 20:15, Yin, Fengwei wrote:
>
>> On 7/14/2023 10:41 PM, Zi Yan wrote:
>>> On 14 Jul 2023, at 3:31, David Hildenbrand wrote:
>>>
>>>> On 14.07.23 05:23, Yu Zhao wrote:
>>>>> On Thu, Jul 13, 2023 at 9:10 PM Yin, Fengwei <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 7/14/2023 10:08 AM, Yu Zhao wrote:
>>>>>>> On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Current madvise_cold_or_pageout_pte_range() has two problems for
>>>>>>>> large folio support:
>>>>>>>> - Using folio_mapcount() with large folio prevent large folio from
>>>>>>>> picking up.
>>>>>>>> - If large folio is in the range requested, shouldn't split it
>>>>>>>> in madvise_cold_or_pageout_pte_range().
>>>>>>>>
>>>>>>>> Fix them by:
>>>>>>>> - Use folio_estimated_sharers() with large folio
>>>>>>>> - If large folio is in the range requested, don't split it. Leave
>>>>>>>> to page reclaim phase.
>>>>>>>>
>>>>>>>> For large folio cross boundaries of requested range, skip it if it's
>>>>>>>> page cache. Try to split it if it's anonymous folio. If splitting
>>>>>>>> fails, skip it.
>>>>>>>
>>>>>>> For now, we may not want to change the existing semantic (heuristic).
>>>>>>> IOW, we may want to stick to the "only owner" condition:
>>>>>>>
>>>>>>> - if (folio_mapcount(folio) != 1)
>>>>>>> + if (folio_entire_mapcount(folio) ||
>>>>>>> + (any_page_within_range_has_mapcount > 1))
>>>>>>>
>>>>>>> +Minchan Kim
>>>>>> The folio_estimated_sharers() was discussed here:
>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>
>>>>>> Yes. It's accurate to check each page of large folio. But it may be over killed in
>>>>>> some cases (And I think madvise is one of the cases not necessary to be accurate.
>>>>>> So folio_estimated_sharers() is enough. Correct me if I am wrong).
>>>>>
>>>>> I see. Then it's possible this is also what the original commit wants
>>>>> to do -- Minchan, could you clarify?
>>>>>
>>>>> Regardless, I think we can have the following fix, potentially cc'ing stable:
>>>>>
>>>>> - if (folio_mapcount(folio) != 1)
>>>>> + if (folio_estimated_sharers(folio) != 1)
>>>>>
>>>>> Sounds good?
>>>>
>>>> Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
>>>>
>>>>
>>>> For example:
>>>>
>>>> # [INFO] Anonymous memory tests in private mappings
>>>> # [RUN] Basic COW after fork() ... with base page
>>>> ok 1 No leak from parent into child
>>>> # [RUN] Basic COW after fork() ... with swapped out base page
>>>> ok 2 No leak from parent into child
>>>> # [RUN] Basic COW after fork() ... with THP
>>>> ok 3 No leak from parent into child
>>>> # [RUN] Basic COW after fork() ... with swapped-out THP
>>>> ok 4 No leak from parent into child
>>>> # [RUN] Basic COW after fork() ... with PTE-mapped THP
>>>> ok 5 No leak from parent into child
>>>> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
>>>> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
>>>> ...
>>>>
>>>>
>>>> The commit that introduced that change is:
>>>>
>>>> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
>>>> Author: Vishal Moola (Oracle) <[email protected]>
>>>> Date: Wed Dec 21 10:08:46 2022 -0800
>>>>
>>>> madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>>>>
>>>> This change removes a number of calls to compound_head(), and saves
>>>> 1729 bytes of kernel text.
>>>>
>>>>
>>>>
>>>> folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
>>>>
>>>> page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
>>>>
>>>> folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
>>>>
>>>> (ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
>>>>
>>>>
>>>> So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
>>>>
>>>>
>>>> Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
>>>>
>>>> The "mapped" part of it indicates that this does not catch all cases of sharing. But it should handle most of the cases we care about.
>>>>
>>>>
>>>> There, we can then implement something better than what folio_estimated_sharers() currently does:
>>>>
>>>> static inline bool folio_maybe_mapped_shared(folio)
>>>> {
>>>> if (likely(!folio_test_large(folio)))
>>>> return atomic_read(&folio->_mapcount) > 0;
>>>>
>>>> /* Mapped multiple times via PMD? */
>>>> if (folio_test_pmd_mappable(folio)
>>>> return folio_entire_mapcount() > 1;
>>>>
>>>> /*
>>>> * First subpage is mapped multiple times (especially also via
>>>> * PMDs)?
>>>> */
>>>> if (page_mapcount(folio_page(folio, 0) > 1)
>>>> return true;
>>>>
>>>> /* TODO: also test last subpage? */
>>>>
>>>> /* Definitely shared if we're mapping a page multiple times. */
>>>> return folio_total_mapcount(folio) > folio_nr_pages(folio);
>>>> }
>>>>
>>>> There are some more things we could optimize for.
>>>
>>> Before jumping into the mapcount, I would like to get some clarification
>>> on "sharer". Does it mean a page is mapped/shared by more than one page
>>> table entry or is mapped/shared by more than one process? Your function
>>> indicates it is the former, but for madvise_cold_or_pageout_pte_range(),
>>> I am not sure that is what we want. What if user wants to page out a
>>> page that is mapped by the same process twice? With current method
>>> or any existing proposals, it will fail undesirably. It will only work
>>> as expect with your creator proposal[1].
>> My understanding here is more than one page table entry based on:
>> - Original code use mapcount() for normal 4K page. It's for page table.
> Yes.
>
>> - If user wants to page out a page mapped to same process twice, maybe
>> madvise_pageout() should be called twice for different mapping address.
>
> That could be a reasonable way of handing it, but current code bails out
> once it finds a folio is mapped more than once. This means a process cannot
> page out any page mapped twice no matter if it is mapped to the same process
> or not. Maybe it is worth some clarification in the manpage for this behavior.
This was just a thought I had when reading this part of the code, especially
the mapcount constraint: if the mapcount is larger than 1, we could still
unmap the page from the VMA requested by this syscall invocation.

But I lack understanding of the use cases of this syscall, so it may
be totally wrong.

>
>
>> - It has more chance to fail reclaiming if pages mapped twice in page
>> table entry. And more chance to be faulted in again.
> Only if madvise_cold_or_pageout_pte_range() unmaps the page regardless of
> its mapcount.
This is just my guess as to why only the mapcount == 1 case is handled.


Regards
Yin, Fengwei

>
>
> --
> Best Regards,
> Yan, Zi