2023-08-08 03:18:05

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH v2 0/3] don't use mapcount() to check large folio sharing

In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.

Use folio_estimated_sharers() here as the estimated number is enough.


This patchset will fix the cases:
User space application call madvise() with MADV_FREE, MADV_COLD and
MADV_PAGEOUT for specific address range. There are THP mapped to the
range. Without the patchset, the THP is skipped. With the patch, the
THP will be split and handled accordingly.

David reported the cow self test skip some cases because of
MADV_PAGEOUT skip THP:
https://lore.kernel.org/linux-mm/[email protected]/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81
and I confirmed this patchset make it work again.


Changelog from v1:
- Avoid two Fixes tags make backport harder. Thank Andrew for pointing
this out.

- Add note section to mention this is a temporary fix which is fine
to reduce user-visble effects. For long term fix, we should wait for
David's solution. Thank Ryan and David for pointing this out.

- Spell user-visible effects out. Then people could decide whether
these patches are necessary for stable branch. Thank Andrew for
pointing this out.

V1:
https://lore.kernel.org/linux-mm/[email protected]/T/#med74caad0cbd0049641cfddc5b9fe793b4b50276

Yin Fengwei (3):
madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount()
against large folio for sharing check
madvise:madvise_free_huge_pmd(): don't use mapcount() against large
folio for sharing check
madvise:madvise_free_pte_range(): don't use mapcount() against large
folio for sharing check

mm/huge_memory.c | 2 +-
mm/madvise.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)

--
2.39.2



2023-08-08 03:26:19

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] don't use mapcount() to check large folio sharing

On Mon, Aug 7, 2023 at 8:10 PM Yin Fengwei <[email protected]> wrote:
>
> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> folio_mapcount() is used to check whether the folio is shared. But it's
> not correct as folio_mapcount() returns total mapcount of large folio.
>
> Use folio_estimated_sharers() here as the estimated number is enough.
>
>
> This patchset will fix the cases:
> User space application call madvise() with MADV_FREE, MADV_COLD and
> MADV_PAGEOUT for specific address range. There are THP mapped to the
> range. Without the patchset, the THP is skipped. With the patch, the
> THP will be split and handled accordingly.
>
> David reported the cow self test skip some cases because of
> MADV_PAGEOUT skip THP:
> https://lore.kernel.org/linux-mm/[email protected]/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81
> and I confirmed this patchset make it work again.
>
>
> Changelog from v1:
> - Avoid two Fixes tags make backport harder. Thank Andrew for pointing
> this out.
>
> - Add note section to mention this is a temporary fix which is fine
> to reduce user-visble effects. For long term fix, we should wait for
> David's solution. Thank Ryan and David for pointing this out.
>
> - Spell user-visible effects out. Then people could decide whether
> these patches are necessary for stable branch. Thank Andrew for
> pointing this out.

LGTM, thank you.

2023-08-08 15:37:33

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH v2 3/3] madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check

Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a
folio") replaced the page_mapcount() with folio_mapcount() to check
whether the folio is shared by other mapping.

It's not correct for large folios. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.

Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.

User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.

NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.

Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
Cc: [email protected]
Signed-off-by: Yin Fengwei <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 49af35e2d99a..4dded5d27e7e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;

- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (!folio_trylock(folio))
break;
--
2.39.2


2023-08-08 16:23:21

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH v2 2/3] madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check

Commit fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to
use a folio") replaced the page_mapcount() with folio_mapcount() to check
whether the folio is shared by other mapping.

It's not correct for large folios. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.

Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.

User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.

NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.

Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
Cc: [email protected]
Signed-off-by: Yin Fengwei <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 154c210892a1..0b709d2c46c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1612,7 +1612,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* If other processes are mapping this folio, we couldn't discard
* the folio unless they all do MADV_FREE so let's skip the folio.
*/
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto out;

if (!folio_trylock(folio))
--
2.39.2


2023-08-08 16:30:29

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check

On Mon, Aug 7, 2023 at 8:11 PM Yin Fengwei <[email protected]> wrote:
>
> Commit fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to
> use a folio") replaced the page_mapcount() with folio_mapcount() to check
> whether the folio is shared by other mapping.
>
> It's not correct for large folios. folio_mapcount() returns the total
> mapcount of large folio which is not suitable to detect whether the folio
> is shared.
>
> Use folio_estimated_sharers() which returns a estimated number of shares.
> That means it's not 100% correct. It should be OK for madvise case here.
>
> User-visible effects is that the THP is skipped when user call madvise.
> But the correct behavior is THP should be split and processed then.
>
> NOTE: this change is a temporary fix to reduce the user-visible effects
> before the long term fix from David is ready.
>
> Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
> Cc: [email protected]

Andrew, this one isn't really a bug fix but an optimization, so please
feel free to drop the Fixes and Cc tags above. (It seems to me it
doesn't hurt for stable to take it though.)

Thank you.


> Signed-off-by: Yin Fengwei <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Ryan Roberts <[email protected]>
> ---
> mm/huge_memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 154c210892a1..0b709d2c46c6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1612,7 +1612,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> * If other processes are mapping this folio, we couldn't discard
> * the folio unless they all do MADV_FREE so let's skip the folio.
> */
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)
> goto out;
>
> if (!folio_trylock(folio))
> --
> 2.39.2
>

2023-08-08 17:35:16

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] don't use mapcount() to check large folio sharing



On 8/8/2023 10:43 AM, Yu Zhao wrote:
> On Mon, Aug 7, 2023 at 8:10 PM Yin Fengwei <[email protected]> wrote:
>>
>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>> folio_mapcount() is used to check whether the folio is shared. But it's
>> not correct as folio_mapcount() returns total mapcount of large folio.
>>
>> Use folio_estimated_sharers() here as the estimated number is enough.
>>
>>
>> This patchset will fix the cases:
>> User space application call madvise() with MADV_FREE, MADV_COLD and
>> MADV_PAGEOUT for specific address range. There are THP mapped to the
>> range. Without the patchset, the THP is skipped. With the patch, the
>> THP will be split and handled accordingly.
>>
>> David reported the cow self test skip some cases because of
>> MADV_PAGEOUT skip THP:
>> https://lore.kernel.org/linux-mm/[email protected]/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81
>> and I confirmed this patchset make it work again.
>>
>>
>> Changelog from v1:
>> - Avoid two Fixes tags make backport harder. Thank Andrew for pointing
>> this out.
>>
>> - Add note section to mention this is a temporary fix which is fine
>> to reduce user-visble effects. For long term fix, we should wait for
>> David's solution. Thank Ryan and David for pointing this out.
>>
>> - Spell user-visible effects out. Then people could decide whether
>> these patches are necessary for stable branch. Thank Andrew for
>> pointing this out.
>
> LGTM, thank you.
Thanks a lot for looking at this patchset.


Regards
Yin, Fengwei

2023-08-19 14:38:20

by Daniel Gomez

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check

On Wed, Aug 16, 2023 at 08:04:11PM +0800, Yin, Fengwei wrote:
>
>
> On 8/16/2023 7:44 PM, Daniel Gomez wrote:
> > On Wed, Aug 16, 2023 at 07:30:35AM +0800, Yin Fengwei wrote:
> >>
> >>
> >> On 8/15/23 21:25, Daniel Gomez wrote:
> >>> Hi Yin,
> >>> On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
> >>>> Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a
> >>>> folio") replaced the page_mapcount() with folio_mapcount() to check
> >>>> whether the folio is shared by other mapping.
> >>>>
> >>>> It's not correct for large folios. folio_mapcount() returns the total
> >>>> mapcount of large folio which is not suitable to detect whether the folio
> >>>> is shared.
> >>>>
> >>>> Use folio_estimated_sharers() which returns a estimated number of shares.
> >>>> That means it's not 100% correct. It should be OK for madvise case here.
> >>>
> >>> I'm trying to understand why it should be ok for madvise this change, so
> >>> I hope it's okay to ask you few questions.
> >>>
> >>> folio_mapcount() calculates the total maps for all the subpages of a
> >>> folio. However, the folio_estimated_sharers does it only for the first
> >>> subpage making it not true for large folios. Then, wouldn't this change
> >>> drop support for large folios?
> >> I saw David explained this very well in another mail.
> >>
> >>>
> >>> Seems like folio_entire_mapcount() is not accurate either because of it
> >>> does not inclue PTE-mapped sub-pages which I think we need here. Hence,
> >>> the folio_mapcount(). Could this be something missing in the test side?
> >>>
> >>> I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but
> >>> seems like I'm not able to do it:
> >>>
> >>> ./cow
> >>> # [INFO] detected THP size: 2048 KiB
> >>> # [INFO] detected hugetlb size: 2048 KiB
> >>> # [INFO] detected hugetlb size: 1048576 KiB
> >>> # [INFO] huge zeropage is enabled
> >>> TAP version 13
> >>> 1..166
> >>> # [INFO] Anonymous memory tests in private mappings
> >>> # [RUN] Basic COW after fork() ... with base page
> >>> not ok 1 MADV_NOHUGEPAGE failed
> >>> # [RUN] Basic COW after fork() ... with swapped out base page
> >>> not ok 2 MADV_NOHUGEPAGE failed
> >>> # [RUN] Basic COW after fork() ... with THP
> >>> not ok 3 MADV_HUGEPAGE failed
> >>> # [RUN] Basic COW after fork() ... with swapped-out THP
> >>> not ok 4 MADV_HUGEPAGE failed
> >>> # [RUN] Basic COW after fork() ... with PTE-mapped THP
> >>> not ok 5 MADV_HUGEPAGE failed
> >>> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
> >>> not ok 6 MADV_HUGEPAGE failed
> >>> ...
> >> Can you post the MADV_PAGEOUT and PTE-mapped THP related testing result?
> >> And I suppose swap need be enabled also for the testing.
> >
> > You may find a dump of the logs in the link below with system information. Let me
> > know if you find something wrong in my setup or if you need something else.
> > Besides CONFIG_TRANSPARENT_HUGEPAGE, CONFIG_SWAP is also enabled in the kernel.
> >
> > https://gitlab.com/-/snippets/2584135
> >
> > Also, strace reports ENOSYS for MADV_*:
> > madvise(0x7f2912465000, 4096, MADV_NOHUGEPAGE) = -1 ENOSYS (Function not implemented)
> > madvise(0x7f2912000000, 2097152, MADV_HUGEPAGE) = -1 ENOSYS (Function not implemented)
> O. The problem here is MADV_HUGEPAGE/MADV_NOHUGEPAGE doesn't work.
> Do you have CONFIG_ADVISE_SYSCALLS enabled?
It worked after I enabled the conf. Some tests failed and some were
skipped. But I managed to reproduce the issue now, thanks Yin!

Bail out! 4 out of 166 tests failed
# Totals: pass:146 fail:4 xfail:0 xpass:0 skip:16 error:0

Here the full log:
https://gitlab.com/-/snippets/2584190/raw/main/cow.txt
>
> Regards
> Yin, Fengwei
>
> >
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei
> >>
> >>>
> >>>
> >>> Daniel
> >>>>
> >>>> User-visible effects is that the THP is skipped when user call madvise.
> >>>> But the correct behavior is THP should be split and processed then.
> >>>>
> >>>> NOTE: this change is a temporary fix to reduce the user-visible effects
> >>>> before the long term fix from David is ready.
> >>>>
> >>>> Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
> >>>> Cc: [email protected]
> >>>> Signed-off-by: Yin Fengwei <[email protected]>
> >>>> Reviewed-by: Yu Zhao <[email protected]>
> >>>> Reviewed-by: Ryan Roberts <[email protected]>
> >>>> ---
> >>>> mm/madvise.c | 2 +-
> >>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 49af35e2d99a..4dded5d27e7e 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>> if (folio_test_large(folio)) {
> >>>> int err;
> >>>>
> >>>> - if (folio_mapcount(folio) != 1)
> >>>> + if (folio_estimated_sharers(folio) != 1)
> >>>> break;
> >>>> if (!folio_trylock(folio))
> >>>> break;
> >>>> --
> >>>> 2.39.2
> >>> >

2023-08-19 22:49:00

by Daniel Gomez

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check

Hi Yin,
On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
> Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a
> folio") replaced the page_mapcount() with folio_mapcount() to check
> whether the folio is shared by other mapping.
>
> It's not correct for large folios. folio_mapcount() returns the total
> mapcount of large folio which is not suitable to detect whether the folio
> is shared.
>
> Use folio_estimated_sharers() which returns a estimated number of shares.
> That means it's not 100% correct. It should be OK for madvise case here.

I'm trying to understand why it should be ok for madvise this change, so
I hope it's okay to ask you few questions.

folio_mapcount() calculates the total maps for all the subpages of a
folio. However, the folio_estimated_sharers does it only for the first
subpage making it not true for large folios. Then, wouldn't this change
drop support for large folios?

Seems like folio_entire_mapcount() is not accurate either because of it
does not inclue PTE-mapped sub-pages which I think we need here. Hence,
the folio_mapcount(). Could this be something missing in the test side?

I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but
seems like I'm not able to do it:

./cow
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb size: 2048 KiB
# [INFO] detected hugetlb size: 1048576 KiB
# [INFO] huge zeropage is enabled
TAP version 13
1..166
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
not ok 1 MADV_NOHUGEPAGE failed
# [RUN] Basic COW after fork() ... with swapped out base page
not ok 2 MADV_NOHUGEPAGE failed
# [RUN] Basic COW after fork() ... with THP
not ok 3 MADV_HUGEPAGE failed
# [RUN] Basic COW after fork() ... with swapped-out THP
not ok 4 MADV_HUGEPAGE failed
# [RUN] Basic COW after fork() ... with PTE-mapped THP
not ok 5 MADV_HUGEPAGE failed
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
not ok 6 MADV_HUGEPAGE failed
...


Daniel
>
> User-visible effects is that the THP is skipped when user call madvise.
> But the correct behavior is THP should be split and processed then.
>
> NOTE: this change is a temporary fix to reduce the user-visible effects
> before the long term fix from David is ready.
>
> Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
> Cc: [email protected]
> Signed-off-by: Yin Fengwei <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Ryan Roberts <[email protected]>
> ---
> mm/madvise.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 49af35e2d99a..4dded5d27e7e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (folio_test_large(folio)) {
> int err;
>
> - if (folio_mapcount(folio) != 1)
> + if (folio_estimated_sharers(folio) != 1)
> break;
> if (!folio_trylock(folio))
> break;
> --
> 2.39.2
>