2023-07-28 17:03:05

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 0/2] don't use mapcount() to check large folio sharing

In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.

Use folio_estimated_sharers() here as the estimated number is enough.

Yin Fengwei (2):
madvise: don't use mapcount() against large folio for sharing check
madvise: don't use mapcount() against large folio for sharing check

mm/huge_memory.c | 2 +-
mm/madvise.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)

--
2.39.2



2023-07-28 17:36:14

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 2/2] madvise: don't use mapcount() against large folio for sharing check

The commits
98b211d6415f ("madvise: convert madvise_free_pte_range() to use
a folio")
fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to
use a folio")

replaced the page_mapcount() with folio_mapcount() to check whether
the folio is shared by other mapping.

But it's not correct for large folio. folio_mapcount() returns the
total mapcount of large folio which is not suitable to detect whether
the folio is shared.

Use folio_estimated_sharers() which returns a estimated number of
shares. That means it's not 100% correct. But it should be OK for
madvise case here.

Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
Signed-off-by: Yin Fengwei <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
---
mm/huge_memory.c | 2 +-
mm/madvise.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..68c890875257 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1613,7 +1613,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* If other processes are mapping this folio, we couldn't discard
* the folio unless they all do MADV_FREE so let's skip the folio.
*/
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto out;

if (!folio_trylock(folio))
diff --git a/mm/madvise.c b/mm/madvise.c
index 148b46beb039..55bdf641abfa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -678,7 +678,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;

- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (!folio_trylock(folio))
break;
--
2.39.2


2023-07-28 17:39:52

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 1/2] madvise: don't use mapcount() against large folio for sharing check

The commit
07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to
use folios") replaced the page_mapcount() with folio_mapcount() to
check whether the folio is shared by other mapping.

But it's not correct for large folio. folio_mapcount() returns the
total mapcount of large folio which is not suitable to detect whether
the folio is shared.

Use folio_estimated_sharers() which returns a estimated number of
shares. That means it's not 100% correct. But it should be OK for
madvise case here.

Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
Signed-off-by: Yin Fengwei <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
---
mm/madvise.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 886f06066622..148b46beb039 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -383,7 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));

/* Do not interfere with other mappings of this folio */
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto huge_unlock;

if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -457,7 +457,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;

- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
--
2.39.2


2023-07-28 18:02:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] madvise: don't use mapcount() against large folio for sharing check

On Sat, 29 Jul 2023 00:13:56 +0800 Yin Fengwei <[email protected]> wrote:

> Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
> Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")

Having two Fixes: for one patch presumably makes backporting more
complicated and adds risk of making mistakes.

So I have split this into a three-patch series and I've fixed up the patch naming:

Subject: madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
Subject: madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check
Subject: madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check

I haven't added cc:stable at this time - that awaits the description of
user-visible effects.


2023-07-28 18:11:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <[email protected]> wrote:

> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> folio_mapcount() is used to check whether the folio is shared. But it's
> not correct as folio_mapcount() returns total mapcount of large folio.
>
> Use folio_estimated_sharers() here as the estimated number is enough.

What are the user-visible runtime effects of these changes?

(and please try to avoid using the same Subject: for different patches)

2023-07-29 14:41:02

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 2/2] madvise: don't use mapcount() against large folio for sharing check

Hi Andrew,

On 7/29/2023 1:41 AM, Andrew Morton wrote:
> On Sat, 29 Jul 2023 00:13:56 +0800 Yin Fengwei <[email protected]> wrote:
>
>> Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
>> Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
>
> Having two Fixes: for one patch presumably makes backporting more
> complicated and adds risk of making mistakes.
>
> So I have split this into a three-patch series and I've fixed up the patch naming:
>
> Subject: madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
> Subject: madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check
> Subject: madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
Thanks a lot for your kind help. Will be careful for the future patches.

>
> I haven't added cc:stable at this time - that awaits the description of
> user-visible effects.
The impact of the patch:
Without the patch, when user calls madvise() with MADV_COLD, MADV_PAGEOUT
and MADV_FREE, it's likely THP pages will be skipped. With the patch,
It's likely the THP pages will be split to pages which will be made code,
reclaimed and freed.


Regards
Yin, Fengwei


2023-08-02 11:53:22

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02/08/2023 11:48, David Hildenbrand wrote:
> On 02.08.23 12:27, Ryan Roberts wrote:
>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>
>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>
>>> Yin Fengwei (2):
>>>    madvise: don't use mapcount() against large folio for sharing check
>>>    madvise: don't use mapcount() against large folio for sharing check
>>>
>>>   mm/huge_memory.c | 2 +-
>>>   mm/madvise.c     | 6 +++---
>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>
>> As a set of fixes, I agree this is definitely an improvement, so:
>>
>> Reviewed-By: Ryan Roberts
>>
>>
>> But I have a couple of comments around further improvements;
>>
>> Once we have the scheme that David is working on to be able to provide precise
>> exclusive vs shared info, we will probably want to move to that. Although that
>> scheme will need access to the mm_struct of a process known to be mapping the
>
> There are probably ways to work around lack of mm_struct, but it would not be
> completely for free. But passing the mm_struct should probably be an easy
> refactoring.
>
>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>> call sites again.
>
> We should probably just have a
>
> folio_maybe_mapped_shared()
>
> with proper documentation. Nobody should care about the exact number.
>
>
> If my scheme for anon pages makes it in, that would be precise for anon pages
> and we could document that. Once we can handle pagecache pages as well to get a
> precise answer, we could change to folio_mapped_shared() and adjust the
> documentation.

Makes sense to me. I'm assuming your change would allow us to get rid of
PG_anon_exclusive too? In which case we would also want a precise API
specifically for anon folios for the CoW case, without waiting for pagecache
page support.

>
>
> I just saw
>
> https://lkml.kernel.org/r/[email protected]
>
> that converts a lot of code to folio_estimated_sharers().
>
>
> That patchset, for example, also does
>
> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
>
> I'm not 100% sure what to think about that at this point. We eventually add
> false negatives (actually shared but we fail to detect it) all over the place,
> instead of having false positives (actually exclusive, but we fail to detect it).
>
> And that patch set doesn't even spell that out.
>
>
> Maybe it's as good as we will get, especially if my scheme doesn't make it in.

I've been working on the assumption that your scheme is plan A, and I'm waiting
for it to unblock forward progress on large anon folios. Is this the right
approach, or do you think your scheme is sufficiently riskly and/or far out that
I should aim not to depend on it?

> But we should definitely spell that out.
>


2023-08-02 12:16:33

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 28/07/2023 17:13, Yin Fengwei wrote:
> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> folio_mapcount() is used to check whether the folio is shared. But it's
> not correct as folio_mapcount() returns total mapcount of large folio.
>
> Use folio_estimated_sharers() here as the estimated number is enough.
>
> Yin Fengwei (2):
> madvise: don't use mapcount() against large folio for sharing check
> madvise: don't use mapcount() against large folio for sharing check
>
> mm/huge_memory.c | 2 +-
> mm/madvise.c | 6 +++---
> 2 files changed, 4 insertions(+), 4 deletions(-)
>

As a set of fixes, I agree this is definitely an improvement, so:

Reviewed-By: Ryan Roberts


But I have a couple of comments around further improvements;

Once we have the scheme that David is working on to be able to provide precise
exclusive vs shared info, we will probably want to move to that. Although that
scheme will need access to the mm_struct of a process known to be mapping the
folio. We have that info, but its not passed to folio_estimated_sharers() so we
can't just reimplement folio_estimated_sharers() - we will need to rework these
call sites again.

Given the aspiration for most of the memory to be large folios going forwards,
wouldn't it be better to avoid splitting the large folio where the large folio
is mapped entirely within the range of the madvise operation? Sorry if this has
already been discussed and decided against - I didn't follow the RFC too
closely. Or perhaps you plan to do this as a follow up?

Thanks,
Ryan


2023-08-02 12:22:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02.08.23 12:27, Ryan Roberts wrote:
> On 28/07/2023 17:13, Yin Fengwei wrote:
>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>> folio_mapcount() is used to check whether the folio is shared. But it's
>> not correct as folio_mapcount() returns total mapcount of large folio.
>>
>> Use folio_estimated_sharers() here as the estimated number is enough.
>>
>> Yin Fengwei (2):
>> madvise: don't use mapcount() against large folio for sharing check
>> madvise: don't use mapcount() against large folio for sharing check
>>
>> mm/huge_memory.c | 2 +-
>> mm/madvise.c | 6 +++---
>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>
>
> As a set of fixes, I agree this is definitely an improvement, so:
>
> Reviewed-By: Ryan Roberts
>
>
> But I have a couple of comments around further improvements;
>
> Once we have the scheme that David is working on to be able to provide precise
> exclusive vs shared info, we will probably want to move to that. Although that
> scheme will need access to the mm_struct of a process known to be mapping the

There are probably ways to work around lack of mm_struct, but it would
not be completely for free. But passing the mm_struct should probably be
an easy refactoring.

> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> can't just reimplement folio_estimated_sharers() - we will need to rework these
> call sites again.

We should probably just have a

folio_maybe_mapped_shared()

with proper documentation. Nobody should care about the exact number.


If my scheme for anon pages makes it in, that would be precise for anon
pages and we could document that. Once we can handle pagecache pages as
well to get a precise answer, we could change to folio_mapped_shared()
and adjust the documentation.


I just saw

https://lkml.kernel.org/r/[email protected]

that converts a lot of code to folio_estimated_sharers().


That patchset, for example, also does

total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1

I'm not 100% sure what to think about that at this point. We eventually
add false negatives (actually shared but we fail to detect it) all over
the place, instead of having false positives (actually exclusive, but we
fail to detect it).

And that patch set doesn't even spell that out.


Maybe it's as good as we will get, especially if my scheme doesn't make
it in. But we should definitely spell that out.

--
Cheers,

David / dhildenb


2023-08-02 12:23:32

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02/08/2023 12:36, David Hildenbrand wrote:
> On 02.08.23 13:20, Ryan Roberts wrote:
>> On 02/08/2023 11:48, David Hildenbrand wrote:
>>> On 02.08.23 12:27, Ryan Roberts wrote:
>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>
>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>
>>>>> Yin Fengwei (2):
>>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>>
>>>>>    mm/huge_memory.c | 2 +-
>>>>>    mm/madvise.c     | 6 +++---
>>>>>    2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>
>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>
>>>> Reviewed-By: Ryan Roberts
>>>>
>>>>
>>>> But I have a couple of comments around further improvements;
>>>>
>>>> Once we have the scheme that David is working on to be able to provide precise
>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>
>>> There are probably ways to work around lack of mm_struct, but it would not be
>>> completely for free. But passing the mm_struct should probably be an easy
>>> refactoring.
>>>
>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>> call sites again.
>>>
>>> We should probably just have a
>>>
>>> folio_maybe_mapped_shared()
>>>
>>> with proper documentation. Nobody should care about the exact number.
>>>
>>>
>>> If my scheme for anon pages makes it in, that would be precise for anon pages
>>> and we could document that. Once we can handle pagecache pages as well to get a
>>> precise answer, we could change to folio_mapped_shared() and adjust the
>>> documentation.
>>
>> Makes sense to me. I'm assuming your change would allow us to get rid of
>> PG_anon_exclusive too? In which case we would also want a precise API
>> specifically for anon folios for the CoW case, without waiting for pagecache
>> page support.
>
> Not necessarily and I'm currently not planning that
>
> On the COW path, I'm planning on using it only when PG_anon_exclusive is clear
> for a compound page, combined with a check that there are no other page
> references besides from mappings: all mappings from me and #refs == #mappings ->
> reuse (set PG_anon_exclusive). That keeps the default (no fork) as fast as
> possible and simple.
>
>>>
>>> I just saw
>>>
>>> https://lkml.kernel.org/r/[email protected]
>>>
>>> that converts a lot of code to folio_estimated_sharers().
>>>
>>>
>>> That patchset, for example, also does
>>>
>>> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
>>>
>>> I'm not 100% sure what to think about that at this point. We eventually add
>>> false negatives (actually shared but we fail to detect it) all over the place,
>>> instead of having false positives (actually exclusive, but we fail to detect
>>> it).
>>>
>>> And that patch set doesn't even spell that out.
>>>
>>>
>>> Maybe it's as good as we will get, especially if my scheme doesn't make it in.
>>
>> I've been working on the assumption that your scheme is plan A, and I'm waiting
>> for it to unblock forward progress on large anon folios. Is this the right
>> approach, or do you think your scheme is sufficiently riskly and/or far out that
>> I should aim not to depend on it?
>
> It is plan A. IMHO, it does not feel too risky and/or far out at this point --
> and the implementation should not end up too complicated. But as always, I
> cannot promise anything before it's been implemented and discussed upstream.

OK, good we are on the same folio... (stolen from Hugh; if a joke is worth
telling once, its worth telling 1000 times ;-)

>
> Hopefully, we know more soon. I'll get at implementing it fairly soon.
>


2023-08-02 12:24:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02.08.23 13:51, Ryan Roberts wrote:
> On 02/08/2023 12:36, David Hildenbrand wrote:
>> On 02.08.23 13:20, Ryan Roberts wrote:
>>> On 02/08/2023 11:48, David Hildenbrand wrote:
>>>> On 02.08.23 12:27, Ryan Roberts wrote:
>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>
>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>
>>>>>> Yin Fengwei (2):
>>>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>>>
>>>>>>    mm/huge_memory.c | 2 +-
>>>>>>    mm/madvise.c     | 6 +++---
>>>>>>    2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>
>>>>>
>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>
>>>>> Reviewed-By: Ryan Roberts
>>>>>
>>>>>
>>>>> But I have a couple of comments around further improvements;
>>>>>
>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>
>>>> There are probably ways to work around lack of mm_struct, but it would not be
>>>> completely for free. But passing the mm_struct should probably be an easy
>>>> refactoring.
>>>>
>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>> call sites again.
>>>>
>>>> We should probably just have a
>>>>
>>>> folio_maybe_mapped_shared()
>>>>
>>>> with proper documentation. Nobody should care about the exact number.
>>>>
>>>>
>>>> If my scheme for anon pages makes it in, that would be precise for anon pages
>>>> and we could document that. Once we can handle pagecache pages as well to get a
>>>> precise answer, we could change to folio_mapped_shared() and adjust the
>>>> documentation.
>>>
>>> Makes sense to me. I'm assuming your change would allow us to get rid of
>>> PG_anon_exclusive too? In which case we would also want a precise API
>>> specifically for anon folios for the CoW case, without waiting for pagecache
>>> page support.
>>
>> Not necessarily and I'm currently not planning that
>>
>> On the COW path, I'm planning on using it only when PG_anon_exclusive is clear
>> for a compound page, combined with a check that there are no other page
>> references besides from mappings: all mappings from me and #refs == #mappings ->
>> reuse (set PG_anon_exclusive). That keeps the default (no fork) as fast as
>> possible and simple.
>>
>>>>
>>>> I just saw
>>>>
>>>> https://lkml.kernel.org/r/[email protected]
>>>>
>>>> that converts a lot of code to folio_estimated_sharers().
>>>>
>>>>
>>>> That patchset, for example, also does
>>>>
>>>> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
>>>>
>>>> I'm not 100% sure what to think about that at this point. We eventually add
>>>> false negatives (actually shared but we fail to detect it) all over the place,
>>>> instead of having false positives (actually exclusive, but we fail to detect
>>>> it).
>>>>
>>>> And that patch set doesn't even spell that out.
>>>>
>>>>
>>>> Maybe it's as good as we will get, especially if my scheme doesn't make it in.
>>>
>>> I've been working on the assumption that your scheme is plan A, and I'm waiting
>>> for it to unblock forward progress on large anon folios. Is this the right
>>> approach, or do you think your scheme is sufficiently riskly and/or far out that
>>> I should aim not to depend on it?
>>
>> It is plan A. IMHO, it does not feel too risky and/or far out at this point --
>> and the implementation should not end up too complicated. But as always, I
>> cannot promise anything before it's been implemented and discussed upstream.
>
> OK, good we are on the same folio... (stolen from Hugh; if a joke is worth
> telling once, its worth telling 1000 times ;-)

Heard it first the time :))

--
Cheers,

David / dhildenb


2023-08-02 12:51:29

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

Hi Andrew,

On 7/29/2023 1:24 AM, Andrew Morton wrote:
> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <[email protected]> wrote:
>
>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>> folio_mapcount() is used to check whether the folio is shared. But it's
>> not correct as folio_mapcount() returns total mapcount of large folio.
>>
>> Use folio_estimated_sharers() here as the estimated number is enough.
>
> What are the user-visible runtime effects of these changes?
>
> (and please try to avoid using the same Subject: for different patches)
>

Can you hold on these patches to mm-unstable? I think we need to wait for
David's work on folio_maybe_mapped_shared() and redo the fix base on that.
Thanks and sorry for the noise.


Regards
Yin, Fengwei

2023-08-02 13:28:55

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing



On 8/2/2023 6:27 PM, Ryan Roberts wrote:
> On 28/07/2023 17:13, Yin Fengwei wrote:
>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>> folio_mapcount() is used to check whether the folio is shared. But it's
>> not correct as folio_mapcount() returns total mapcount of large folio.
>>
>> Use folio_estimated_sharers() here as the estimated number is enough.
>>
>> Yin Fengwei (2):
>> madvise: don't use mapcount() against large folio for sharing check
>> madvise: don't use mapcount() against large folio for sharing check
>>
>> mm/huge_memory.c | 2 +-
>> mm/madvise.c | 6 +++---
>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>
>
> As a set of fixes, I agree this is definitely an improvement, so:
>
> Reviewed-By: Ryan Roberts
Thanks.

>
>
> But I have a couple of comments around further improvements;
>
> Once we have the scheme that David is working on to be able to provide precise
> exclusive vs shared info, we will probably want to move to that. Although that
> scheme will need access to the mm_struct of a process known to be mapping the
> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> can't just reimplement folio_estimated_sharers() - we will need to rework these
> call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.

>
> Given the aspiration for most of the memory to be large folios going forwards,
> wouldn't it be better to avoid splitting the large folio where the large folio
> is mapped entirely within the range of the madvise operation? Sorry if this has
> already been discussed and decided against - I didn't follow the RFC too
> closely. Or perhaps you plan to do this as a follow up?
Yes. We are on same page. RFC patchset did that. But there are some other opens
on the RFC. So I tried to submit this part of change which is bug fix. The other
thing left in RFC is optimization (avoid split large folio if we can).


Regards
Yin, Fengwei

>
> Thanks,
> Ryan
>

2023-08-02 13:34:38

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02/08/2023 13:42, Yin, Fengwei wrote:
>
>
> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>
>>>
>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>
>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>
>>>>> Yin Fengwei (2):
>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>
>>>>> mm/huge_memory.c | 2 +-
>>>>> mm/madvise.c | 6 +++---
>>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>
>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>
>>>> Reviewed-By: Ryan Roberts
>>> Thanks.
>>>
>>>>
>>>>
>>>> But I have a couple of comments around further improvements;
>>>>
>>>> Once we have the scheme that David is working on to be able to provide precise
>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>> call sites again.
>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>
>> What you have is definitely an improvement over what was there before. And is
>> probably the best we can do without David's scheme. So I wouldn't delay this.
>> Just pointing out that we will be able to make it even better later on (if
>> David's stuff goes in).
> Yes. I agree that we should wait for David's work ready and do fix based on that.

I was suggesting the opposite - not waiting. Then we can do separate improvement
later.

>
>
> Regards
> Yin, Fengwei
>
>>
>>>
>>>>
>>>> Given the aspiration for most of the memory to be large folios going forwards,
>>>> wouldn't it be better to avoid splitting the large folio where the large folio
>>>> is mapped entirely within the range of the madvise operation? Sorry if this has
>>>> already been discussed and decided against - I didn't follow the RFC too
>>>> closely. Or perhaps you plan to do this as a follow up?
>>> Yes. We are on same page. RFC patchset did that. But there are some other opens
>>> on the RFC. So I tried to submit this part of change which is bug fix. The other
>>> thing left in RFC is optimization (avoid split large folio if we can).
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>
>>


2023-08-02 13:35:16

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02/08/2023 13:35, Yin, Fengwei wrote:
>
>
> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>
>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>
>>> Yin Fengwei (2):
>>> madvise: don't use mapcount() against large folio for sharing check
>>> madvise: don't use mapcount() against large folio for sharing check
>>>
>>> mm/huge_memory.c | 2 +-
>>> mm/madvise.c | 6 +++---
>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>
>> As a set of fixes, I agree this is definitely an improvement, so:
>>
>> Reviewed-By: Ryan Roberts
> Thanks.
>
>>
>>
>> But I have a couple of comments around further improvements;
>>
>> Once we have the scheme that David is working on to be able to provide precise
>> exclusive vs shared info, we will probably want to move to that. Although that
>> scheme will need access to the mm_struct of a process known to be mapping the
>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>> call sites again.
> Yes. This could be extra work. Maybe should delay till David's work is done.

What you have is definitely an improvement over what was there before. And is
probably the best we can do without David's scheme. So I wouldn't delay this.
Just pointing out that we will be able to make it even better later on (if
David's stuff goes in).

>
>>
>> Given the aspiration for most of the memory to be large folios going forwards,
>> wouldn't it be better to avoid splitting the large folio where the large folio
>> is mapped entirely within the range of the madvise operation? Sorry if this has
>> already been discussed and decided against - I didn't follow the RFC too
>> closely. Or perhaps you plan to do this as a follow up?
> Yes. We are on same page. RFC patchset did that. But there are some other opens
> on the RFC. So I tried to submit this part of change which is bug fix. The other
> thing left in RFC is optimization (avoid split large folio if we can).
>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Ryan
>>


2023-08-02 13:39:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02.08.23 13:20, Ryan Roberts wrote:
> On 02/08/2023 11:48, David Hildenbrand wrote:
>> On 02.08.23 12:27, Ryan Roberts wrote:
>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>
>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> Yin Fengwei (2):
>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>
>>>>   mm/huge_memory.c | 2 +-
>>>>   mm/madvise.c     | 6 +++---
>>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>
>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>
>>> Reviewed-By: Ryan Roberts
>>>
>>>
>>> But I have a couple of comments around further improvements;
>>>
>>> Once we have the scheme that David is working on to be able to provide precise
>>> exclusive vs shared info, we will probably want to move to that. Although that
>>> scheme will need access to the mm_struct of a process known to be mapping the
>>
>> There are probably ways to work around lack of mm_struct, but it would not be
>> completely for free. But passing the mm_struct should probably be an easy
>> refactoring.
>>
>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>> call sites again.
>>
>> We should probably just have a
>>
>> folio_maybe_mapped_shared()
>>
>> with proper documentation. Nobody should care about the exact number.
>>
>>
>> If my scheme for anon pages makes it in, that would be precise for anon pages
>> and we could document that. Once we can handle pagecache pages as well to get a
>> precise answer, we could change to folio_mapped_shared() and adjust the
>> documentation.
>
> Makes sense to me. I'm assuming your change would allow us to get rid of
> PG_anon_exclusive too? In which case we would also want a precise API
> specifically for anon folios for the CoW case, without waiting for pagecache
> page support.

Not necessarily and I'm currently not planning that

On the COW path, I'm planning on using it only when PG_anon_exclusive is
clear for a compound page, combined with a check that there are no other
page references besides from mappings: all mappings from me and #refs ==
#mappings -> reuse (set PG_anon_exclusive). That keeps the default (no
fork) as fast as possible and simple.

>>
>> I just saw
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> that converts a lot of code to folio_estimated_sharers().
>>
>>
>> That patchset, for example, also does
>>
>> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
>>
>> I'm not 100% sure what to think about that at this point. We eventually add
>> false negatives (actually shared but we fail to detect it) all over the place,
>> instead of having false positives (actually exclusive, but we fail to detect it).
>>
>> And that patch set doesn't even spell that out.
>>
>>
>> Maybe it's as good as we will get, especially if my scheme doesn't make it in.
>
> I've been working on the assumption that your scheme is plan A, and I'm waiting
> for it to unblock forward progress on large anon folios. Is this the right
> approach, or do you think your scheme is sufficiently riskly and/or far out that
> I should aim not to depend on it?

It is plan A. IMHO, it does not feel too risky and/or far out at this
point -- and the implementation should not end up too complicated. But
as always, I cannot promise anything before it's been implemented and
discussed upstream.

Hopefully, we know more soon. I'll get at implementing it fairly soon.

--
Cheers,

David / dhildenb


2023-08-02 14:08:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 02.08.23 14:40, Ryan Roberts wrote:
> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>
>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> Yin Fengwei (2):
>>>> madvise: don't use mapcount() against large folio for sharing check
>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>
>>>> mm/huge_memory.c | 2 +-
>>>> mm/madvise.c | 6 +++---
>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>
>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>
>>> Reviewed-By: Ryan Roberts
>> Thanks.
>>
>>>
>>>
>>> But I have a couple of comments around further improvements;
>>>
>>> Once we have the scheme that David is working on to be able to provide precise
>>> exclusive vs shared info, we will probably want to move to that. Although that
>>> scheme will need access to the mm_struct of a process known to be mapping the
>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>> call sites again.
>> Yes. This could be extra work. Maybe should delay till David's work is done.
>
> What you have is definitely an improvement over what was there before. And is
> probably the best we can do without David's scheme. So I wouldn't delay this.
> Just pointing out that we will be able to make it even better later on (if
> David's stuff goes in).

Agreed, we just should be careful and clearly spell out the implications
and that this is eventually also not what we 100% want.

That MADV_PAGEOUT now fails on a PTE-mapped THP -- as can be seen when
executing the cow selftest where MADV_PAGEOUT will essentially fail --
is certainly undesired and should be fixed IMHO.

--
Cheers,

David / dhildenb


2023-08-02 14:20:14

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing



On 8/2/2023 8:40 PM, Ryan Roberts wrote:
> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>
>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> Yin Fengwei (2):
>>>> madvise: don't use mapcount() against large folio for sharing check
>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>
>>>> mm/huge_memory.c | 2 +-
>>>> mm/madvise.c | 6 +++---
>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>
>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>
>>> Reviewed-By: Ryan Roberts
>> Thanks.
>>
>>>
>>>
>>> But I have a couple of comments around further improvements;
>>>
>>> Once we have the scheme that David is working on to be able to provide precise
>>> exclusive vs shared info, we will probably want to move to that. Although that
>>> scheme will need access to the mm_struct of a process known to be mapping the
>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>> call sites again.
>> Yes. This could be extra work. Maybe should delay till David's work is done.
>
> What you have is definitely an improvement over what was there before. And is
> probably the best we can do without David's scheme. So I wouldn't delay this.
> Just pointing out that we will be able to make it even better later on (if
> David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.


Regards
Yin, Fengwei

>
>>
>>>
>>> Given the aspiration for most of the memory to be large folios going forwards,
>>> wouldn't it be better to avoid splitting the large folio where the large folio
>>> is mapped entirely within the range of the madvise operation? Sorry if this has
>>> already been discussed and decided against - I didn't follow the RFC too
>>> closely. Or perhaps you plan to do this as a follow up?
>> Yes. We are on same page. RFC patchset did that. But there are some other opens
>> on the RFC. So I tried to submit this part of change which is bug fix. The other
>> thing left in RFC is optimization (avoid split large folio if we can).
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>
>

2023-08-02 15:02:16

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing



On 8/2/2023 8:49 PM, Ryan Roberts wrote:
> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>
>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>
>>>>>> Yin Fengwei (2):
>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>>
>>>>>> mm/huge_memory.c | 2 +-
>>>>>> mm/madvise.c | 6 +++---
>>>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>
>>>>>
>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>
>>>>> Reviewed-By: Ryan Roberts
>>>> Thanks.
>>>>
>>>>>
>>>>>
>>>>> But I have a couple of comments around further improvements;
>>>>>
>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>> call sites again.
>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>
>>> What you have is definitely an improvement over what was there before. And is
>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>> Just pointing out that we will be able to make it even better later on (if
>>> David's stuff goes in).
>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>
> I was suggesting the opposite - not waiting. Then we can do separate improvement
> later.
Let's wait for David's work ready. Otherwise, some other similar fix could be
submitted. Unfortunately, we can't build folio_estimated_sharers() based on
folio_maybe_mapped_shared() because latter one requires the extra parameter.


Regards
Yin, Fengwei

>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>>>
>>>>> Given the aspiration for most of the memory to be large folios going forwards,
>>>>> wouldn't it be better to avoid splitting the large folio where the large folio
>>>>> is mapped entirely within the range of the madvise operation? Sorry if this has
>>>>> already been discussed and decided against - I didn't follow the RFC too
>>>>> closely. Or perhaps you plan to do this as a follow up?
>>>> Yes. We are on same page. RFC patchset did that. But there are some other opens
>>>> on the RFC. So I tried to submit this part of change which is bug fix. The other
>>>> thing left in RFC is optimization (avoid split large folio if we can).
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>
>>>
>

2023-08-03 22:18:45

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
> > On 02/08/2023 13:42, Yin, Fengwei wrote:
> >>
> >>
> >> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
> >>> On 02/08/2023 13:35, Yin, Fengwei wrote:
> >>>>
> >>>>
> >>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
> >>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
> >>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> >>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
> >>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
> >>>>>>
> >>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
> >>>>>>
> >>>>>> Yin Fengwei (2):
> >>>>>> madvise: don't use mapcount() against large folio for sharing check
> >>>>>> madvise: don't use mapcount() against large folio for sharing check
> >>>>>>
> >>>>>> mm/huge_memory.c | 2 +-
> >>>>>> mm/madvise.c | 6 +++---
> >>>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>
> >>>>> As a set of fixes, I agree this is definitely an improvement, so:
> >>>>>
> >>>>> Reviewed-By: Ryan Roberts
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>>
> >>>>> But I have a couple of comments around further improvements;
> >>>>>
> >>>>> Once we have the scheme that David is working on to be able to provide precise
> >>>>> exclusive vs shared info, we will probably want to move to that. Although that
> >>>>> scheme will need access to the mm_struct of a process known to be mapping the
> >>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> >>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
> >>>>> call sites again.
> >>>> Yes. This could be extra work. Maybe should delay till David's work is done.
> >>>
> >>> What you have is definitely an improvement over what was there before. And is
> >>> probably the best we can do without David's scheme. So I wouldn't delay this.
> >>> Just pointing out that we will be able to make it even better later on (if
> >>> David's stuff goes in).
> >> Yes. I agree that we should wait for David's work ready and do fix based on that.
> >
> > I was suggesting the opposite - not waiting. Then we can do separate improvement
> > later.
> Let's wait for David's work ready.

Waiting is fine as long as we don't miss the next merge window -- we
don't want these two bugs to get into another release. Also I think we
should cc stable, since as David mentioned, they have been causing
selftest failures.

2023-08-04 00:45:33

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing



On 8/4/2023 4:46 AM, Yu Zhao wrote:
> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>
>>>>>>
>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>
>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>
>>>>>>>> Yin Fengwei (2):
>>>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>
>>>>>>>> mm/huge_memory.c | 2 +-
>>>>>>>> mm/madvise.c | 6 +++---
>>>>>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>
>>>>>>> Reviewed-By: Ryan Roberts
>>>>>> Thanks.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>
>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>> call sites again.
>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>
>>>>> What you have is definitely an improvement over what was there before. And is
>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>> David's stuff goes in).
>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>
>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>> later.
>> Let's wait for David's work ready.
>
> Waiting is fine as long as we don't miss the next merge window -- we
> don't want these two bugs to get into another release. Also I think we
> should cc stable, since as David mentioned, they have been causing
> selftest failures.

Stable was CCed. Andrew asked about the user-visible impact and I replied:
https://lore.kernel.org/linux-mm/[email protected]/

I was not aware that selftest is impact by the issue. And waiting for whether these
patches are necessary for stable.

But I don't want to promote this kind of change in other place as we will move to
David's solution.


Regards
Yin, Fengwei

2023-08-04 02:20:50

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 8/4/2023 4:46 AM, Yu Zhao wrote:
> > On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <[email protected]> wrote:
> >>
> >>"
> >>
> >> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
> >>> On 02/08/2023 13:42, Yin, Fengwei wrote:
> >>>>
> >>>>
> >>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
> >>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
> >>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
> >>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> >>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
> >>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
> >>>>>>>>
> >>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
> >>>>>>>>
> >>>>>>>> Yin Fengwei (2):
> >>>>>>>> madvise: don't use mapcount() against large folio for sharing check
> >>>>>>>> madvise: don't use mapcount() against large folio for sharing check
> >>>>>>>>
> >>>>>>>> mm/huge_memory.c | 2 +-
> >>>>>>>> mm/madvise.c | 6 +++---
> >>>>>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
> >>>>>>>>
> >>>>>>>
> >>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
> >>>>>>>
> >>>>>>> Reviewed-By: Ryan Roberts
> >>>>>> Thanks.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> But I have a couple of comments around further improvements;
> >>>>>>>
> >>>>>>> Once we have the scheme that David is working on to be able to provide precise
> >>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
> >>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
> >>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> >>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
> >>>>>>> call sites again.
> >>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
> >>>>>
> >>>>> What you have is definitely an improvement over what was there before. And is
> >>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
> >>>>> Just pointing out that we will be able to make it even better later on (if
> >>>>> David's stuff goes in).
> >>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
> >>>
> >>> I was suggesting the opposite - not waiting. Then we can do separate improvement
> >>> later.
> >> Let's wait for David's work ready.
> >
> > Waiting is fine as long as we don't miss the next merge window -- we
> > don't want these two bugs to get into another release. Also I think we
> > should cc stable, since as David mentioned, they have been causing
> > selftest failures.
>
> Stable was CCed.

Need to add the "Cc: [email protected]" tag:
Documentation/process/stable-kernel-rules.rst

2023-08-04 02:47:31

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing



On 8/4/2023 7:38 AM, Yu Zhao wrote:
> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>> "
>>>>
>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>
>>>>>>
>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>
>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>
>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>
>>>>>>>>>> mm/huge_memory.c | 2 +-
>>>>>>>>>> mm/madvise.c | 6 +++---
>>>>>>>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>
>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>
>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>> call sites again.
>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>
>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>> David's stuff goes in).
>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>
>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>> later.
>>>> Let's wait for David's work ready.
>>>
>>> Waiting is fine as long as we don't miss the next merge window -- we
>>> don't want these two bugs to get into another release. Also I think we
>>> should cc stable, since as David mentioned, they have been causing
>>> selftest failures.
>>
>> Stable was CCed.
>
> Need to add the "Cc: [email protected]" tag:
> Documentation/process/stable-kernel-rules.rst
OK. Thanks for clarification. I totally mis-understanded this. :).

I'd like to wait for answer from Andrew whether these patches are suitable
for stable (I suppose you think so) branch.


Regards
Yin, Fengwei

2023-08-04 07:59:12

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

Hi Andrew,

On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
> Hi Andrew,
>
> On 7/29/2023 1:24 AM, Andrew Morton wrote:
>> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <[email protected]> wrote:
>>
>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>
>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>
>> What are the user-visible runtime effects of these changes?
>>
>> (and please try to avoid using the same Subject: for different patches)
>>
>
> Can you hold on these patches to mm-unstable? I think we need to wait for
> David's work on folio_maybe_mapped_shared() and redo the fix base on that.
> Thanks and sorry for the noise.
Sorry for bothering you again for this patchset.

Let me explain the situation here:
- The reason to hold on the patches to mm-unstable is that I don't want to
promote the fix in this patch (using folio_estimated_sharers()). The
correct way is waiting for folio_maybe_mapped_shared() from David.

Merging these patches motivate using folio_estimated_sharers() in other
places. So once folio_maybe_mapped_shared() is ready, we need to replace
folio_estimated_sharers() with folio_maybe_mapped_shared().

- For this specific patches, if they are suitable for stable, we may want to
merge it (special for stable branch. I assume folio_maybe_mapped_shared()
may not be back ported to stable branch).

So how do we deal with this situation? Thanks in advance.

Regards
Yin, Fengwei

>
>
> Regards
> Yin, Fengwei

2023-08-04 09:26:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On 04.08.23 02:17, Yin, Fengwei wrote:
>
>
> On 8/4/2023 7:38 AM, Yu Zhao wrote:
>> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <[email protected]> wrote:
>>>
>>>
>>>
>>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <[email protected]> wrote:
>>>>>
>>>>> "
>>>>>
>>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>>
>>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>>
>>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>> madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>
>>>>>>>>>>> mm/huge_memory.c | 2 +-
>>>>>>>>>>> mm/madvise.c | 6 +++---
>>>>>>>>>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>>
>>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>>
>>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>>> call sites again.
>>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>>
>>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>>> David's stuff goes in).
>>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>>
>>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>>> later.
>>>>> Let's wait for David's work ready.
>>>>
>>>> Waiting is fine as long as we don't miss the next merge window -- we
>>>> don't want these two bugs to get into another release. Also I think we
>>>> should cc stable, since as David mentioned, they have been causing
>>>> selftest failures.
>>>
>>> Stable was CCed.
>>
>> Need to add the "Cc: [email protected]" tag:
>> Documentation/process/stable-kernel-rules.rst
> OK. Thanks for clarification. I totally mis-understanded this. :).
>
> I'd like to wait for answer from Andrew whether these patches are suitable
> for stable (I suppose you think so) branch.

Note that the COW test does not fail -- it skips -- but the behavir changed:

$ ./cow
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
# [INFO] huge zeropage is enabled
TAP version 13
1..190
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
ok 2 No leak from parent into child
# [RUN] Basic COW after fork() ... with THP
ok 3 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out THP
ok 4 No leak from parent into child
# [RUN] Basic COW after fork() ... with PTE-mapped THP
ok 5 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with single PTE of THP
ok 7 No leak from parent into child
# [RUN] Basic COW after fork() ... with single PTE of swapped-out THP
ok 8 No leak from parent into child
# [RUN] Basic COW after fork() ... with partially mremap()'ed THP
ok 9 No leak from parent into child
# [RUN] Basic COW after fork() ... with partially shared THP
ok 10 No leak from parent into child
...

Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).

The code that broke it is

commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
Author: Vishal Moola (Oracle) <[email protected]>
Date: Wed Dec 21 10:08:46 2022 -0800

madvise: convert madvise_cold_or_pageout_pte_range() to use folios

This change removes a number of calls to compound_head(), and saves
1729 bytes of kernel text.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


Ever since v6.3.

The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(),
conversion.


Probably all that is information worth having in the patch description.

--
Cheers,

David / dhildenb


2023-08-04 09:27:41

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

Hi David,

On 8/4/2023 3:31 PM, David Hildenbrand wrote:
> On 04.08.23 02:17, Yin, Fengwei wrote:
>>
>>
>> On 8/4/2023 7:38 AM, Yu Zhao wrote:
>>> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <[email protected]> wrote:
>>>>>>
>>>>>> "
>>>>>>
>>>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>>>
>>>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>>>
>>>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>
>>>>>>>>>>>>   mm/huge_memory.c | 2 +-
>>>>>>>>>>>>   mm/madvise.c     | 6 +++---
>>>>>>>>>>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>>>
>>>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>>>> call sites again.
>>>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>>>
>>>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>>>> David's stuff goes in).
>>>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>>>
>>>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>>>> later.
>>>>>> Let's wait for David's work ready.
>>>>>
>>>>> Waiting is fine as long as we don't miss the next merge window -- we
>>>>> don't want these two bugs to get into another release. Also I think we
>>>>> should cc stable, since as David mentioned, they have been causing
>>>>> selftest failures.
>>>>
>>>> Stable was CCed.
>>>
>>> Need to add the "Cc: [email protected]" tag:
>>> Documentation/process/stable-kernel-rules.rst
>> OK. Thanks for clarification. I totally mis-understanded this. :).
>>
>> I'd like to wait for answer from Andrew whether these patches are suitable
>> for stable (I suppose you think so) branch.
>
> Note that the COW test does not fail -- it skips -- but the behavir changed:
>
> $ ./cow
> # [INFO] detected THP size: 2048 KiB
> # [INFO] detected hugetlb page size: 2048 KiB
> # [INFO] detected hugetlb page size: 1048576 KiB
> # [INFO] huge zeropage is enabled
> TAP version 13
> 1..190
> # [INFO] Anonymous memory tests in private mappings
> # [RUN] Basic COW after fork() ... with base page
> ok 1 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped out base page
> ok 2 No leak from parent into child
> # [RUN] Basic COW after fork() ... with THP
> ok 3 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out THP
> ok 4 No leak from parent into child
> # [RUN] Basic COW after fork() ... with PTE-mapped THP
> ok 5 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
> # [RUN] Basic COW after fork() ... with single PTE of THP
> ok 7 No leak from parent into child
> # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP
> ok 8 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially mremap()'ed THP
> ok 9 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially shared THP
> ok 10 No leak from parent into child
> ...
>
> Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).
>
> The code that broke it is
>
> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
> Author: Vishal Moola (Oracle) <[email protected]>
> Date:   Wed Dec 21 10:08:46 2022 -0800
>
>     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>         This change removes a number of calls to compound_head(), and saves
>     1729 bytes of kernel text.
>         Link: https://lkml.kernel.org/r/[email protected]
>     Signed-off-by: Vishal Moola (Oracle) <[email protected]>
>     Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>     Cc: SeongJae Park <[email protected]>
>     Signed-off-by: Andrew Morton <[email protected]>
>
>
> Ever since v6.3.
>
> The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(),
> conversion.
>
>
> Probably all that is information worth having in the patch description.
Thanks a lot for checking this. I will try this patchset to see whether
it can restore the behavior (I suppose so from your broken commit info
but want to confirm).

Regards
Yin, Fengwei



2023-08-04 10:49:28

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing



On 8/4/2023 3:31 PM, David Hildenbrand wrote:
> On 04.08.23 02:17, Yin, Fengwei wrote:
>>
>>
>> On 8/4/2023 7:38 AM, Yu Zhao wrote:
>>> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <[email protected]> wrote:
>>>>>>
>>>>>> "
>>>>>>
>>>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>>>
>>>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>>>
>>>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>
>>>>>>>>>>>>   mm/huge_memory.c | 2 +-
>>>>>>>>>>>>   mm/madvise.c     | 6 +++---
>>>>>>>>>>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>>>
>>>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>>>> call sites again.
>>>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>>>
>>>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>>>> David's stuff goes in).
>>>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>>>
>>>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>>>> later.
>>>>>> Let's wait for David's work ready.
>>>>>
>>>>> Waiting is fine as long as we don't miss the next merge window -- we
>>>>> don't want these two bugs to get into another release. Also I think we
>>>>> should cc stable, since as David mentioned, they have been causing
>>>>> selftest failures.
>>>>
>>>> Stable was CCed.
>>>
>>> Need to add the "Cc: [email protected]" tag:
>>> Documentation/process/stable-kernel-rules.rst
>> OK. Thanks for clarification. I totally mis-understanded this. :).
>>
>> I'd like to wait for answer from Andrew whether these patches are suitable
>> for stable (I suppose you think so) branch.
>
> Note that the COW test does not fail -- it skips -- but the behavir changed:
>
> $ ./cow
> # [INFO] detected THP size: 2048 KiB
> # [INFO] detected hugetlb page size: 2048 KiB
> # [INFO] detected hugetlb page size: 1048576 KiB
> # [INFO] huge zeropage is enabled
> TAP version 13
> 1..190
> # [INFO] Anonymous memory tests in private mappings
> # [RUN] Basic COW after fork() ... with base page
> ok 1 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped out base page
> ok 2 No leak from parent into child
> # [RUN] Basic COW after fork() ... with THP
> ok 3 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out THP
> ok 4 No leak from parent into child
> # [RUN] Basic COW after fork() ... with PTE-mapped THP
> ok 5 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
> # [RUN] Basic COW after fork() ... with single PTE of THP
> ok 7 No leak from parent into child
> # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP
> ok 8 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially mremap()'ed THP
> ok 9 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially shared THP
> ok 10 No leak from parent into child
> ...
>
> Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).
>
> The code that broke it is
>
> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
> Author: Vishal Moola (Oracle) <[email protected]>
> Date:   Wed Dec 21 10:08:46 2022 -0800
>
>     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>         This change removes a number of calls to compound_head(), and saves
>     1729 bytes of kernel text.
>         Link: https://lkml.kernel.org/r/[email protected]
>     Signed-off-by: Vishal Moola (Oracle) <[email protected]>
>     Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>     Cc: SeongJae Park <[email protected]>
>     Signed-off-by: Andrew Morton <[email protected]>
>
>
> Ever since v6.3.
>
> The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(),
> conversion.
Confirmed this patchset also can make swapped-out, PTE-mapped THP related tests from
skip to pass.


Regards
Yin, Fengwei

>
>
> Probably all that is information worth having in the patch description.
>

2023-08-07 17:10:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

On Fri, 4 Aug 2023 15:14:53 +0800 "Yin, Fengwei" <[email protected]> wrote:

> Hi Andrew,
>
> On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
> > Hi Andrew,
> >
> > On 7/29/2023 1:24 AM, Andrew Morton wrote:
> >> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <[email protected]> wrote:
> >>
> >>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> >>> folio_mapcount() is used to check whether the folio is shared. But it's
> >>> not correct as folio_mapcount() returns total mapcount of large folio.
> >>>
> >>> Use folio_estimated_sharers() here as the estimated number is enough.
> >>
> >> What are the user-visible runtime effects of these changes?
> >>
> >> (and please try to avoid using the same Subject: for different patches)
> >>
> >
> > Can you hold on these patches to mm-unstable? I think we need to wait for
> > David's work on folio_maybe_mapped_shared() and redo the fix base on that.
> > Thanks and sorry for the noise.
> Sorry for bothering you again for this patchset.
>
> Let me explain the situation here:
> - The reason to hold on the patches to mm-unstable is that I don't want to
> promote the fix in this patch (using folio_estimated_sharers()). The
> correct way is waiting for folio_maybe_mapped_shared() from David.
>
> Merging these patches motivate using folio_estimated_sharers() in other
> places. So once folio_maybe_mapped_shared() is ready, we need to replace
> folio_estimated_sharers() with folio_maybe_mapped_shared().
>
> - For this specific patches, if they are suitable for stable, we may want to
> merge it (special for stable branch. I assume folio_maybe_mapped_shared()
> may not be back ported to stable branch).
>
> So how do we deal with this situation? Thanks in advance.
>

I think I'll stage them for 6.5, with a cc:stable.

I'll drop the current three patches. Please resend with

a) different Subject:s for all patches and

b) changelogs which fully describe the user-visible effects of the change.

Thanks.

2023-08-08 01:00:50

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

Hi Andrew,

On 8/8/2023 12:43 AM, Andrew Morton wrote:
> On Fri, 4 Aug 2023 15:14:53 +0800 "Yin, Fengwei" <[email protected]> wrote:
>
>> Hi Andrew,
>>
>> On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
>>> Hi Andrew,
>>>
>>> On 7/29/2023 1:24 AM, Andrew Morton wrote:
>>>> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <[email protected]> wrote:
>>>>
>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>
>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> What are the user-visible runtime effects of these changes?
>>>>
>>>> (and please try to avoid using the same Subject: for different patches)
>>>>
>>>
>>> Can you hold on these patches to mm-unstable? I think we need to wait for
>>> David's work on folio_maybe_mapped_shared() and redo the fix base on that.
>>> Thanks and sorry for the noise.
>> Sorry for bothering you again for this patchset.
>>
>> Let me explain the situation here:
>> - The reason to hold on the patches to mm-unstable is that I don't want to
>> promote the fix in this patch (using folio_estimated_sharers()). The
>> correct way is waiting for folio_maybe_mapped_shared() from David.
>>
>> Merging these patches motivate using folio_estimated_sharers() in other
>> places. So once folio_maybe_mapped_shared() is ready, we need to replace
>> folio_estimated_sharers() with folio_maybe_mapped_shared().
>>
>> - For this specific patches, if they are suitable for stable, we may want to
>> merge it (special for stable branch. I assume folio_maybe_mapped_shared()
>> may not be back ported to stable branch).
>>
>> So how do we deal with this situation? Thanks in advance.
>>
>
> I think I'll stage them for 6.5, with a cc:stable.
>
> I'll drop the current three patches. Please resend with
Thanks. Will resend the patches.

Regards
Yin, Fengwei

>
> a) different Subject:s for all patches and
>
> b) changelogs which fully describe the user-visible effects of the change.
>
> Thanks.