2024-04-11 15:48:05

by Zi Yan

[permalink] [raw]
Subject: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

From: Zi Yan <[email protected]>

In __folio_remove_rmap(), a large folio is added to deferred split list
if any page in a folio loses its final mapping. It is possible that
the folio is unmapped fully, but it is unnecessary to add the folio
to deferred split list at all. Fix it by checking folio mapcount before
adding a folio to deferred split list.

Signed-off-by: Zi Yan <[email protected]>
---
mm/rmap.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..d599a772e282 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
enum rmap_level level)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
- int last, nr = 0, nr_pmdmapped = 0;
+ int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
enum node_stat_item idx;

__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
break;
}

- atomic_sub(nr_pages, &folio->_large_mapcount);
+ mapcount = atomic_sub_return(nr_pages,
+ &folio->_large_mapcount) + 1;
do {
last = atomic_add_negative(-1, &page->_mapcount);
if (last) {
@@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
* is still mapped.
*/
if (folio_test_large(folio) && folio_test_anon(folio))
- if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
+ if ((level == RMAP_LEVEL_PTE &&
+ mapcount != 0) ||
+ (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
deferred_split_folio(folio);
}


base-commit: ed7c95c95397baff9b40ba9b0919933eee2d7960
--
2.43.0



2024-04-11 17:25:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 11.04.24 17:32, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. It is possible that
> the folio is unmapped fully, but it is unnecessary to add the folio
> to deferred split list at all. Fix it by checking folio mapcount before
> adding a folio to deferred split list.
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> mm/rmap.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..d599a772e282 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> enum rmap_level level)
> {
> atomic_t *mapped = &folio->_nr_pages_mapped;
> - int last, nr = 0, nr_pmdmapped = 0;
> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> enum node_stat_item idx;
>
> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> break;
> }
>
> - atomic_sub(nr_pages, &folio->_large_mapcount);
> + mapcount = atomic_sub_return(nr_pages,
> + &folio->_large_mapcount) + 1;

That becomes a new memory barrier on some archs. Rather just re-read it
below. Re-reading should be fine here.

> do {
> last = atomic_add_negative(-1, &page->_mapcount);
> if (last) {
> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> * is still mapped.
> */
> if (folio_test_large(folio) && folio_test_anon(folio))
> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> + if ((level == RMAP_LEVEL_PTE &&
> + mapcount != 0) ||
> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> deferred_split_folio(folio);
> }

But I do wonder if we really care? Usually the folio will simply get
freed afterwards, where we simply remove it from the list.

If it's pinned, we won't be able to free or reclaim, but it's rather a
corner case ...

Is it really worth the added code? Not convinced.

--
Cheers,

David / dhildenb


2024-04-11 19:02:53

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
>
> On 11.04.24 17:32, Zi Yan wrote:
> > From: Zi Yan <[email protected]>
> >
> > In __folio_remove_rmap(), a large folio is added to deferred split list
> > if any page in a folio loses its final mapping. It is possible that
> > the folio is unmapped fully, but it is unnecessary to add the folio
> > to deferred split list at all. Fix it by checking folio mapcount before
> > adding a folio to deferred split list.
> >
> > Signed-off-by: Zi Yan <[email protected]>
> > ---
> > mm/rmap.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2608c40dffad..d599a772e282 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> > enum rmap_level level)
> > {
> > atomic_t *mapped = &folio->_nr_pages_mapped;
> > - int last, nr = 0, nr_pmdmapped = 0;
> > + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> > enum node_stat_item idx;
> >
> > __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> > @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> > break;
> > }
> >
> > - atomic_sub(nr_pages, &folio->_large_mapcount);
> > + mapcount = atomic_sub_return(nr_pages,
> > + &folio->_large_mapcount) + 1;
>
> That becomes a new memory barrier on some archs. Rather just re-read it
> below. Re-reading should be fine here.
>
> > do {
> > last = atomic_add_negative(-1, &page->_mapcount);
> > if (last) {
> > @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> > * is still mapped.
> > */
> > if (folio_test_large(folio) && folio_test_anon(folio))
> > - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > + if ((level == RMAP_LEVEL_PTE &&
> > + mapcount != 0) ||
> > + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> > deferred_split_folio(folio);
> > }
>
> But I do wonder if we really care? Usually the folio will simply get
> freed afterwards, where we simply remove it from the list.
>
> If it's pinned, we won't be able to free or reclaim, but it's rather a
> corner case ...
>
> Is it really worth the added code? Not convinced.

It is actually not only an optimization, but also fixed the broken
thp_deferred_split_page counter in /proc/vmstat.

The counter actually counted the partially unmapped huge pages (so
they are on deferred split queue), but it counts the fully unmapped
mTHP as well now. For example, when a 64K THP is fully unmapped, the
thp_deferred_split_page is not supposed to get inc'ed, but it does
now.

The counter is also useful for performance analysis, for example,
whether a workload did a lot of partial unmap or not. So fixing the
counter seems worthy. Zi Yan should have mentioned this in the commit
log.

>
> --
> Cheers,
>
> David / dhildenb
>

2024-04-11 21:15:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 11.04.24 21:01, Yang Shi wrote:
> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 11.04.24 17:32, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>> if any page in a folio loses its final mapping. It is possible that
>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>> to deferred split list at all. Fix it by checking folio mapcount before
>>> adding a folio to deferred split list.
>>>
>>> Signed-off-by: Zi Yan <[email protected]>
>>> ---
>>> mm/rmap.c | 9 ++++++---
>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..d599a772e282 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>> enum rmap_level level)
>>> {
>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>> - int last, nr = 0, nr_pmdmapped = 0;
>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>> enum node_stat_item idx;
>>>
>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>> break;
>>> }
>>>
>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>> + mapcount = atomic_sub_return(nr_pages,
>>> + &folio->_large_mapcount) + 1;
>>
>> That becomes a new memory barrier on some archs. Rather just re-read it
>> below. Re-reading should be fine here.
>>
>>> do {
>>> last = atomic_add_negative(-1, &page->_mapcount);
>>> if (last) {
>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>> * is still mapped.
>>> */
>>> if (folio_test_large(folio) && folio_test_anon(folio))
>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> + if ((level == RMAP_LEVEL_PTE &&
>>> + mapcount != 0) ||
>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>> deferred_split_folio(folio);
>>> }
>>
>> But I do wonder if we really care? Usually the folio will simply get
>> freed afterwards, where we simply remove it from the list.
>>
>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>> corner case ...
>>
>> Is it really worth the added code? Not convinced.
>
> It is actually not only an optimization, but also fixed the broken
> thp_deferred_split_page counter in /proc/vmstat.
>
> The counter actually counted the partially unmapped huge pages (so
> they are on deferred split queue), but it counts the fully unmapped
> mTHP as well now. For example, when a 64K THP is fully unmapped, the
> thp_deferred_split_page is not supposed to get inc'ed, but it does
> now.
>
> The counter is also useful for performance analysis, for example,
> whether a workload did a lot of partial unmap or not. So fixing the
> counter seems worthy. Zi Yan should have mentioned this in the commit
> log.

Yes, all that is information that is missing from the patch description.
If it's a fix, there should be a "Fixes:".

Likely we want to have a folio_large_mapcount() check in the code below.
(I yet have to digest the condition where this happens -- can we have an
example where we'd use to do the wrong thing and now would do the right
thing as well?)
--
Cheers,

David / dhildenb


2024-04-11 21:59:27

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <[email protected]> wrote:
>
> On 11.04.24 21:01, Yang Shi wrote:
> > On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhatcom> wrote:
> >>
> >> On 11.04.24 17:32, Zi Yan wrote:
> >>> From: Zi Yan <[email protected]>
> >>>
> >>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>> if any page in a folio loses its final mapping. It is possible that
> >>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>> to deferred split list at all. Fix it by checking folio mapcount before
> >>> adding a folio to deferred split list.
> >>>
> >>> Signed-off-by: Zi Yan <[email protected]>
> >>> ---
> >>> mm/rmap.c | 9 ++++++---
> >>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..d599a772e282 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>> enum rmap_level level)
> >>> {
> >>> atomic_t *mapped = &folio->_nr_pages_mapped;
> >>> - int last, nr = 0, nr_pmdmapped = 0;
> >>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>> enum node_stat_item idx;
> >>>
> >>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>> break;
> >>> }
> >>>
> >>> - atomic_sub(nr_pages, &folio->_large_mapcount);
> >>> + mapcount = atomic_sub_return(nr_pages,
> >>> + &folio->_large_mapcount) + 1;
> >>
> >> That becomes a new memory barrier on some archs. Rather just re-read it
> >> below. Re-reading should be fine here.
> >>
> >>> do {
> >>> last = atomic_add_negative(-1, &page->_mapcount);
> >>> if (last) {
> >>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>> * is still mapped.
> >>> */
> >>> if (folio_test_large(folio) && folio_test_anon(folio))
> >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>> + if ((level == RMAP_LEVEL_PTE &&
> >>> + mapcount != 0) ||
> >>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> >>> deferred_split_folio(folio);
> >>> }
> >>
> >> But I do wonder if we really care? Usually the folio will simply get
> >> freed afterwards, where we simply remove it from the list.
> >>
> >> If it's pinned, we won't be able to free or reclaim, but it's rather a
> >> corner case ...
> >>
> >> Is it really worth the added code? Not convinced.
> >
> > It is actually not only an optimization, but also fixed the broken
> > thp_deferred_split_page counter in /proc/vmstat.
> >
> > The counter actually counted the partially unmapped huge pages (so
> > they are on deferred split queue), but it counts the fully unmapped
> > mTHP as well now. For example, when a 64K THP is fully unmapped, the
> > thp_deferred_split_page is not supposed to get inc'ed, but it does
> > now.
> >
> > The counter is also useful for performance analysis, for example,
> > whether a workload did a lot of partial unmap or not. So fixing the
> > counter seems worthy. Zi Yan should have mentioned this in the commit
> > log.
>
> Yes, all that is information that is missing from the patch description.
> If it's a fix, there should be a "Fixes:".
>
> Likely we want to have a folio_large_mapcount() check in the code below.
> (I yet have to digest the condition where this happens -- can we have an
> example where we'd use to do the wrong thing and now would do the right
> thing as well?)

For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
some full 64K areas, you will see thp_deferred_split_page increased,
but it shouldn't.

It looks __folio_remove_rmap() incorrectly detected whether the mTHP
is fully unmapped or partially unmapped by comparing the number of
still-mapped subpages to ENTIRELY_MAPPED, which should just work for
PMD-mappable THP.

However I just realized this problem was kind of workaround'ed by commit:

commit 98046944a1597f3a02b792dbe9665e9943b77f28
Author: Baolin Wang <[email protected]>
Date: Fri Mar 29 14:59:33 2024 +0800

mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
split statistics

Now the mTHP can also be split or added into the deferred list, so add
folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
confusion with PMD mapped THP related statistics.

Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

This commit made thp_deferred_split_page didn't count mTHP anymore, it
also made thp_split_page didn't count mTHP anymore.

However Zi Yan's patch does make the code more robust and we don't
need to worry about the miscounting issue anymore if we will add
deferred_split_page and split_page counters for mTHP in the future.

> --
> Cheers,
>
> David / dhildenb
>

2024-04-12 14:22:38

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 11 Apr 2024, at 17:59, Yang Shi wrote:

> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 11.04.24 21:01, Yang Shi wrote:
>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>> From: Zi Yan <[email protected]>
>>>>>
>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>> adding a folio to deferred split list.
>>>>>
>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>> ---
>>>>> mm/rmap.c | 9 ++++++---
>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2608c40dffad..d599a772e282 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>> enum rmap_level level)
>>>>> {
>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>> enum node_stat_item idx;
>>>>>
>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>> break;
>>>>> }
>>>>>
>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>> + &folio->_large_mapcount) + 1;
>>>>
>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>> below. Re-reading should be fine here.
>>>>
>>>>> do {
>>>>> last = atomic_add_negative(-1, &page->_mapcount);
>>>>> if (last) {
>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>> * is still mapped.
>>>>> */
>>>>> if (folio_test_large(folio) && folio_test_anon(folio))
>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>> + if ((level == RMAP_LEVEL_PTE &&
>>>>> + mapcount != 0) ||
>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>> deferred_split_folio(folio);
>>>>> }
>>>>
>>>> But I do wonder if we really care? Usually the folio will simply get
>>>> freed afterwards, where we simply remove it from the list.
>>>>
>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>> corner case ...
>>>>
>>>> Is it really worth the added code? Not convinced.
>>>
>>> It is actually not only an optimization, but also fixed the broken
>>> thp_deferred_split_page counter in /proc/vmstat.
>>>
>>> The counter actually counted the partially unmapped huge pages (so
>>> they are on deferred split queue), but it counts the fully unmapped
>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>> now.
>>>
>>> The counter is also useful for performance analysis, for example,
>>> whether a workload did a lot of partial unmap or not. So fixing the
>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>> log.
>>
>> Yes, all that is information that is missing from the patch description.
>> If it's a fix, there should be a "Fixes:".
>>
>> Likely we want to have a folio_large_mapcount() check in the code below.
>> (I yet have to digest the condition where this happens -- can we have an
>> example where we'd use to do the wrong thing and now would do the right
>> thing as well?)
>
> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
> some full 64K areas, you will see thp_deferred_split_page increased,
> but it shouldn't.
>
> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
> is fully unmapped or partially unmapped by comparing the number of
> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
> PMD-mappable THP.
>
> However I just realized this problem was kind of workaround'ed by commit:
>
> commit 98046944a1597f3a02b792dbe9665e9943b77f28
> Author: Baolin Wang <[email protected]>
> Date: Fri Mar 29 14:59:33 2024 +0800
>
> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
> split statistics
>
> Now the mTHP can also be split or added into the deferred list, so add
> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
> confusion with PMD mapped THP related statistics.
>
> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
> Signed-off-by: Baolin Wang <[email protected]>
> Acked-by: David Hildenbrand <[email protected]>
> Cc: Muchun Song <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
>
> This commit made thp_deferred_split_page didn't count mTHP anymore, it
> also made thp_split_page didn't count mTHP anymore.
>
> However Zi Yan's patch does make the code more robust and we don't
> need to worry about the miscounting issue anymore if we will add
> deferred_split_page and split_page counters for mTHP in the future.

Actually, the patch above does not fix everything. A fully unmapped
PTE-mapped order-9 THP is also added to deferred split list and
counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
(non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
the order-9 folio is folio_test_pmd_mappable().

I will add this information in the next version.

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2024-04-12 14:31:31

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12 Apr 2024, at 10:21, Zi Yan wrote:

> On 11 Apr 2024, at 17:59, Yang Shi wrote:
>
>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 11.04.24 21:01, Yang Shi wrote:
>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>> From: Zi Yan <[email protected]>
>>>>>>
>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>> adding a folio to deferred split list.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>>> ---
>>>>>> mm/rmap.c | 9 ++++++---
>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>> enum rmap_level level)
>>>>>> {
>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>> enum node_stat_item idx;
>>>>>>
>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>> break;
>>>>>> }
>>>>>>
>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>>> + &folio->_large_mapcount) + 1;
>>>>>
>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>>> below. Re-reading should be fine here.
>>>>>
>>>>>> do {
>>>>>> last = atomic_add_negative(-1, &page->_mapcount);
>>>>>> if (last) {
>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>> * is still mapped.
>>>>>> */
>>>>>> if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>>> + if ((level == RMAP_LEVEL_PTE &&
>>>>>> + mapcount != 0) ||
>>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>>> deferred_split_folio(folio);
>>>>>> }
>>>>>
>>>>> But I do wonder if we really care? Usually the folio will simply get
>>>>> freed afterwards, where we simply remove it from the list.
>>>>>
>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>>> corner case ...
>>>>>
>>>>> Is it really worth the added code? Not convinced.
>>>>
>>>> It is actually not only an optimization, but also fixed the broken
>>>> thp_deferred_split_page counter in /proc/vmstat.
>>>>
>>>> The counter actually counted the partially unmapped huge pages (so
>>>> they are on deferred split queue), but it counts the fully unmapped
>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>>> now.
>>>>
>>>> The counter is also useful for performance analysis, for example,
>>>> whether a workload did a lot of partial unmap or not. So fixing the
>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>>> log.
>>>
>>> Yes, all that is information that is missing from the patch description.
>>> If it's a fix, there should be a "Fixes:".
>>>
>>> Likely we want to have a folio_large_mapcount() check in the code below.
>>> (I yet have to digest the condition where this happens -- can we have an
>>> example where we'd use to do the wrong thing and now would do the right
>>> thing as well?)
>>
>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
>> some full 64K areas, you will see thp_deferred_split_page increased,
>> but it shouldn't.
>>
>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
>> is fully unmapped or partially unmapped by comparing the number of
>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
>> PMD-mappable THP.
>>
>> However I just realized this problem was kind of workaround'ed by commit:
>>
>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
>> Author: Baolin Wang <[email protected]>
>> Date: Fri Mar 29 14:59:33 2024 +0800
>>
>> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
>> split statistics
>>
>> Now the mTHP can also be split or added into the deferred list, so add
>> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
>> confusion with PMD mapped THP related statistics.
>>
>> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
>> Signed-off-by: Baolin Wang <[email protected]>
>> Acked-by: David Hildenbrand <[email protected]>
>> Cc: Muchun Song <[email protected]>
>> Signed-off-by: Andrew Morton <[email protected]>
>>
>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
>> also made thp_split_page didn't count mTHP anymore.
>>
>> However Zi Yan's patch does make the code more robust and we don't
>> need to worry about the miscounting issue anymore if we will add
>> deferred_split_page and split_page counters for mTHP in the future.
>
> Actually, the patch above does not fix everything. A fully unmapped
> PTE-mapped order-9 THP is also added to deferred split list and
> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> the order-9 folio is folio_test_pmd_mappable().
>
> I will add this information in the next version.

It might
Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
but before this commit fully unmapping a PTE-mapped order-9 THP still increased
THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
due to implementation. I will add this to the commit log as well without Fixes
tag.


--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2024-04-12 14:41:30

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 11 Apr 2024, at 11:46, David Hildenbrand wrote:

> On 11.04.24 17:32, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> In __folio_remove_rmap(), a large folio is added to deferred split list
>> if any page in a folio loses its final mapping. It is possible that
>> the folio is unmapped fully, but it is unnecessary to add the folio
>> to deferred split list at all. Fix it by checking folio mapcount before
>> adding a folio to deferred split list.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> mm/rmap.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2608c40dffad..d599a772e282 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>> enum rmap_level level)
>> {
>> atomic_t *mapped = &folio->_nr_pages_mapped;
>> - int last, nr = 0, nr_pmdmapped = 0;
>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>> enum node_stat_item idx;
>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>> break;
>> }
>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>> + mapcount = atomic_sub_return(nr_pages,
>> + &folio->_large_mapcount) + 1;
>
> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.

Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
below, but to save an atomic op, I chose to read mapcount here.

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2024-04-12 18:32:12

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <[email protected]> wrote:
>
> On 12 Apr 2024, at 10:21, Zi Yan wrote:
>
> > On 11 Apr 2024, at 17:59, Yang Shi wrote:
> >
> >> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 11.04.24 21:01, Yang Shi wrote:
> >>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>> From: Zi Yan <[email protected]>
> >>>>>>
> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>> adding a folio to deferred split list.
> >>>>>>
> >>>>>> Signed-off-by: Zi Yan <[email protected]>
> >>>>>> ---
> >>>>>> mm/rmap.c | 9 ++++++---
> >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>> enum rmap_level level)
> >>>>>> {
> >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>> - int last, nr = 0, nr_pmdmapped = 0;
> >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>> enum node_stat_item idx;
> >>>>>>
> >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>> break;
> >>>>>> }
> >>>>>>
> >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>> + mapcount = atomic_sub_return(nr_pages,
> >>>>>> + &folio->_large_mapcount) + 1;
> >>>>>
> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it
> >>>>> below. Re-reading should be fine here.
> >>>>>
> >>>>>> do {
> >>>>>> last = atomic_add_negative(-1, &page->_mapcount);
> >>>>>> if (last) {
> >>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>> * is still mapped.
> >>>>>> */
> >>>>>> if (folio_test_large(folio) && folio_test_anon(folio))
> >>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>>>>> + if ((level == RMAP_LEVEL_PTE &&
> >>>>>> + mapcount != 0) ||
> >>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> >>>>>> deferred_split_folio(folio);
> >>>>>> }
> >>>>>
> >>>>> But I do wonder if we really care? Usually the folio will simply get
> >>>>> freed afterwards, where we simply remove it from the list.
> >>>>>
> >>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
> >>>>> corner case ...
> >>>>>
> >>>>> Is it really worth the added code? Not convinced.
> >>>>
> >>>> It is actually not only an optimization, but also fixed the broken
> >>>> thp_deferred_split_page counter in /proc/vmstat.
> >>>>
> >>>> The counter actually counted the partially unmapped huge pages (so
> >>>> they are on deferred split queue), but it counts the fully unmapped
> >>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
> >>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
> >>>> now.
> >>>>
> >>>> The counter is also useful for performance analysis, for example,
> >>>> whether a workload did a lot of partial unmap or not. So fixing the
> >>>> counter seems worthy. Zi Yan should have mentioned this in the commit
> >>>> log.
> >>>
> >>> Yes, all that is information that is missing from the patch description.
> >>> If it's a fix, there should be a "Fixes:".
> >>>
> >>> Likely we want to have a folio_large_mapcount() check in the code below.
> >>> (I yet have to digest the condition where this happens -- can we have an
> >>> example where we'd use to do the wrong thing and now would do the right
> >>> thing as well?)
> >>
> >> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
> >> some full 64K areas, you will see thp_deferred_split_page increased,
> >> but it shouldn't.
> >>
> >> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
> >> is fully unmapped or partially unmapped by comparing the number of
> >> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
> >> PMD-mappable THP.
> >>
> >> However I just realized this problem was kind of workaround'ed by commit:
> >>
> >> commit 98046944a1597f3a02b792dbe9665e9943b77f28
> >> Author: Baolin Wang <[email protected]>
> >> Date: Fri Mar 29 14:59:33 2024 +0800
> >>
> >> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
> >> split statistics
> >>
> >> Now the mTHP can also be split or added into the deferred list, so add
> >> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
> >> confusion with PMD mapped THP related statistics.
> >>
> >> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
> >> Signed-off-by: Baolin Wang <[email protected]>
> >> Acked-by: David Hildenbrand <[email protected]>
> >> Cc: Muchun Song <[email protected]>
> >> Signed-off-by: Andrew Morton <[email protected]>
> >>
> >> This commit made thp_deferred_split_page didn't count mTHP anymore, it
> >> also made thp_split_page didn't count mTHP anymore.
> >>
> >> However Zi Yan's patch does make the code more robust and we don't
> >> need to worry about the miscounting issue anymore if we will add
> >> deferred_split_page and split_page counters for mTHP in the future.
> >
> > Actually, the patch above does not fix everything. A fully unmapped
> > PTE-mapped order-9 THP is also added to deferred split list and
> > counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
> > (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> > the order-9 folio is folio_test_pmd_mappable().
> >
> > I will add this information in the next version.
>
> It might
> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
> due to implementation. I will add this to the commit log as well without Fixes
> tag.

Thanks for digging deeper. The problem may be not that obvious before
mTHP because PMD-mappable THP is converted to PTE-mapped due to
partial unmap in most cases. But mTHP is always PTE-mapped in the
first place. The other reason is batched rmap remove was not supported
before David's optimization.

Now we do have reasonable motivation to make it precise and it is also
easier to do so than before.

>
>
> --
> Best Regards,
> Yan, Zi

2024-04-12 19:07:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12.04.24 16:31, Zi Yan wrote:
> On 12 Apr 2024, at 10:21, Zi Yan wrote:
>
>> On 11 Apr 2024, at 17:59, Yang Shi wrote:
>>
>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 11.04.24 21:01, Yang Shi wrote:
>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
>>>>>>
>>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>>> From: Zi Yan <[email protected]>
>>>>>>>
>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>>> adding a folio to deferred split list.
>>>>>>>
>>>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>>>> ---
>>>>>>> mm/rmap.c | 9 ++++++---
>>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>> enum rmap_level level)
>>>>>>> {
>>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>> enum node_stat_item idx;
>>>>>>>
>>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>> break;
>>>>>>> }
>>>>>>>
>>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>>>> + &folio->_large_mapcount) + 1;
>>>>>>
>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>>>> below. Re-reading should be fine here.
>>>>>>
>>>>>>> do {
>>>>>>> last = atomic_add_negative(-1, &page->_mapcount);
>>>>>>> if (last) {
>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>> * is still mapped.
>>>>>>> */
>>>>>>> if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>>>> + if ((level == RMAP_LEVEL_PTE &&
>>>>>>> + mapcount != 0) ||
>>>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>>>> deferred_split_folio(folio);
>>>>>>> }
>>>>>>
>>>>>> But I do wonder if we really care? Usually the folio will simply get
>>>>>> freed afterwards, where we simply remove it from the list.
>>>>>>
>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>>>> corner case ...
>>>>>>
>>>>>> Is it really worth the added code? Not convinced.
>>>>>
>>>>> It is actually not only an optimization, but also fixed the broken
>>>>> thp_deferred_split_page counter in /proc/vmstat.
>>>>>
>>>>> The counter actually counted the partially unmapped huge pages (so
>>>>> they are on deferred split queue), but it counts the fully unmapped
>>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>>>> now.
>>>>>
>>>>> The counter is also useful for performance analysis, for example,
>>>>> whether a workload did a lot of partial unmap or not. So fixing the
>>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>>>> log.
>>>>
>>>> Yes, all that is information that is missing from the patch description.
>>>> If it's a fix, there should be a "Fixes:".
>>>>
>>>> Likely we want to have a folio_large_mapcount() check in the code below.
>>>> (I yet have to digest the condition where this happens -- can we have an
>>>> example where we'd use to do the wrong thing and now would do the right
>>>> thing as well?)
>>>
>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
>>> some full 64K areas, you will see thp_deferred_split_page increased,
>>> but it shouldn't.
>>>
>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
>>> is fully unmapped or partially unmapped by comparing the number of
>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
>>> PMD-mappable THP.
>>>
>>> However I just realized this problem was kind of workaround'ed by commit:
>>>
>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
>>> Author: Baolin Wang <[email protected]>
>>> Date: Fri Mar 29 14:59:33 2024 +0800
>>>
>>> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
>>> split statistics
>>>
>>> Now the mTHP can also be split or added into the deferred list, so add
>>> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
>>> confusion with PMD mapped THP related statistics.
>>>
>>> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
>>> Signed-off-by: Baolin Wang <[email protected]>
>>> Acked-by: David Hildenbrand <[email protected]>
>>> Cc: Muchun Song <[email protected]>
>>> Signed-off-by: Andrew Morton <[email protected]>
>>>
>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
>>> also made thp_split_page didn't count mTHP anymore.
>>>
>>> However Zi Yan's patch does make the code more robust and we don't
>>> need to worry about the miscounting issue anymore if we will add
>>> deferred_split_page and split_page counters for mTHP in the future.
>>
>> Actually, the patch above does not fix everything. A fully unmapped
>> PTE-mapped order-9 THP is also added to deferred split list and
>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
>> the order-9 folio is folio_test_pmd_mappable().
>>
>> I will add this information in the next version.
>
> It might
> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
> due to implementation. I will add this to the commit log as well without Fixes
> tag.

Right, so it's always been a problem for PTE-mapped PMD-sized THP and
only with batching we can now do "better". But not fix it completely.

I'll reply separately to your other mail.

--
Cheers,

David / dhildenb


2024-04-12 19:33:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12.04.24 16:35, Zi Yan wrote:
> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>
>> On 11.04.24 17:32, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>> if any page in a folio loses its final mapping. It is possible that
>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>> to deferred split list at all. Fix it by checking folio mapcount before
>>> adding a folio to deferred split list.
>>>
>>> Signed-off-by: Zi Yan <[email protected]>
>>> ---
>>> mm/rmap.c | 9 ++++++---
>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..d599a772e282 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>> enum rmap_level level)
>>> {
>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>> - int last, nr = 0, nr_pmdmapped = 0;
>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>> enum node_stat_item idx;
>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>> break;
>>> }
>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>> + mapcount = atomic_sub_return(nr_pages,
>>> + &folio->_large_mapcount) + 1;
>>
>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>
> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> below, but to save an atomic op, I chose to read mapcount here.

Some points:

(1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
RMW that return a value -- and how they interact with memory barriers.
Further, how relaxed variants are only optimized on some architectures.

atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
access that should not be refetched. Usually cheaper than most other stuff
that involves atomics.

(2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
to figure out if the folio is now completely unmapped.

(3) There is one fundamental issue: if we are not batch-unmapping the whole
thing, we will still add the folios to the deferred split queue. Migration
would still do that, or if there are multiple VMAs covering a folio.

(4) We should really avoid making common operations slower only to make
some unreliable stats less unreliable.


We should likely do something like the following, which might even be a bit
faster in some cases because we avoid a function call in case we unmap
individual PTEs by checking _deferred_list ahead of time

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..356598b3dc3c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
* page of the folio is unmapped and at least one page
* is still mapped.
*/
- if (folio_test_large(folio) && folio_test_anon(folio))
- if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
- deferred_split_folio(folio);
+ if (folio_test_large(folio) && folio_test_anon(folio) &&
+ (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
+ atomic_read(mapped) &&
+ data_race(list_empty(&folio->_deferred_list)))
+ deferred_split_folio(folio);
}


I also thought about handling the scenario where we unmap the whole
think in smaller chunks. We could detect "!atomic_read(mapped)" and
detect that it is on the deferred split list, and simply remove it
from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.

But it would be racy with concurrent remapping of the folio (might happen with
anon folios in corner cases I guess).

What we can do is the following, though:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dc30139590e6..f05cba1807f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
ds_queue = get_deferred_split_queue(folio);
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (!list_empty(&folio->_deferred_list)) {
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
ds_queue->split_queue_len--;
list_del_init(&folio->_deferred_list);
}

Adding the right event of course.


Then it's easy to filter out these "temporarily added to the list, but never split
before the folio was freed" cases.


--
Cheers,

David / dhildenb


2024-04-12 19:36:49

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12.04.24 20:29, Yang Shi wrote:
> On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <[email protected]> wrote:
>>
>> On 12 Apr 2024, at 10:21, Zi Yan wrote:
>>
>>> On 11 Apr 2024, at 17:59, Yang Shi wrote:
>>>
>>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 11.04.24 21:01, Yang Shi wrote:
>>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
>>>>>>>
>>>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>>>> From: Zi Yan <[email protected]>
>>>>>>>>
>>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>>>> adding a folio to deferred split list.
>>>>>>>>
>>>>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>>>>> ---
>>>>>>>> mm/rmap.c | 9 ++++++---
>>>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>> enum rmap_level level)
>>>>>>>> {
>>>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>>>> enum node_stat_item idx;
>>>>>>>>
>>>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>> break;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>>>>> + &folio->_large_mapcount) + 1;
>>>>>>>
>>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
>>>>>>> below. Re-reading should be fine here.
>>>>>>>
>>>>>>>> do {
>>>>>>>> last = atomic_add_negative(-1, &page->_mapcount);
>>>>>>>> if (last) {
>>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>>>> * is still mapped.
>>>>>>>> */
>>>>>>>> if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>>>>>> + if ((level == RMAP_LEVEL_PTE &&
>>>>>>>> + mapcount != 0) ||
>>>>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
>>>>>>>> deferred_split_folio(folio);
>>>>>>>> }
>>>>>>>
>>>>>>> But I do wonder if we really care? Usually the folio will simply get
>>>>>>> freed afterwards, where we simply remove it from the list.
>>>>>>>
>>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
>>>>>>> corner case ...
>>>>>>>
>>>>>>> Is it really worth the added code? Not convinced.
>>>>>>
>>>>>> It is actually not only an optimization, but also fixed the broken
>>>>>> thp_deferred_split_page counter in /proc/vmstat.
>>>>>>
>>>>>> The counter actually counted the partially unmapped huge pages (so
>>>>>> they are on deferred split queue), but it counts the fully unmapped
>>>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
>>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
>>>>>> now.
>>>>>>
>>>>>> The counter is also useful for performance analysis, for example,
>>>>>> whether a workload did a lot of partial unmap or not. So fixing the
>>>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
>>>>>> log.
>>>>>
>>>>> Yes, all that is information that is missing from the patch description.
>>>>> If it's a fix, there should be a "Fixes:".
>>>>>
>>>>> Likely we want to have a folio_large_mapcount() check in the code below.
>>>>> (I yet have to digest the condition where this happens -- can we have an
>>>>> example where we'd use to do the wrong thing and now would do the right
>>>>> thing as well?)
>>>>
>>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
>>>> some full 64K areas, you will see thp_deferred_split_page increased,
>>>> but it shouldn't.
>>>>
>>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
>>>> is fully unmapped or partially unmapped by comparing the number of
>>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
>>>> PMD-mappable THP.
>>>>
>>>> However I just realized this problem was kind of workaround'ed by commit:
>>>>
>>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
>>>> Author: Baolin Wang <[email protected]>
>>>> Date: Fri Mar 29 14:59:33 2024 +0800
>>>>
>>>> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
>>>> split statistics
>>>>
>>>> Now the mTHP can also be split or added into the deferred list, so add
>>>> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
>>>> confusion with PMD mapped THP related statistics.
>>>>
>>>> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
>>>> Signed-off-by: Baolin Wang <[email protected]>
>>>> Acked-by: David Hildenbrand <[email protected]>
>>>> Cc: Muchun Song <[email protected]>
>>>> Signed-off-by: Andrew Morton <[email protected]>
>>>>
>>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
>>>> also made thp_split_page didn't count mTHP anymore.
>>>>
>>>> However Zi Yan's patch does make the code more robust and we don't
>>>> need to worry about the miscounting issue anymore if we will add
>>>> deferred_split_page and split_page counters for mTHP in the future.
>>>
>>> Actually, the patch above does not fix everything. A fully unmapped
>>> PTE-mapped order-9 THP is also added to deferred split list and
>>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
>>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
>>> the order-9 folio is folio_test_pmd_mappable().
>>>
>>> I will add this information in the next version.
>>
>> It might
>> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
>> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
>> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
>> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
>> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
>> due to implementation. I will add this to the commit log as well without Fixes
>> tag.
>
> Thanks for digging deeper. The problem may be not that obvious before
> mTHP because PMD-mappable THP is converted to PTE-mapped due to
> partial unmap in most cases. But mTHP is always PTE-mapped in the
> first place. The other reason is batched rmap remove was not supported
> before David's optimization.

Yes.

>
> Now we do have reasonable motivation to make it precise and it is also
> easier to do so than before.

If by "precise" you mean "less unreliable in some cases", yes. See my
other mail.

--
Cheers,

David / dhildenb


2024-04-12 20:21:43

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Fri, Apr 12, 2024 at 12:36 PM David Hildenbrand <[email protected]> wrote:
>
> On 12.04.24 20:29, Yang Shi wrote:
> > On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <[email protected]> wrote:
> >>
> >> On 12 Apr 2024, at 10:21, Zi Yan wrote:
> >>
> >>> On 11 Apr 2024, at 17:59, Yang Shi wrote:
> >>>
> >>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 11.04.24 21:01, Yang Shi wrote:
> >>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>>>> From: Zi Yan <[email protected]>
> >>>>>>>>
> >>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>>>> adding a folio to deferred split list.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Zi Yan <[email protected]>
> >>>>>>>> ---
> >>>>>>>> mm/rmap.c | 9 ++++++---
> >>>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>>>> --- a/mm/rmap.c
> >>>>>>>> +++ b/mm/rmap.c
> >>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>>> enum rmap_level level)
> >>>>>>>> {
> >>>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>>>> - int last, nr = 0, nr_pmdmapped = 0;
> >>>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>>>> enum node_stat_item idx;
> >>>>>>>>
> >>>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>>> break;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>>>> + mapcount = atomic_sub_return(nr_pages,
> >>>>>>>> + &folio->_large_mapcount) + 1;
> >>>>>>>
> >>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it
> >>>>>>> below. Re-reading should be fine here.
> >>>>>>>
> >>>>>>>> do {
> >>>>>>>> last = atomic_add_negative(-1, &page->_mapcount);
> >>>>>>>> if (last) {
> >>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>>>> * is still mapped.
> >>>>>>>> */
> >>>>>>>> if (folio_test_large(folio) && folio_test_anon(folio))
> >>>>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>>>>>>> + if ((level == RMAP_LEVEL_PTE &&
> >>>>>>>> + mapcount != 0) ||
> >>>>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
> >>>>>>>> deferred_split_folio(folio);
> >>>>>>>> }
> >>>>>>>
> >>>>>>> But I do wonder if we really care? Usually the folio will simply get
> >>>>>>> freed afterwards, where we simply remove it from the list.
> >>>>>>>
> >>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a
> >>>>>>> corner case ...
> >>>>>>>
> >>>>>>> Is it really worth the added code? Not convinced.
> >>>>>>
> >>>>>> It is actually not only an optimization, but also fixed the broken
> >>>>>> thp_deferred_split_page counter in /proc/vmstat.
> >>>>>>
> >>>>>> The counter actually counted the partially unmapped huge pages (so
> >>>>>> they are on deferred split queue), but it counts the fully unmapped
> >>>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the
> >>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does
> >>>>>> now.
> >>>>>>
> >>>>>> The counter is also useful for performance analysis, for example,
> >>>>>> whether a workload did a lot of partial unmap or not. So fixing the
> >>>>>> counter seems worthy. Zi Yan should have mentioned this in the commit
> >>>>>> log.
> >>>>>
> >>>>> Yes, all that is information that is missing from the patch description.
> >>>>> If it's a fix, there should be a "Fixes:".
> >>>>>
> >>>>> Likely we want to have a folio_large_mapcount() check in the code below.
> >>>>> (I yet have to digest the condition where this happens -- can we have an
> >>>>> example where we'd use to do the wrong thing and now would do the right
> >>>>> thing as well?)
> >>>>
> >>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
> >>>> some full 64K areas, you will see thp_deferred_split_page increased,
> >>>> but it shouldn't.
> >>>>
> >>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP
> >>>> is fully unmapped or partially unmapped by comparing the number of
> >>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for
> >>>> PMD-mappable THP.
> >>>>
> >>>> However I just realized this problem was kind of workaround'ed by commit:
> >>>>
> >>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28
> >>>> Author: Baolin Wang <[email protected]>
> >>>> Date: Fri Mar 29 14:59:33 2024 +0800
> >>>>
> >>>> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP
> >>>> split statistics
> >>>>
> >>>> Now the mTHP can also be split or added into the deferred list, so add
> >>>> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
> >>>> confusion with PMD mapped THP related statistics.
> >>>>
> >>>> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
> >>>> Signed-off-by: Baolin Wang <[email protected]>
> >>>> Acked-by: David Hildenbrand <[email protected]>
> >>>> Cc: Muchun Song <[email protected]>
> >>>> Signed-off-by: Andrew Morton <[email protected]>
> >>>>
> >>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it
> >>>> also made thp_split_page didn't count mTHP anymore.
> >>>>
> >>>> However Zi Yan's patch does make the code more robust and we don't
> >>>> need to worry about the miscounting issue anymore if we will add
> >>>> deferred_split_page and split_page counters for mTHP in the future.
> >>>
> >>> Actually, the patch above does not fix everything. A fully unmapped
> >>> PTE-mapped order-9 THP is also added to deferred split list and
> >>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
> >>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> >>> the order-9 folio is folio_test_pmd_mappable().
> >>>
> >>> I will add this information in the next version.
> >>
> >> It might
> >> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
> >> but before this commit fully unmapping a PTE-mapped order-9 THP still increased
> >> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE
> >> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99
> >> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is
> >> due to implementation. I will add this to the commit log as well without Fixes
> >> tag.
> >
> > Thanks for digging deeper. The problem may be not that obvious before
> > mTHP because PMD-mappable THP is converted to PTE-mapped due to
> > partial unmap in most cases. But mTHP is always PTE-mapped in the
> > first place. The other reason is batched rmap remove was not supported
> > before David's optimization.
>
> Yes.
>
> >
> > Now we do have reasonable motivation to make it precise and it is also
> > easier to do so than before.
>
> If by "precise" you mean "less unreliable in some cases", yes. See my
> other mail.

Yes, definitely. Make the unreliability somehow acceptable.

>
> --
> Cheers,
>
> David / dhildenb
>

2024-04-12 20:37:06

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Fri, Apr 12, 2024 at 12:33 PM David Hildenbrand <[email protected]> wrote:
>
> On 12.04.24 16:35, Zi Yan wrote:
> > On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >
> >> On 11.04.24 17:32, Zi Yan wrote:
> >>> From: Zi Yan <[email protected]>
> >>>
> >>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>> if any page in a folio loses its final mapping. It is possible that
> >>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>> to deferred split list at all. Fix it by checking folio mapcount before
> >>> adding a folio to deferred split list.
> >>>
> >>> Signed-off-by: Zi Yan <[email protected]>
> >>> ---
> >>> mm/rmap.c | 9 ++++++---
> >>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..d599a772e282 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>> enum rmap_level level)
> >>> {
> >>> atomic_t *mapped = &folio->_nr_pages_mapped;
> >>> - int last, nr = 0, nr_pmdmapped = 0;
> >>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>> enum node_stat_item idx;
> >>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>> break;
> >>> }
> >>> - atomic_sub(nr_pages, &folio->_large_mapcount);
> >>> + mapcount = atomic_sub_return(nr_pages,
> >>> + &folio->_large_mapcount) + 1;
> >>
> >> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >
> > Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> > below, but to save an atomic op, I chose to read mapcount here.
>
> Some points:
>
> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> RMW that return a value -- and how they interact with memory barriers.
> Further, how relaxed variants are only optimized on some architectures.
>
> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> access that should not be refetched. Usually cheaper than most other stuff
> that involves atomics.
>
> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> to figure out if the folio is now completely unmapped.
>
> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> thing, we will still add the folios to the deferred split queue. Migration
> would still do that, or if there are multiple VMAs covering a folio.

Maybe we can let rmap remove code not touch deferred split queue in
migration path at all because we know all the PTEs will be converted
to migration entries. And all the migration entries will be converted
back to PTEs regardless of whether try_to_migrate succeeded or not.

>
> (4) We should really avoid making common operations slower only to make
> some unreliable stats less unreliable.
>
>
> We should likely do something like the following, which might even be a bit
> faster in some cases because we avoid a function call in case we unmap
> individual PTEs by checking _deferred_list ahead of time
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..356598b3dc3c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> * page of the folio is unmapped and at least one page
> * is still mapped.
> */
> - if (folio_test_large(folio) && folio_test_anon(folio))
> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> - deferred_split_folio(folio);
> + if (folio_test_large(folio) && folio_test_anon(folio) &&
> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> + atomic_read(mapped) &&
> + data_race(list_empty(&folio->_deferred_list)))
> + deferred_split_folio(folio);
> }
>
>
> I also thought about handling the scenario where we unmap the whole
> think in smaller chunks. We could detect "!atomic_read(mapped)" and
> detect that it is on the deferred split list, and simply remove it
> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>
> But it would be racy with concurrent remapping of the folio (might happen with
> anon folios in corner cases I guess).
>
> What we can do is the following, though:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index dc30139590e6..f05cba1807f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> ds_queue = get_deferred_split_queue(folio);
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> if (!list_empty(&folio->_deferred_list)) {
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> }
>
> Adding the right event of course.
>
>
> Then it's easy to filter out these "temporarily added to the list, but never split
> before the folio was freed" cases.
>
>
> --
> Cheers,
>
> David / dhildenb
>

2024-04-12 21:06:34

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12 Apr 2024, at 15:32, David Hildenbrand wrote:

> On 12.04.24 16:35, Zi Yan wrote:
>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>
>>> On 11.04.24 17:32, Zi Yan wrote:
>>>> From: Zi Yan <[email protected]>
>>>>
>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>> if any page in a folio loses its final mapping. It is possible that
>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>> adding a folio to deferred split list.
>>>>
>>>> Signed-off-by: Zi Yan <[email protected]>
>>>> ---
>>>> mm/rmap.c | 9 ++++++---
>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 2608c40dffad..d599a772e282 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>> enum rmap_level level)
>>>> {
>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>> enum node_stat_item idx;
>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>> break;
>>>> }
>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>> + mapcount = atomic_sub_return(nr_pages,
>>>> + &folio->_large_mapcount) + 1;
>>>
>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>
>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>> below, but to save an atomic op, I chose to read mapcount here.
>
> Some points:
>
> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> RMW that return a value -- and how they interact with memory barriers.
> Further, how relaxed variants are only optimized on some architectures.
>
> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> access that should not be refetched. Usually cheaper than most other stuff
> that involves atomics.

I should have checked the actual implementation instead of being fooled
by the name. Will read about it. Thanks.

>
> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> to figure out if the folio is now completely unmapped.
>
> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> thing, we will still add the folios to the deferred split queue. Migration
> would still do that, or if there are multiple VMAs covering a folio.
>
> (4) We should really avoid making common operations slower only to make
> some unreliable stats less unreliable.
>
>
> We should likely do something like the following, which might even be a bit
> faster in some cases because we avoid a function call in case we unmap
> individual PTEs by checking _deferred_list ahead of time
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..356598b3dc3c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> * page of the folio is unmapped and at least one page
> * is still mapped.
> */
> - if (folio_test_large(folio) && folio_test_anon(folio))
> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> - deferred_split_folio(folio);
> + if (folio_test_large(folio) && folio_test_anon(folio) &&
> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> + atomic_read(mapped) &&
> + data_race(list_empty(&folio->_deferred_list)))

data_race() might not be needed, as Ryan pointed out[1]

> + deferred_split_folio(folio);
> }
>
> I also thought about handling the scenario where we unmap the whole
> think in smaller chunks. We could detect "!atomic_read(mapped)" and
> detect that it is on the deferred split list, and simply remove it
> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>
> But it would be racy with concurrent remapping of the folio (might happen with
> anon folios in corner cases I guess).
>
> What we can do is the following, though:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index dc30139590e6..f05cba1807f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> ds_queue = get_deferred_split_queue(folio);
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> if (!list_empty(&folio->_deferred_list)) {
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> }
>
> Adding the right event of course.
>
>
> Then it's easy to filter out these "temporarily added to the list, but never split
> before the folio was freed" cases.

So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.

I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
splits, why not just move the counter to deferred_split_scan(), where the actual
split happens. Or the counter has a different meaning?



[1] https://lore.kernel.org/linux-mm/[email protected]/

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2024-04-12 22:30:00

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <[email protected]> wrote:
>
> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>
> > On 12.04.24 16:35, Zi Yan wrote:
> >> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>
> >>> On 11.04.24 17:32, Zi Yan wrote:
> >>>> From: Zi Yan <[email protected]>
> >>>>
> >>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>> if any page in a folio loses its final mapping. It is possible that
> >>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>> adding a folio to deferred split list.
> >>>>
> >>>> Signed-off-by: Zi Yan <[email protected]>
> >>>> ---
> >>>> mm/rmap.c | 9 ++++++---
> >>>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 2608c40dffad..d599a772e282 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>> enum rmap_level level)
> >>>> {
> >>>> atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>> - int last, nr = 0, nr_pmdmapped = 0;
> >>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>> enum node_stat_item idx;
> >>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>> break;
> >>>> }
> >>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>> + mapcount = atomic_sub_return(nr_pages,
> >>>> + &folio->_large_mapcount) + 1;
> >>>
> >>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >>
> >> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >> below, but to save an atomic op, I chose to read mapcount here.
> >
> > Some points:
> >
> > (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> > RMW that return a value -- and how they interact with memory barriers.
> > Further, how relaxed variants are only optimized on some architectures.
> >
> > atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> > access that should not be refetched. Usually cheaper than most other stuff
> > that involves atomics.
>
> I should have checked the actual implementation instead of being fooled
> by the name. Will read about it. Thanks.
>
> >
> > (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> > to figure out if the folio is now completely unmapped.
> >
> > (3) There is one fundamental issue: if we are not batch-unmapping the whole
> > thing, we will still add the folios to the deferred split queue. Migration
> > would still do that, or if there are multiple VMAs covering a folio.
> >
> > (4) We should really avoid making common operations slower only to make
> > some unreliable stats less unreliable.
> >
> >
> > We should likely do something like the following, which might even be a bit
> > faster in some cases because we avoid a function call in case we unmap
> > individual PTEs by checking _deferred_list ahead of time
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2608c40dffad..356598b3dc3c 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> > * page of the folio is unmapped and at least one page
> > * is still mapped.
> > */
> > - if (folio_test_large(folio) && folio_test_anon(folio))
> > - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > - deferred_split_folio(folio);
> > + if (folio_test_large(folio) && folio_test_anon(folio) &&
> > + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> > + atomic_read(mapped) &&
> > + data_race(list_empty(&folio->_deferred_list)))
>
> data_race() might not be needed, as Ryan pointed out[1]
>
> > + deferred_split_folio(folio);
> > }
> >
> > I also thought about handling the scenario where we unmap the whole
> > think in smaller chunks. We could detect "!atomic_read(mapped)" and
> > detect that it is on the deferred split list, and simply remove it
> > from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >
> > But it would be racy with concurrent remapping of the folio (might happen with
> > anon folios in corner cases I guess).
> >
> > What we can do is the following, though:
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index dc30139590e6..f05cba1807f2 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> > ds_queue = get_deferred_split_queue(folio);
> > spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > if (!list_empty(&folio->_deferred_list)) {
> > + if (folio_test_pmd_mappable(folio))
> > + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> > ds_queue->split_queue_len--;
> > list_del_init(&folio->_deferred_list);
> > }
> >
> > Adding the right event of course.
> >
> >
> > Then it's easy to filter out these "temporarily added to the list, but never split
> > before the folio was freed" cases.
>
> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.

It is definitely possible that the THP on the deferred split queue are
freed instead of split. For example, 1M is unmapped for a 2M THP, then
later the remaining 1M is unmapped, or the process exits before memory
pressure happens. So how come we can tell it is "temporarily added to
list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
actually just counts how many pages are still on deferred split queue.
It may be useful. However the counter is typically used to estimate
how many THP are partially unmapped during a period of time. So we
just need to know the initial value and the value when we read it
again.

>
> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
> splits, why not just move the counter to deferred_split_scan(), where the actual
> split happens. Or the counter has a different meaning?

The deferred_split_scan() / deferred_split_count() just can return the
number of pages on a specific queue (a specific node with a specific
memcg). But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
something? Or you mean traverse all memcgs and all nodes? It sounds
too overkilling.

>
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> --
> Best Regards,
> Yan, Zi

2024-04-12 22:59:26

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12 Apr 2024, at 18:29, Yang Shi wrote:

> On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <[email protected]> wrote:
>>
>> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>>
>>> On 12.04.24 16:35, Zi Yan wrote:
>>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>>
>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>> From: Zi Yan <[email protected]>
>>>>>>
>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>> adding a folio to deferred split list.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>>> ---
>>>>>> mm/rmap.c | 9 ++++++---
>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>> enum rmap_level level)
>>>>>> {
>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>> enum node_stat_item idx;
>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>> break;
>>>>>> }
>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>>> + &folio->_large_mapcount) + 1;
>>>>>
>>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>>
>>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>>> below, but to save an atomic op, I chose to read mapcount here.
>>>
>>> Some points:
>>>
>>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>>> RMW that return a value -- and how they interact with memory barriers.
>>> Further, how relaxed variants are only optimized on some architectures.
>>>
>>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>>> access that should not be refetched. Usually cheaper than most other stuff
>>> that involves atomics.
>>
>> I should have checked the actual implementation instead of being fooled
>> by the name. Will read about it. Thanks.
>>
>>>
>>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>>> to figure out if the folio is now completely unmapped.
>>>
>>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>>> thing, we will still add the folios to the deferred split queue. Migration
>>> would still do that, or if there are multiple VMAs covering a folio.
>>>
>>> (4) We should really avoid making common operations slower only to make
>>> some unreliable stats less unreliable.
>>>
>>>
>>> We should likely do something like the following, which might even be a bit
>>> faster in some cases because we avoid a function call in case we unmap
>>> individual PTEs by checking _deferred_list ahead of time
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..356598b3dc3c 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>> * page of the folio is unmapped and at least one page
>>> * is still mapped.
>>> */
>>> - if (folio_test_large(folio) && folio_test_anon(folio))
>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> - deferred_split_folio(folio);
>>> + if (folio_test_large(folio) && folio_test_anon(folio) &&
>>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
>>> + atomic_read(mapped) &&
>>> + data_race(list_empty(&folio->_deferred_list)))
>>
>> data_race() might not be needed, as Ryan pointed out[1]
>>
>>> + deferred_split_folio(folio);
>>> }
>>>
>>> I also thought about handling the scenario where we unmap the whole
>>> think in smaller chunks. We could detect "!atomic_read(mapped)" and
>>> detect that it is on the deferred split list, and simply remove it
>>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>>>
>>> But it would be racy with concurrent remapping of the folio (might happen with
>>> anon folios in corner cases I guess).
>>>
>>> What we can do is the following, though:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index dc30139590e6..f05cba1807f2 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
>>> ds_queue = get_deferred_split_queue(folio);
>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> if (!list_empty(&folio->_deferred_list)) {
>>> + if (folio_test_pmd_mappable(folio))
>>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
>>> ds_queue->split_queue_len--;
>>> list_del_init(&folio->_deferred_list);
>>> }
>>>
>>> Adding the right event of course.
>>>
>>>
>>> Then it's easy to filter out these "temporarily added to the list, but never split
>>> before the folio was freed" cases.
>>
>> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
>> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
>
> It is definitely possible that the THP on the deferred split queue are
> freed instead of split. For example, 1M is unmapped for a 2M THP, then
> later the remaining 1M is unmapped, or the process exits before memory
> pressure happens. So how come we can tell it is "temporarily added to
> list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> actually just counts how many pages are still on deferred split queue.
> It may be useful. However the counter is typically used to estimate
> how many THP are partially unmapped during a period of time. So we
> just need to know the initial value and the value when we read it
> again.
>
>>
>> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
>> splits, why not just move the counter to deferred_split_scan(), where the actual
>> split happens. Or the counter has a different meaning?
>
> The deferred_split_scan() / deferred_split_count() just can return the
> number of pages on a specific queue (a specific node with a specific
> memcg). But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
> something? Or you mean traverse all memcgs and all nodes? It sounds
> too overkilling.

I mean instead of increasing THP_DEFERRED_SPLIT_PAGE when a folio is added
to the split list, increase it when a folio is split in deferred_split_scan(),
regardless which list the folio is on.

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2024-04-13 00:51:19

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Fri, Apr 12, 2024 at 3:59 PM Zi Yan <[email protected]> wrote:
>
> On 12 Apr 2024, at 18:29, Yang Shi wrote:
>
> > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <[email protected]> wrote:
> >>
> >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
> >>
> >>> On 12.04.24 16:35, Zi Yan wrote:
> >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>>>
> >>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>> From: Zi Yan <[email protected]>
> >>>>>>
> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>> adding a folio to deferred split list.
> >>>>>>
> >>>>>> Signed-off-by: Zi Yan <[email protected]>
> >>>>>> ---
> >>>>>> mm/rmap.c | 9 ++++++---
> >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>> enum rmap_level level)
> >>>>>> {
> >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>> - int last, nr = 0, nr_pmdmapped = 0;
> >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>> enum node_stat_item idx;
> >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>> break;
> >>>>>> }
> >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>> + mapcount = atomic_sub_return(nr_pages,
> >>>>>> + &folio->_large_mapcount) + 1;
> >>>>>
> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >>>>
> >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >>>> below, but to save an atomic op, I chose to read mapcount here.
> >>>
> >>> Some points:
> >>>
> >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> >>> RMW that return a value -- and how they interact with memory barriers.
> >>> Further, how relaxed variants are only optimized on some architectures.
> >>>
> >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> >>> access that should not be refetched. Usually cheaper than most other stuff
> >>> that involves atomics.
> >>
> >> I should have checked the actual implementation instead of being fooled
> >> by the name. Will read about it. Thanks.
> >>
> >>>
> >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> >>> to figure out if the folio is now completely unmapped.
> >>>
> >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> >>> thing, we will still add the folios to the deferred split queue. Migration
> >>> would still do that, or if there are multiple VMAs covering a folio.
> >>>
> >>> (4) We should really avoid making common operations slower only to make
> >>> some unreliable stats less unreliable.
> >>>
> >>>
> >>> We should likely do something like the following, which might even be a bit
> >>> faster in some cases because we avoid a function call in case we unmap
> >>> individual PTEs by checking _deferred_list ahead of time
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..356598b3dc3c 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>> * page of the folio is unmapped and at least one page
> >>> * is still mapped.
> >>> */
> >>> - if (folio_test_large(folio) && folio_test_anon(folio))
> >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>> - deferred_split_folio(folio);
> >>> + if (folio_test_large(folio) && folio_test_anon(folio) &&
> >>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> >>> + atomic_read(mapped) &&
> >>> + data_race(list_empty(&folio->_deferred_list)))
> >>
> >> data_race() might not be needed, as Ryan pointed out[1]
> >>
> >>> + deferred_split_folio(folio);
> >>> }
> >>>
> >>> I also thought about handling the scenario where we unmap the whole
> >>> think in smaller chunks. We could detect "!atomic_read(mapped)" and
> >>> detect that it is on the deferred split list, and simply remove it
> >>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >>>
> >>> But it would be racy with concurrent remapping of the folio (might happen with
> >>> anon folios in corner cases I guess).
> >>>
> >>> What we can do is the following, though:
> >>>
> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>> index dc30139590e6..f05cba1807f2 100644
> >>> --- a/mm/huge_memory.c
> >>> +++ b/mm/huge_memory.c
> >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> >>> ds_queue = get_deferred_split_queue(folio);
> >>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >>> if (!list_empty(&folio->_deferred_list)) {
> >>> + if (folio_test_pmd_mappable(folio))
> >>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> >>> ds_queue->split_queue_len--;
> >>> list_del_init(&folio->_deferred_list);
> >>> }
> >>>
> >>> Adding the right event of course.
> >>>
> >>>
> >>> Then it's easy to filter out these "temporarily added to the list, but never split
> >>> before the folio was freed" cases.
> >>
> >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
> >
> > It is definitely possible that the THP on the deferred split queue are
> > freed instead of split. For example, 1M is unmapped for a 2M THP, then
> > later the remaining 1M is unmapped, or the process exits before memory
> > pressure happens. So how come we can tell it is "temporarily added to
> > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> > actually just counts how many pages are still on deferred split queue.
> > It may be useful. However the counter is typically used to estimate
> > how many THP are partially unmapped during a period of time. So we
> > just need to know the initial value and the value when we read it
> > again.
> >
> >>
> >> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
> >> splits, why not just move the counter to deferred_split_scan(), where the actual
> >> split happens. Or the counter has a different meaning?
> >
> > The deferred_split_scan() / deferred_split_count() just can return the
> > number of pages on a specific queue (a specific node with a specific
> > memcg). But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
> > something? Or you mean traverse all memcgs and all nodes? It sounds
> > too overkilling.
>
> I mean instead of increasing THP_DEFERRED_SPLIT_PAGE when a folio is added
> to the split list, increase it when a folio is split in deferred_split_scan(),
> regardless which list the folio is on.

It will have overlap with thp_split_page. And what if memory pressure
doesn't happen? The counter will be 0 even though thousands THP have
been partially unmapped.

>
> --
> Best Regards,
> Yan, Zi

2024-04-15 15:13:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12.04.24 23:06, Zi Yan wrote:
> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>
>> On 12.04.24 16:35, Zi Yan wrote:
>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>
>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>> From: Zi Yan <[email protected]>
>>>>>
>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>> adding a folio to deferred split list.
>>>>>
>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>> ---
>>>>> mm/rmap.c | 9 ++++++---
>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2608c40dffad..d599a772e282 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>> enum rmap_level level)
>>>>> {
>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>> enum node_stat_item idx;
>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>> break;
>>>>> }
>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>> + &folio->_large_mapcount) + 1;
>>>>
>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>
>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>> below, but to save an atomic op, I chose to read mapcount here.
>>
>> Some points:
>>
>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>> RMW that return a value -- and how they interact with memory barriers.
>> Further, how relaxed variants are only optimized on some architectures.
>>
>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>> access that should not be refetched. Usually cheaper than most other stuff
>> that involves atomics.
>
> I should have checked the actual implementation instead of being fooled
> by the name. Will read about it. Thanks.
>
>>
>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>> to figure out if the folio is now completely unmapped.
>>
>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>> thing, we will still add the folios to the deferred split queue. Migration
>> would still do that, or if there are multiple VMAs covering a folio.
>>
>> (4) We should really avoid making common operations slower only to make
>> some unreliable stats less unreliable.
>>
>>
>> We should likely do something like the following, which might even be a bit
>> faster in some cases because we avoid a function call in case we unmap
>> individual PTEs by checking _deferred_list ahead of time
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2608c40dffad..356598b3dc3c 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>> * page of the folio is unmapped and at least one page
>> * is still mapped.
>> */
>> - if (folio_test_large(folio) && folio_test_anon(folio))
>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>> - deferred_split_folio(folio);
>> + if (folio_test_large(folio) && folio_test_anon(folio) &&
>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
>> + atomic_read(mapped) &&
>> + data_race(list_empty(&folio->_deferred_list)))
>
> data_race() might not be needed, as Ryan pointed out[1]

Right, I keep getting confused by that. Likely we should add data_race()
only if we get actual reports.

--
Cheers,

David / dhildenb


2024-04-15 15:41:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 13.04.24 00:29, Yang Shi wrote:
> On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <[email protected]> wrote:
>>
>> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>>
>>> On 12.04.24 16:35, Zi Yan wrote:
>>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>>
>>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>>> From: Zi Yan <[email protected]>
>>>>>>
>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>>> adding a folio to deferred split list.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>>> ---
>>>>>> mm/rmap.c | 9 ++++++---
>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 2608c40dffad..d599a772e282 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>> enum rmap_level level)
>>>>>> {
>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>>> enum node_stat_item idx;
>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>>> break;
>>>>>> }
>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>>> + &folio->_large_mapcount) + 1;
>>>>>
>>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>>
>>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>>> below, but to save an atomic op, I chose to read mapcount here.
>>>
>>> Some points:
>>>
>>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>>> RMW that return a value -- and how they interact with memory barriers.
>>> Further, how relaxed variants are only optimized on some architectures.
>>>
>>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>>> access that should not be refetched. Usually cheaper than most other stuff
>>> that involves atomics.
>>
>> I should have checked the actual implementation instead of being fooled
>> by the name. Will read about it. Thanks.
>>
>>>
>>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>>> to figure out if the folio is now completely unmapped.
>>>
>>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>>> thing, we will still add the folios to the deferred split queue. Migration
>>> would still do that, or if there are multiple VMAs covering a folio.
>>>
>>> (4) We should really avoid making common operations slower only to make
>>> some unreliable stats less unreliable.
>>>
>>>
>>> We should likely do something like the following, which might even be a bit
>>> faster in some cases because we avoid a function call in case we unmap
>>> individual PTEs by checking _deferred_list ahead of time
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..356598b3dc3c 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>> * page of the folio is unmapped and at least one page
>>> * is still mapped.
>>> */
>>> - if (folio_test_large(folio) && folio_test_anon(folio))
>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> - deferred_split_folio(folio);
>>> + if (folio_test_large(folio) && folio_test_anon(folio) &&
>>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
>>> + atomic_read(mapped) &&
>>> + data_race(list_empty(&folio->_deferred_list)))
>>
>> data_race() might not be needed, as Ryan pointed out[1]
>>
>>> + deferred_split_folio(folio);
>>> }
>>>
>>> I also thought about handling the scenario where we unmap the whole
>>> think in smaller chunks. We could detect "!atomic_read(mapped)" and
>>> detect that it is on the deferred split list, and simply remove it
>>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
>>>
>>> But it would be racy with concurrent remapping of the folio (might happen with
>>> anon folios in corner cases I guess).
>>>
>>> What we can do is the following, though:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index dc30139590e6..f05cba1807f2 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
>>> ds_queue = get_deferred_split_queue(folio);
>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> if (!list_empty(&folio->_deferred_list)) {
>>> + if (folio_test_pmd_mappable(folio))
>>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
>>> ds_queue->split_queue_len--;
>>> list_del_init(&folio->_deferred_list);
>>> }
>>>
>>> Adding the right event of course.
>>>
>>>
>>> Then it's easy to filter out these "temporarily added to the list, but never split
>>> before the folio was freed" cases.
>>
>> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
>> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
>
> It is definitely possible that the THP on the deferred split queue are
> freed instead of split. For example, 1M is unmapped for a 2M THP, then
> later the remaining 1M is unmapped, or the process exits before memory
> pressure happens. So how come we can tell it is "temporarily added to
> list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> actually just counts how many pages are still on deferred split queue.

Not quite I think. I don't think we have a counter that counts how many
large folios on the deferred list were split. I think we only have
THP_SPLIT_PAGE.

We could have
* THP_DEFERRED_SPLIT_PAGE
* THP_UNDO_DEFERRED_SPLIT_PAGE
* THP_PERFORM_DEFERRED_SPLIT_PAGE

Maybe that would catch more cases (not sure if all, though). Then, you
could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.

That could give one a clearer picture how deferred split interacts with
actual splitting (possibly under memory pressure), the whole reason why
deferred splitting was added after all.

> It may be useful. However the counter is typically used to estimate
> how many THP are partially unmapped during a period of time.

I'd say it's a bit of an abuse of that counter; well, or interpreting
something into the counter that that counter never reliably represented.

I can easily write a program that keeps sending your counter to infinity
simply by triggering that behavior in a loop, so it's all a bit shaky.

Something like Ryans script makes more sense, where you get a clearer
picture of what's mapped where and how. Because that information can be
much more valuable than just knowing if it's mapped fully or partially
(again, relevant for handling with memory waste).

--
Cheers,

David / dhildenb


2024-04-15 15:44:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On 12.04.24 22:35, Yang Shi wrote:
> On Fri, Apr 12, 2024 at 12:33 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 12.04.24 16:35, Zi Yan wrote:
>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
>>>
>>>> On 11.04.24 17:32, Zi Yan wrote:
>>>>> From: Zi Yan <[email protected]>
>>>>>
>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>>> if any page in a folio loses its final mapping. It is possible that
>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>>> to deferred split list at all. Fix it by checking folio mapcount before
>>>>> adding a folio to deferred split list.
>>>>>
>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>> ---
>>>>> mm/rmap.c | 9 ++++++---
>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2608c40dffad..d599a772e282 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>> enum rmap_level level)
>>>>> {
>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>> - int last, nr = 0, nr_pmdmapped = 0;
>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
>>>>> enum node_stat_item idx;
>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>> break;
>>>>> }
>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
>>>>> + mapcount = atomic_sub_return(nr_pages,
>>>>> + &folio->_large_mapcount) + 1;
>>>>
>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
>>>
>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
>>> below, but to save an atomic op, I chose to read mapcount here.
>>
>> Some points:
>>
>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
>> RMW that return a value -- and how they interact with memory barriers.
>> Further, how relaxed variants are only optimized on some architectures.
>>
>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
>> access that should not be refetched. Usually cheaper than most other stuff
>> that involves atomics.
>>
>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
>> to figure out if the folio is now completely unmapped.
>>
>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
>> thing, we will still add the folios to the deferred split queue. Migration
>> would still do that, or if there are multiple VMAs covering a folio.
>
> Maybe we can let rmap remove code not touch deferred split queue in
> migration path at all because we know all the PTEs will be converted
> to migration entries. And all the migration entries will be converted
> back to PTEs regardless of whether try_to_migrate succeeded or not.

It's would be just another bandaid I think :/ Maybe a worthwile one, not
sure.

--
Cheers,

David / dhildenb


2024-04-15 17:55:03

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Mon, Apr 15, 2024 at 8:40 AM David Hildenbrand <[email protected]> wrote:
>
> On 13.04.24 00:29, Yang Shi wrote:
> > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <[email protected]> wrote:
> >>
> >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
> >>
> >>> On 12.04.24 16:35, Zi Yan wrote:
> >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>>>
> >>>>> On 11.04.24 17:32, Zi Yan wrote:
> >>>>>> From: Zi Yan <[email protected]>
> >>>>>>
> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>>>> if any page in a folio loses its final mapping. It is possible that
> >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>>>> adding a folio to deferred split list.
> >>>>>>
> >>>>>> Signed-off-by: Zi Yan <[email protected]>
> >>>>>> ---
> >>>>>> mm/rmap.c | 9 ++++++---
> >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index 2608c40dffad..d599a772e282 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>> enum rmap_level level)
> >>>>>> {
> >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>>>> - int last, nr = 0, nr_pmdmapped = 0;
> >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>>> enum node_stat_item idx;
> >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>>> break;
> >>>>>> }
> >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>>>> + mapcount = atomic_sub_return(nr_pages,
> >>>>>> + &folio->_large_mapcount) + 1;
> >>>>>
> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.
> >>>>
> >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >>>> below, but to save an atomic op, I chose to read mapcount here.
> >>>
> >>> Some points:
> >>>
> >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> >>> RMW that return a value -- and how they interact with memory barriers.
> >>> Further, how relaxed variants are only optimized on some architectures.
> >>>
> >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> >>> access that should not be refetched. Usually cheaper than most other stuff
> >>> that involves atomics.
> >>
> >> I should have checked the actual implementation instead of being fooled
> >> by the name. Will read about it. Thanks.
> >>
> >>>
> >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> >>> to figure out if the folio is now completely unmapped.
> >>>
> >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole
> >>> thing, we will still add the folios to the deferred split queue. Migration
> >>> would still do that, or if there are multiple VMAs covering a folio.
> >>>
> >>> (4) We should really avoid making common operations slower only to make
> >>> some unreliable stats less unreliable.
> >>>
> >>>
> >>> We should likely do something like the following, which might even be a bit
> >>> faster in some cases because we avoid a function call in case we unmap
> >>> individual PTEs by checking _deferred_list ahead of time
> >>>
> >>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> index 2608c40dffad..356598b3dc3c 100644
> >>> --- a/mm/rmap.c
> >>> +++ b/mm/rmap.c
> >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>> * page of the folio is unmapped and at least one page
> >>> * is still mapped.
> >>> */
> >>> - if (folio_test_large(folio) && folio_test_anon(folio))
> >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >>> - deferred_split_folio(folio);
> >>> + if (folio_test_large(folio) && folio_test_anon(folio) &&
> >>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> >>> + atomic_read(mapped) &&
> >>> + data_race(list_empty(&folio->_deferred_list)))
> >>
> >> data_race() might not be needed, as Ryan pointed out[1]
> >>
> >>> + deferred_split_folio(folio);
> >>> }
> >>>
> >>> I also thought about handling the scenario where we unmap the whole
> >>> think in smaller chunks. We could detect "!atomic_read(mapped)" and
> >>> detect that it is on the deferred split list, and simply remove it
> >>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >>>
> >>> But it would be racy with concurrent remapping of the folio (might happen with
> >>> anon folios in corner cases I guess).
> >>>
> >>> What we can do is the following, though:
> >>>
> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>> index dc30139590e6..f05cba1807f2 100644
> >>> --- a/mm/huge_memory.c
> >>> +++ b/mm/huge_memory.c
> >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> >>> ds_queue = get_deferred_split_queue(folio);
> >>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >>> if (!list_empty(&folio->_deferred_list)) {
> >>> + if (folio_test_pmd_mappable(folio))
> >>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> >>> ds_queue->split_queue_len--;
> >>> list_del_init(&folio->_deferred_list);
> >>> }
> >>>
> >>> Adding the right event of course.
> >>>
> >>>
> >>> Then it's easy to filter out these "temporarily added to the list, but never split
> >>> before the folio was freed" cases.
> >>
> >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work.
> >
> > It is definitely possible that the THP on the deferred split queue are
> > freed instead of split. For example, 1M is unmapped for a 2M THP, then
> > later the remaining 1M is unmapped, or the process exits before memory
> > pressure happens. So how come we can tell it is "temporarily added to
> > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
> > actually just counts how many pages are still on deferred split queue.
>
> Not quite I think. I don't think we have a counter that counts how many
> large folios on the deferred list were split. I think we only have
> THP_SPLIT_PAGE.

Yes, we just count how many THP were split regardless of why they got
split. They may be split from a deferred split queue due to memory
pressure, migration, etc.

>
> We could have
> * THP_DEFERRED_SPLIT_PAGE
> * THP_UNDO_DEFERRED_SPLIT_PAGE
> * THP_PERFORM_DEFERRED_SPLIT_PAGE
>
> Maybe that would catch more cases (not sure if all, though). Then, you
> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
>
> That could give one a clearer picture how deferred split interacts with
> actual splitting (possibly under memory pressure), the whole reason why
> deferred splitting was added after all.

I'm not quite sure whether there is a solid usecase or not. If we
have, we could consider this. But a simpler counter may be more
preferred.

>
> > It may be useful. However the counter is typically used to estimate
> > how many THP are partially unmapped during a period of time.
>
> I'd say it's a bit of an abuse of that counter; well, or interpreting
> something into the counter that that counter never reliably represented.

It was way more reliable than now.

>
> I can easily write a program that keeps sending your counter to infinity
> simply by triggering that behavior in a loop, so it's all a bit shaky.

I don't doubt that. But let's get back to reality. The counter used to
stay reasonable and reliable with most real life workloads before
mTHP. There may be over-counting, for example, when unmapping a
PTE-mapped THP which was not on a deferred split queue before. But
such a case is not common for real life workloads because the huge PMD
has to be split by partial unmap for most cases. And the partial unmap
will add the THP to deferred split queue.

But now a common workload, for example, just process exit, may
probably send the counter to infinity.

>
> Something like Ryans script makes more sense, where you get a clearer
> picture of what's mapped where and how. Because that information can be
> much more valuable than just knowing if it's mapped fully or partially
> (again, relevant for handling with memory waste).

Ryan's script is very helpful. But the counter has been existing and
used for years, and it is a quick indicator and much easier to monitor
in a large-scale fleet.

If we think the reliability of the counter is not worth fixing, why
don't we just remove it. No counter is better than a broken counter.

>
> --
> Cheers,
>
> David / dhildenb
>

2024-04-15 19:19:49

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

>>
>> We could have
>> * THP_DEFERRED_SPLIT_PAGE
>> * THP_UNDO_DEFERRED_SPLIT_PAGE
>> * THP_PERFORM_DEFERRED_SPLIT_PAGE
>>
>> Maybe that would catch more cases (not sure if all, though). Then, you
>> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
>> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
>>
>> That could give one a clearer picture how deferred split interacts with
>> actual splitting (possibly under memory pressure), the whole reason why
>> deferred splitting was added after all.
>
> I'm not quite sure whether there is a solid usecase or not. If we
> have, we could consider this. But a simpler counter may be more
> preferred.

Yes.

>
>>
>>> It may be useful. However the counter is typically used to estimate
>>> how many THP are partially unmapped during a period of time.
>>
>> I'd say it's a bit of an abuse of that counter; well, or interpreting
>> something into the counter that that counter never reliably represented.
>
> It was way more reliable than now.

Correct me if I am wrong: now that we only adjust the counter for
PMD-sized THP, it is as (un)reliable as it always was.

Or was there another unintended change by some of my cleanups or
previous patches?

>
>>
>> I can easily write a program that keeps sending your counter to infinity
>> simply by triggering that behavior in a loop, so it's all a bit shaky.
>
> I don't doubt that. But let's get back to reality. The counter used to
> stay reasonable and reliable with most real life workloads before
> mTHP. There may be over-counting, for example, when unmapping a
> PTE-mapped THP which was not on a deferred split queue before. But
> such a case is not common for real life workloads because the huge PMD
> has to be split by partial unmap for most cases. And the partial unmap
> will add the THP to deferred split queue.
>
> But now a common workload, for example, just process exit, may
> probably send the counter to infinity.

Agreed, that's stupid.

>
>>
>> Something like Ryans script makes more sense, where you get a clearer
>> picture of what's mapped where and how. Because that information can be
>> much more valuable than just knowing if it's mapped fully or partially
>> (again, relevant for handling with memory waste).
>
> Ryan's script is very helpful. But the counter has been existing and
> used for years, and it is a quick indicator and much easier to monitor
> in a large-scale fleet.
>
> If we think the reliability of the counter is not worth fixing, why
> don't we just remove it. No counter is better than a broken counter.

Again, is only counting the PMD-sized THPs "fixing" the old use cases?
Then it should just stick around. And we can even optimize it for some
more cases as proposed in this patch. But there is no easy way to "get
it completely right" I'm afraid.

--
Cheers,

David / dhildenb


2024-04-15 21:19:43

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

On Mon, Apr 15, 2024 at 12:19 PM David Hildenbrand <[email protected]> wrote:
>
> >>
> >> We could have
> >> * THP_DEFERRED_SPLIT_PAGE
> >> * THP_UNDO_DEFERRED_SPLIT_PAGE
> >> * THP_PERFORM_DEFERRED_SPLIT_PAGE
> >>
> >> Maybe that would catch more cases (not sure if all, though). Then, you
> >> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
> >> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
> >>
> >> That could give one a clearer picture how deferred split interacts with
> >> actual splitting (possibly under memory pressure), the whole reason why
> >> deferred splitting was added after all.
> >
> > I'm not quite sure whether there is a solid usecase or not. If we
> > have, we could consider this. But a simpler counter may be more
> > preferred.
>
> Yes.
>
> >
> >>
> >>> It may be useful. However the counter is typically used to estimate
> >>> how many THP are partially unmapped during a period of time.
> >>
> >> I'd say it's a bit of an abuse of that counter; well, or interpreting
> >> something into the counter that that counter never reliably represented.
> >
> > It was way more reliable than now.
>
> Correct me if I am wrong: now that we only adjust the counter for
> PMD-sized THP, it is as (un)reliable as it always was.

Yes. The problem introduced by mTHP was somehow workaround'ed by that commit.

>
> Or was there another unintended change by some of my cleanups or
> previous patches?

No, at least I didn't see for now.

>
> >
> >>
> >> I can easily write a program that keeps sending your counter to infinity
> >> simply by triggering that behavior in a loop, so it's all a bit shaky.
> >
> > I don't doubt that. But let's get back to reality. The counter used to
> > stay reasonable and reliable with most real life workloads before
> > mTHP. There may be over-counting, for example, when unmapping a
> > PTE-mapped THP which was not on a deferred split queue before. But
> > such a case is not common for real life workloads because the huge PMD
> > has to be split by partial unmap for most cases. And the partial unmap
> > will add the THP to deferred split queue.
> >
> > But now a common workload, for example, just process exit, may
> > probably send the counter to infinity.
>
> Agreed, that's stupid.
>
> >
> >>
> >> Something like Ryans script makes more sense, where you get a clearer
> >> picture of what's mapped where and how. Because that information can be
> >> much more valuable than just knowing if it's mapped fully or partially
> >> (again, relevant for handling with memory waste).
> >
> > Ryan's script is very helpful. But the counter has been existing and
> > used for years, and it is a quick indicator and much easier to monitor
> > in a large-scale fleet.
> >
> > If we think the reliability of the counter is not worth fixing, why
> > don't we just remove it. No counter is better than a broken counter.
>
> Again, is only counting the PMD-sized THPs "fixing" the old use cases?

Yes

> Then it should just stick around. And we can even optimize it for some
> more cases as proposed in this patch. But there is no easy way to "get
> it completely right" I'm afraid.

I don't mean we should revert that "fixing", my point is we should not
rely on it and we should make rmap remove code behave more reliable
regardless of whether we just count PMD-sized THP or not.

>
> --
> Cheers,
>
> David / dhildenb
>