2023-11-17 10:08:09

by Mel Gorman

Subject: Re: [RFC PATCH] mm: support large folio numa balancing

On Wed, Nov 15, 2023 at 10:58:32AM +0800, Huang, Ying wrote:
> Baolin Wang <[email protected]> writes:
>
> > On 11/14/2023 9:12 AM, Huang, Ying wrote:
> >> David Hildenbrand <[email protected]> writes:
> >>
> >>> On 13.11.23 11:45, Baolin Wang wrote:
> >>>> Currently, file pages already support large folios, and support for
> >>>> anonymous pages is also under discussion[1]. Moreover, the NUMA
> >>>> balancing code was converted to use folios by a previous thread[2],
> >>>> and the migrate_pages() function already supports large folio
> >>>> migration. So I do not see any reason to continue restricting NUMA
> >>>> balancing for large folios.
> >>>
> >>> I recall John wanted to look into that. CCing him.
> >>>
> >>> I'll note that the "head page mapcount" heuristic to detect sharers will
> >>> now strike on the PTE path and make us believe that a large folio is
> >>> exclusive, although it isn't.
> >> Even a 4K folio may be shared by multiple processes/threads. So NUMA
> >> balancing uses a multi-stage node selection algorithm (mostly
> >> implemented in should_numa_migrate_memory()) to identify shared folios.
> >> I think that the algorithm needs to be adjusted for shared, PTE-mapped
> >> large folios.
> >
> > Not sure I get you here. should_numa_migrate_memory() uses the last
> > CPU id, the last PID and the group's NUMA faults to determine whether
> > this page can be migrated to the target node. So for a large folio, a
> > precise check of the folio's sharers can make the group's NUMA faults
> > more accurate, which should be enough for should_numa_migrate_memory()
> > to make a decision?
>
> A large folio that is mapped by multiple processes may be accessed from
> only one remote NUMA node, so we still want to migrate it. A large folio
> that is mapped by one process but accessed by multiple threads on
> multiple NUMA nodes may be better left unmigrated.
>

This leads into a generic problem with anything large under NUMA
balancing -- false sharing. As it stands, THP can be falsely shared by
threads if thread-local data is split within a THP range. In this case,
the ideal would be for the THP to be migrated to the hottest node, but
such support doesn't exist. The same applies to folios. If not handled
properly, a large folio of any type can ping-pong between nodes, so
migrating just because we can is not necessarily a good idea. The patch
should cover a realistic case showing why this matters, why splitting
the folio is not better, and include supporting data.
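
To make the false-sharing case concrete, a minimal and untested userspace
sketch of the pattern could look like the snippet below: two threads treat
disjoint halves of the same 2M region as "private" data, so if that region
is backed by a THP or a large folio, NUMA hint faults from both threads
land on the same page. A real test would also pin the threads to CPUs on
different nodes (e.g. with numactl or pthread_setaffinity_np()); that part
is omitted here.

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define REGION	(2UL << 20)	/* one potential THP / large folio */

static void *worker(void *buf)
{
	/* Hammer a "thread-local" 4K chunk inside the shared 2M region. */
	for (int i = 0; i < 100000000; i++)
		memset(buf, i & 0xff, 4096);
	return NULL;
}

int main(void)
{
	char *region = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pthread_t a, b;

	madvise(region, REGION, MADV_HUGEPAGE);

	/* Two threads, two disjoint halves of the same 2M range. */
	pthread_create(&a, NULL, worker, region);
	pthread_create(&b, NULL, worker, region + REGION / 2);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}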

--
Mel Gorman
SUSE Labs


2023-11-17 10:14:28

by Peter Zijlstra

Subject: Re: [RFC PATCH] mm: support large folio numa balancing

On Fri, Nov 17, 2023 at 10:07:45AM +0000, Mel Gorman wrote:

> This leads into a generic problem with anything large under NUMA
> balancing -- false sharing. As it stands, THP can be falsely shared by
> threads if thread-local data is split within a THP range. In this case,
> the ideal would be for the THP to be migrated to the hottest node, but
> such support doesn't exist. The same applies to folios. If not handled
> properly, a large folio of any type can ping-pong between nodes, so
> migrating just because we can is not necessarily a good idea. The patch
> should cover a realistic case showing why this matters, why splitting
> the folio is not better, and include supporting data.

Would it make sense to have THP merging conditional on all (most?) pages
having the same node?

2023-11-17 16:05:26

by Mel Gorman

Subject: Re: [RFC PATCH] mm: support large folio numa balancing

On Fri, Nov 17, 2023 at 11:13:43AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 17, 2023 at 10:07:45AM +0000, Mel Gorman wrote:
>
> > This leads into a generic problem with anything large under NUMA
> > balancing -- false sharing. As it stands, THP can be falsely shared by
> > threads if thread-local data is split within a THP range. In this case,
> > the ideal would be for the THP to be migrated to the hottest node, but
> > such support doesn't exist. The same applies to folios. If not handled
> > properly, a large folio of any type can ping-pong between nodes, so
> > migrating just because we can is not necessarily a good idea. The patch
> > should cover a realistic case showing why this matters, why splitting
> > the folio is not better, and include supporting data.
>
> Would it make sense to have THP merging conditional on all (most?) pages
> having the same node?

Potentially yes, maybe with something similar to max_ptes_none, but it has
corner cases of its own. THP can be allocated up-front, so we don't get
the per-base-page hints unless the page is first split. I experimented
with this once upon a time, but the cost of splitting was not offset by
the smarter NUMA placement. While we could always allocate small pages
and promote later (originally known as the promotion threshold), that
was known to have significant penalties of its own, so we still eagerly
allocate THP. Part of that is that KVM was the main load to benefit from
THP and it always preferred eager promotion. Even if we always started
with base pages, sparse addressing within the THP range may mean the
threshold for collapsing can never be reached.
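
For illustration only, a rough sketch of what such a node-locality
condition could look like follows; this is not actual khugepaged code and
"max_ptes_remote" is an invented knob used purely for this example:

/*
 * Hypothetical sketch, not actual khugepaged code: before collapsing a
 * PMD-sized range, look at the node of each base page that is present
 * and refuse the collapse if too many sit on a different node than the
 * first one, in the spirit of max_ptes_none.
 */
static bool range_is_node_local(struct page **pages, unsigned int nr_pages,
				unsigned int max_ptes_remote)
{
	int first_nid = NUMA_NO_NODE;
	unsigned int remote = 0;

	for (unsigned int i = 0; i < nr_pages; i++) {
		int nid;

		if (!pages[i])		/* pte_none() hole, ignore */
			continue;

		nid = page_to_nid(pages[i]);
		if (first_nid == NUMA_NO_NODE)
			first_nid = nid;
		else if (nid != first_nid)
			remote++;
	}

	return remote <= max_ptes_remote;
}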

Both THP and large folios have the same false sharing problem, but at
least the false sharing problem for THP and NUMA balancing was a known
quantity. It was found initially that THP false sharing is mostly an
edge-case issue, mitigated by the fact that large anonymous buffers
tended to be either 2M-aligned or affected only at the boundaries. Later
glibc and ABI changes made it even more likely that private buffers were
THP-aligned. The same is not true of folios, and it is a new problem, so
I'm uncomfortable with a patch that essentially says "migrate folios
because it's possible" without considering any of the corner cases or
measuring them.

--
Mel Gorman
SUSE Labs

2023-11-20 08:02:16

by Baolin Wang

Subject: Re: [RFC PATCH] mm: support large folio numa balancing



On 11/17/2023 6:07 PM, Mel Gorman wrote:
> On Wed, Nov 15, 2023 at 10:58:32AM +0800, Huang, Ying wrote:
>> Baolin Wang <[email protected]> writes:
>>
>>> On 11/14/2023 9:12 AM, Huang, Ying wrote:
>>>> David Hildenbrand <[email protected]> writes:
>>>>
>>>>> On 13.11.23 11:45, Baolin Wang wrote:
>>>>>> Currently, file pages already support large folios, and support for
>>>>>> anonymous pages is also under discussion[1]. Moreover, the NUMA
>>>>>> balancing code was converted to use folios by a previous thread[2],
>>>>>> and the migrate_pages() function already supports large folio
>>>>>> migration. So I do not see any reason to continue restricting NUMA
>>>>>> balancing for large folios.
>>>>>
>>>>> I recall John wanted to look into that. CCing him.
>>>>>
>>>>> I'll note that the "head page mapcount" heuristic to detect sharers will
>>>>> now strike on the PTE path and make us believe that a large folio is
>>>>> exclusive, although it isn't.
>>>> Even a 4K folio may be shared by multiple processes/threads. So NUMA
>>>> balancing uses a multi-stage node selection algorithm (mostly
>>>> implemented in should_numa_migrate_memory()) to identify shared folios.
>>>> I think that the algorithm needs to be adjusted for shared, PTE-mapped
>>>> large folios.
>>>
>>> Not sure I get you here. should_numa_migrate_memory() uses the last
>>> CPU id, the last PID and the group's NUMA faults to determine whether
>>> this page can be migrated to the target node. So for a large folio, a
>>> precise check of the folio's sharers can make the group's NUMA faults
>>> more accurate, which should be enough for should_numa_migrate_memory()
>>> to make a decision?
>>
>> A large folio that is mapped by multiple processes may be accessed from
>> only one remote NUMA node, so we still want to migrate it. A large folio
>> that is mapped by one process but accessed by multiple threads on
>> multiple NUMA nodes may be better left unmigrated.
>>
>
> This leads into a generic problem with anything large under NUMA
> balancing -- false sharing. As it stands, THP can be falsely shared by
> threads if thread-local data is split within a THP range. In this case,
> the ideal would be for the THP to be migrated to the hottest node, but
> such support doesn't exist. The same applies to folios. If not handled

So the below checks in should_numa_migrate_memory() cannot avoid the
false sharing of large folios that you mentioned? Please correct me if
I missed anything.

	/*
	 * Destination node is much more heavily used than the source
	 * node? Allow migration.
	 */
	if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) *
					ACTIVE_NODE_FRACTION)
		return true;

	/*
	 * Distribute memory according to CPU & memory use on each node,
	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
	 *
	 * faults_cpu(dst)   3   faults_cpu(src)
	 * --------------- * - > ---------------
	 * faults_mem(dst)   4   faults_mem(src)
	 */
	return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 >
	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;


> properly, a large folio of any type can ping-pong between nodes, so
> migrating just because we can is not necessarily a good idea. The patch
> should cover a realistic case showing why this matters, why splitting
> the folio is not better, and include supporting data.

Sure. For a private mapping, we should always migrate the large folio.
The tricky part is the shared mapping, as you and Ying said, which can
have different scenarios, and I'm thinking about how to validate it (one
rough direction is sketched below). Do you have any suggestions? Thanks.
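
One possible way to exercise the shared-mapping case could be something
like the untested sketch below: one worker thread per node (via libnuma's
numa_run_on_node()) all touching the same 2M anonymous region, while
watching counters such as numa_hint_faults and numa_pages_migrated in
/proc/vmstat to see whether the folio ping-pongs between nodes. Error
handling is mostly omitted.

#include <numa.h>
#include <pthread.h>
#include <sys/mman.h>

#define REGION	(2UL << 20)

static char *shared;

static void *worker(void *arg)
{
	long node = (long)arg;

	/* Keep this thread on one node so its NUMA hint faults are remote
	 * for at least some of the other workers. */
	numa_run_on_node(node);
	for (long i = 0; i < 50000000; i++)
		shared[(i * 64) % REGION] += 1;
	return NULL;
}

int main(void)
{
	if (numa_available() < 0)
		return 1;

	int nodes = numa_max_node() + 1;
	pthread_t tid[nodes];

	shared = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(shared, REGION, MADV_HUGEPAGE);

	/* One worker per node, all sharing the same region. */
	for (long n = 0; n < nodes; n++)
		pthread_create(&tid[n], NULL, worker, (void *)n);
	for (int n = 0; n < nodes; n++)
		pthread_join(tid[n], NULL);
	return 0;
}

A private-mapping variant would simply give each worker its own region,
which should be the straightforward always-migrate case.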