LinuxLists.cc - [RFC PATCH v2] mm: support multi-size THP numa balancing

2024-03-15 09:28:26

Subject: [RFC PATCH v2] mm: support multi-size THP numa balancing

Now the anonymous page allocation already supports multi-size THP (mTHP),
but the numa balancing still prohibits mTHP migration even though it is an
exclusive mapping, which is unreasonable. Thus let's support the exclusive
mTHP numa balancing firstly.

Allow scanning mTHP:
Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
pages") skips shared CoW pages' NUMA page migration to avoid shared data
segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
NUMA-migrate COW pages that have other uses") change to use page_count()
to avoid GUP pages migration, that will also skip the mTHP numa scaning.
Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
issue, although there is still a GUP race, the issue seems to have been
resolved by commit 80d47f5de5e3. Meanwhile, use the folio_estimated_sharers()
to skip shared CoW pages though this is not a precise sharers count. To
check if the folio is shared, ideally we want to make sure every page is
mapped to the same process, but doing that seems expensive and using
the estimated mapcount seems can work when running autonuma benchmark.

Allow migrating mTHP:
As mentioned in the previous thread[1], large folios are more susceptible
to false sharing issues, leading to pages ping-pong back and forth during
numa balancing, which is currently hard to resolve. Therefore, as a start to
support mTHP numa balancing, only exclusive mappings are allowed to perform
numa migration to avoid the false sharing issues with large folios. Similarly,
use the estimated mapcount to skip shared mappings, which seems can work
in most cases (?), and we've used folio_estimated_sharers() to skip shared
mappings in migrate_misplaced_folio() for numa balancing, seems no real
complaints.

Performance data:
Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
Base: 2024-3-15 mm-unstable branch
Enable mTHP=64K to run autonuma-benchmark

Base without the patch:
numa01
222.97
numa01_THREAD_ALLOC
115.78
numa02
13.04
numa02_SMT
14.69

Base with the patch:
numa01
125.36
numa01_THREAD_ALLOC
44.58
numa02
9.22
numa02_SMT
7.46

[1] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Baolin Wang <[email protected]>
---
Changes from RFC v1:
- Add some preformance data per Huang, Ying.
- Allow mTHP scanning per David Hildenbrand.
- Avoid sharing mapping for numa balancing to avoid false sharing.
- Add more commit message.
---
mm/memory.c | 9 +++++----
mm/mprotect.c | 3 ++-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..b9d5d88c5a76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
int last_cpupid;
int target_nid;
pte_t pte, old_pte;
- int flags = 0;
+ int flags = 0, nr_pages = 0;

/*
* The pte cannot be used safely until we verify, while holding the page
@@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (!folio || folio_is_zone_device(folio))
goto out_map;

- /* TODO: handle PTE-mapped THP */
- if (folio_test_large(folio))
+ /* Avoid large folio false sharing */
+ if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
goto out_map;

/*
@@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
flags |= TNF_SHARED;

nid = folio_nid(folio);
+ nr_pages = folio_nr_pages(folio);
/*
* For memory tiering mode, cpupid of slow memory page is used
* to record page access time. So use default value.
@@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)

out:
if (nid != NUMA_NO_NODE)
- task_numa_fault(last_cpupid, nid, 1, flags);
+ task_numa_fault(last_cpupid, nid, nr_pages, flags);
return 0;
out_map:
/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f8a4544b4601..f0b9c974aaae 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,

/* Also skip shared copy-on-write pages */
if (is_cow_mapping(vma->vm_flags) &&
- folio_ref_count(folio) != 1)
+ (folio_maybe_dma_pinned(folio) ||
+ folio_estimated_sharers(folio) > 1))
continue;

/*
--
2.39.3

2024-03-18 06:19:09

by Huang, Ying

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

Baolin Wang <[email protected]> writes:

> Now the anonymous page allocation already supports multi-size THP (mTHP),
> but the numa balancing still prohibits mTHP migration even though it is an
> exclusive mapping, which is unreasonable. Thus let's support the exclusive
> mTHP numa balancing firstly.
>
> Allow scanning mTHP:
> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
> pages") skips shared CoW pages' NUMA page migration to avoid shared data
> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
> NUMA-migrate COW pages that have other uses") change to use page_count()
> to avoid GUP pages migration, that will also skip the mTHP numa scaning.
> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
> issue, although there is still a GUP race, the issue seems to have been
> resolved by commit 80d47f5de5e3. Meanwhile, use the folio_estimated_sharers()
> to skip shared CoW pages though this is not a precise sharers count. To
> check if the folio is shared, ideally we want to make sure every page is
> mapped to the same process, but doing that seems expensive and using
> the estimated mapcount seems can work when running autonuma benchmark.
>
> Allow migrating mTHP:
> As mentioned in the previous thread[1], large folios are more susceptible
> to false sharing issues, leading to pages ping-pong back and forth during
> numa balancing, which is currently hard to resolve. Therefore, as a start to
> support mTHP numa balancing, only exclusive mappings are allowed to perform
> numa migration to avoid the false sharing issues with large folios. Similarly,
> use the estimated mapcount to skip shared mappings, which seems can work
> in most cases (?), and we've used folio_estimated_sharers() to skip shared
> mappings in migrate_misplaced_folio() for numa balancing, seems no real
> complaints.

IIUC, folio_estimated_sharers() cannot identify multi-thread
applications. If some mTHP is shared by multiple threads in one
process, how to deal with that?

For example, I think that we should avoid to migrate on the first fault
for mTHP in should_numa_migrate_memory().

More thoughts? Can we add a field in struct folio for mTHP to count
hint page faults from the same node?

--
Best Regards,
Huang, Ying

> Performance data:
> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
> Base: 2024-3-15 mm-unstable branch
> Enable mTHP=64K to run autonuma-benchmark
>
> Base without the patch:
> numa01
> 222.97
> numa01_THREAD_ALLOC
> 115.78
> numa02
> 13.04
> numa02_SMT
> 14.69
>
> Base with the patch:
> numa01
> 125.36
> numa01_THREAD_ALLOC
> 44.58
> numa02
> 9.22
> numa02_SMT
> 7.46
>
> [1] https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> Changes from RFC v1:
> - Add some preformance data per Huang, Ying.
> - Allow mTHP scanning per David Hildenbrand.
> - Avoid sharing mapping for numa balancing to avoid false sharing.
> - Add more commit message.
> ---
> mm/memory.c | 9 +++++----
> mm/mprotect.c | 3 ++-
> 2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..b9d5d88c5a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> int last_cpupid;
> int target_nid;
> pte_t pte, old_pte;
> - int flags = 0;
> + int flags = 0, nr_pages = 0;
>
> /*
> * The pte cannot be used safely until we verify, while holding the page
> @@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> if (!folio || folio_is_zone_device(folio))
> goto out_map;
>
> - /* TODO: handle PTE-mapped THP */
> - if (folio_test_large(folio))
> + /* Avoid large folio false sharing */
> + if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
> goto out_map;
>
> /*
> @@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> flags |= TNF_SHARED;
>
> nid = folio_nid(folio);
> + nr_pages = folio_nr_pages(folio);
> /*
> * For memory tiering mode, cpupid of slow memory page is used
> * to record page access time. So use default value.
> @@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>
> out:
> if (nid != NUMA_NO_NODE)
> - task_numa_fault(last_cpupid, nid, 1, flags);
> + task_numa_fault(last_cpupid, nid, nr_pages, flags);
> return 0;
> out_map:
> /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f8a4544b4601..f0b9c974aaae 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>
> /* Also skip shared copy-on-write pages */
> if (is_cow_mapping(vma->vm_flags) &&
> - folio_ref_count(folio) != 1)
> + (folio_maybe_dma_pinned(folio) ||
> + folio_estimated_sharers(folio) > 1))
> continue;
>
> /*

2024-03-18 09:44:32

by Baolin Wang

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

On 2024/3/18 14:16, Huang, Ying wrote:
> Baolin Wang <[email protected]> writes:
>
>> Now the anonymous page allocation already supports multi-size THP (mTHP),
>> but the numa balancing still prohibits mTHP migration even though it is an
>> exclusive mapping, which is unreasonable. Thus let's support the exclusive
>> mTHP numa balancing firstly.
>>
>> Allow scanning mTHP:
>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
>> pages") skips shared CoW pages' NUMA page migration to avoid shared data
>> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
>> NUMA-migrate COW pages that have other uses") change to use page_count()
>> to avoid GUP pages migration, that will also skip the mTHP numa scaning.
>> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
>> issue, although there is still a GUP race, the issue seems to have been
>> resolved by commit 80d47f5de5e3. Meanwhile, use the folio_estimated_sharers()
>> to skip shared CoW pages though this is not a precise sharers count. To
>> check if the folio is shared, ideally we want to make sure every page is
>> mapped to the same process, but doing that seems expensive and using
>> the estimated mapcount seems can work when running autonuma benchmark.
>>
>> Allow migrating mTHP:
>> As mentioned in the previous thread[1], large folios are more susceptible
>> to false sharing issues, leading to pages ping-pong back and forth during
>> numa balancing, which is currently hard to resolve. Therefore, as a start to
>> support mTHP numa balancing, only exclusive mappings are allowed to perform
>> numa migration to avoid the false sharing issues with large folios. Similarly,
>> use the estimated mapcount to skip shared mappings, which seems can work
>> in most cases (?), and we've used folio_estimated_sharers() to skip shared
>> mappings in migrate_misplaced_folio() for numa balancing, seems no real
>> complaints.
>
> IIUC, folio_estimated_sharers() cannot identify multi-thread
> applications. If some mTHP is shared by multiple threads in one

Right.

> process, how to deal with that?

IMHO, seems the should_numa_migrate_memory() already did something to help?

.....
if (!cpupid_pid_unset(last_cpupid) &&
cpupid_to_nid(last_cpupid) != dst_nid)
return false;

/* Always allow migrate on private faults */
if (cpupid_match_pid(p, last_cpupid))
return true;
.....

If the node of the CPU that accessed the mTHP last time is different
from this time, which means there is some contention for that mTHP among
threads. So it will not allow migration.

If the contention for the mTHP among threads is light or the accessing
is relatively stable, then we can allow migration?

> For example, I think that we should avoid to migrate on the first fault
> for mTHP in should_numa_migrate_memory().
>
> More thoughts? Can we add a field in struct folio for mTHP to count
> hint page faults from the same node?

IIUC, we do not need add a new field for folio, seems we can reuse
->_flags_2a field. But how to use it? If there are multiple consecutive
NUMA faults from the same node, then allow migration?

>> Performance data:
>> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
>> Base: 2024-3-15 mm-unstable branch
>> Enable mTHP=64K to run autonuma-benchmark
>>
>> Base without the patch:
>> numa01
>> 222.97
>> numa01_THREAD_ALLOC
>> 115.78
>> numa02
>> 13.04
>> numa02_SMT
>> 14.69
>>
>> Base with the patch:
>> numa01
>> 125.36
>> numa01_THREAD_ALLOC
>> 44.58
>> numa02
>> 9.22
>> numa02_SMT
>> 7.46
>>
>> [1] https://lore.kernel.org/all/[email protected]/
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> Changes from RFC v1:
>> - Add some preformance data per Huang, Ying.
>> - Allow mTHP scanning per David Hildenbrand.
>> - Avoid sharing mapping for numa balancing to avoid false sharing.
>> - Add more commit message.
>> ---
>> mm/memory.c | 9 +++++----
>> mm/mprotect.c | 3 ++-
>> 2 files changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index f2bc6dd15eb8..b9d5d88c5a76 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>> int last_cpupid;
>> int target_nid;
>> pte_t pte, old_pte;
>> - int flags = 0;
>> + int flags = 0, nr_pages = 0;
>>
>> /*
>> * The pte cannot be used safely until we verify, while holding the page
>> @@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>> if (!folio || folio_is_zone_device(folio))
>> goto out_map;
>>
>> - /* TODO: handle PTE-mapped THP */
>> - if (folio_test_large(folio))
>> + /* Avoid large folio false sharing */
>> + if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
>> goto out_map;
>>
>> /*
>> @@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>> flags |= TNF_SHARED;
>>
>> nid = folio_nid(folio);
>> + nr_pages = folio_nr_pages(folio);
>> /*
>> * For memory tiering mode, cpupid of slow memory page is used
>> * to record page access time. So use default value.
>> @@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>
>> out:
>> if (nid != NUMA_NO_NODE)
>> - task_numa_fault(last_cpupid, nid, 1, flags);
>> + task_numa_fault(last_cpupid, nid, nr_pages, flags);
>> return 0;
>> out_map:
>> /*
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index f8a4544b4601..f0b9c974aaae 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>
>> /* Also skip shared copy-on-write pages */
>> if (is_cow_mapping(vma->vm_flags) &&
>> - folio_ref_count(folio) != 1)
>> + (folio_maybe_dma_pinned(folio) ||
>> + folio_estimated_sharers(folio) > 1))
>> continue;
>>
>> /*

2024-03-18 09:48:15

by David Hildenbrand

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

On 18.03.24 10:42, Baolin Wang wrote:
>
>
> On 2024/3/18 14:16, Huang, Ying wrote:
>> Baolin Wang <[email protected]> writes:
>>
>>> Now the anonymous page allocation already supports multi-size THP (mTHP),
>>> but the numa balancing still prohibits mTHP migration even though it is an
>>> exclusive mapping, which is unreasonable. Thus let's support the exclusive
>>> mTHP numa balancing firstly.
>>>
>>> Allow scanning mTHP:
>>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
>>> pages") skips shared CoW pages' NUMA page migration to avoid shared data
>>> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
>>> NUMA-migrate COW pages that have other uses") change to use page_count()
>>> to avoid GUP pages migration, that will also skip the mTHP numa scaning.
>>> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
>>> issue, although there is still a GUP race, the issue seems to have been
>>> resolved by commit 80d47f5de5e3. Meanwhile, use the folio_estimated_sharers()
>>> to skip shared CoW pages though this is not a precise sharers count. To
>>> check if the folio is shared, ideally we want to make sure every page is
>>> mapped to the same process, but doing that seems expensive and using
>>> the estimated mapcount seems can work when running autonuma benchmark.
>>>
>>> Allow migrating mTHP:
>>> As mentioned in the previous thread[1], large folios are more susceptible
>>> to false sharing issues, leading to pages ping-pong back and forth during
>>> numa balancing, which is currently hard to resolve. Therefore, as a start to
>>> support mTHP numa balancing, only exclusive mappings are allowed to perform
>>> numa migration to avoid the false sharing issues with large folios. Similarly,
>>> use the estimated mapcount to skip shared mappings, which seems can work
>>> in most cases (?), and we've used folio_estimated_sharers() to skip shared
>>> mappings in migrate_misplaced_folio() for numa balancing, seems no real
>>> complaints.
>>
>> IIUC, folio_estimated_sharers() cannot identify multi-thread
>> applications. If some mTHP is shared by multiple threads in one
>
> Right.
>

Wasn't this "false sharing" previously raised/described by Mel in this
context?

>> process, how to deal with that?
>
> IMHO, seems the should_numa_migrate_memory() already did something to help?
>
> ......
> if (!cpupid_pid_unset(last_cpupid) &&
> cpupid_to_nid(last_cpupid) != dst_nid)
> return false;
>
> /* Always allow migrate on private faults */
> if (cpupid_match_pid(p, last_cpupid))
> return true;
> ......
>
> If the node of the CPU that accessed the mTHP last time is different
> from this time, which means there is some contention for that mTHP among
> threads. So it will not allow migration.
>
> If the contention for the mTHP among threads is light or the accessing
> is relatively stable, then we can allow migration?
>
>> For example, I think that we should avoid to migrate on the first fault
>> for mTHP in should_numa_migrate_memory().
>>
>> More thoughts? Can we add a field in struct folio for mTHP to count
>> hint page faults from the same node?
>
> IIUC, we do not need add a new field for folio, seems we can reuse
> ->_flags_2a field. But how to use it? If there are multiple consecutive
> NUMA faults from the same node, then allow migration?

_flags_2a cannot be used. You could place something after _deferred_list
IIRC. But only for folios with order>1.

But I also wonder how one could achieve that using a new field.

--
Cheers,

David / dhildenb

2024-03-18 10:15:58

by Baolin Wang

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

On 2024/3/18 17:48, David Hildenbrand wrote:
> On 18.03.24 10:42, Baolin Wang wrote:
>>
>>
>> On 2024/3/18 14:16, Huang, Ying wrote:
>>> Baolin Wang <[email protected]> writes:
>>>
>>>> Now the anonymous page allocation already supports multi-size THP
>>>> (mTHP),
>>>> but the numa balancing still prohibits mTHP migration even though it
>>>> is an
>>>> exclusive mapping, which is unreasonable. Thus let's support the
>>>> exclusive
>>>> mTHP numa balancing firstly.
>>>>
>>>> Allow scanning mTHP:
>>>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data
>>>> section
>>>> pages") skips shared CoW pages' NUMA page migration to avoid shared
>>>> data
>>>> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
>>>> NUMA-migrate COW pages that have other uses") change to use
>>>> page_count()
>>>> to avoid GUP pages migration, that will also skip the mTHP numa
>>>> scaning.
>>>> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
>>>> issue, although there is still a GUP race, the issue seems to have been
>>>> resolved by commit 80d47f5de5e3. Meanwhile, use the
>>>> folio_estimated_sharers()
>>>> to skip shared CoW pages though this is not a precise sharers count. To
>>>> check if the folio is shared, ideally we want to make sure every
>>>> page is
>>>> mapped to the same process, but doing that seems expensive and using
>>>> the estimated mapcount seems can work when running autonuma benchmark.
>>>>
>>>> Allow migrating mTHP:
>>>> As mentioned in the previous thread[1], large folios are more
>>>> susceptible
>>>> to false sharing issues, leading to pages ping-pong back and forth
>>>> during
>>>> numa balancing, which is currently hard to resolve. Therefore, as a
>>>> start to
>>>> support mTHP numa balancing, only exclusive mappings are allowed to
>>>> perform
>>>> numa migration to avoid the false sharing issues with large folios.
>>>> Similarly,
>>>> use the estimated mapcount to skip shared mappings, which seems can
>>>> work
>>>> in most cases (?), and we've used folio_estimated_sharers() to skip
>>>> shared
>>>> mappings in migrate_misplaced_folio() for numa balancing, seems no real
>>>> complaints.
>>>
>>> IIUC, folio_estimated_sharers() cannot identify multi-thread
>>> applications. If some mTHP is shared by multiple threads in one
>>
>> Right.
>>
>
> Wasn't this "false sharing" previously raised/described by Mel in this
> context?

Yes, I got confused with the process's false sharing.

>>> process, how to deal with that?
>>
>> IMHO, seems the should_numa_migrate_memory() already did something to
>> help?
>>
>> ......
>>     if (!cpupid_pid_unset(last_cpupid) &&
>>                 cpupid_to_nid(last_cpupid) != dst_nid)
>>         return false;
>>
>>     /* Always allow migrate on private faults */
>>     if (cpupid_match_pid(p, last_cpupid))
>>         return true;
>> ......
>>
>> If the node of the CPU that accessed the mTHP last time is different
>> from this time, which means there is some contention for that mTHP among
>> threads. So it will not allow migration.
>>
>> If the contention for the mTHP among threads is light or the accessing
>> is relatively stable, then we can allow migration?
>>
>>> For example, I think that we should avoid to migrate on the first fault
>>> for mTHP in should_numa_migrate_memory().
>>>
>>> More thoughts? Can we add a field in struct folio for mTHP to count
>>> hint page faults from the same node?
>>
>> IIUC, we do not need add a new field for folio, seems we can reuse
>> ->_flags_2a field. But how to use it? If there are multiple consecutive
>> NUMA faults from the same node, then allow migration?
>
> _flags_2a cannot be used. You could place something after _deferred_list

Could you be more explicit? I didn't see _flags_2 currently being used,
did I miss something?

> IIRC. But only for folios with order>1.

Yes, order 1 folio may use the same strategy with order 0, but need some
evaluation.

> But I also wonder how one could achieve that using a new field.

2024-03-18 10:19:49

by David Hildenbrand

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

On 18.03.24 11:13, Baolin Wang wrote:
>
>
> On 2024/3/18 17:48, David Hildenbrand wrote:
>> On 18.03.24 10:42, Baolin Wang wrote:
>>>
>>>
>>> On 2024/3/18 14:16, Huang, Ying wrote:
>>>> Baolin Wang <[email protected]> writes:
>>>>
>>>>> Now the anonymous page allocation already supports multi-size THP
>>>>> (mTHP),
>>>>> but the numa balancing still prohibits mTHP migration even though it
>>>>> is an
>>>>> exclusive mapping, which is unreasonable. Thus let's support the
>>>>> exclusive
>>>>> mTHP numa balancing firstly.
>>>>>
>>>>> Allow scanning mTHP:
>>>>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data
>>>>> section
>>>>> pages") skips shared CoW pages' NUMA page migration to avoid shared
>>>>> data
>>>>> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
>>>>> NUMA-migrate COW pages that have other uses") change to use
>>>>> page_count()
>>>>> to avoid GUP pages migration, that will also skip the mTHP numa
>>>>> scaning.
>>>>> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
>>>>> issue, although there is still a GUP race, the issue seems to have been
>>>>> resolved by commit 80d47f5de5e3. Meanwhile, use the
>>>>> folio_estimated_sharers()
>>>>> to skip shared CoW pages though this is not a precise sharers count. To
>>>>> check if the folio is shared, ideally we want to make sure every
>>>>> page is
>>>>> mapped to the same process, but doing that seems expensive and using
>>>>> the estimated mapcount seems can work when running autonuma benchmark.
>>>>>
>>>>> Allow migrating mTHP:
>>>>> As mentioned in the previous thread[1], large folios are more
>>>>> susceptible
>>>>> to false sharing issues, leading to pages ping-pong back and forth
>>>>> during
>>>>> numa balancing, which is currently hard to resolve. Therefore, as a
>>>>> start to
>>>>> support mTHP numa balancing, only exclusive mappings are allowed to
>>>>> perform
>>>>> numa migration to avoid the false sharing issues with large folios.
>>>>> Similarly,
>>>>> use the estimated mapcount to skip shared mappings, which seems can
>>>>> work
>>>>> in most cases (?), and we've used folio_estimated_sharers() to skip
>>>>> shared
>>>>> mappings in migrate_misplaced_folio() for numa balancing, seems no real
>>>>> complaints.
>>>>
>>>> IIUC, folio_estimated_sharers() cannot identify multi-thread
>>>> applications. If some mTHP is shared by multiple threads in one
>>>
>>> Right.
>>>
>>
>> Wasn't this "false sharing" previously raised/described by Mel in this
>> context?
>
> Yes, I got confused with the process's false sharing.
>
>>>> process, how to deal with that?
>>>
>>> IMHO, seems the should_numa_migrate_memory() already did something to
>>> help?
>>>
>>> ......
>>>     if (!cpupid_pid_unset(last_cpupid) &&
>>>                 cpupid_to_nid(last_cpupid) != dst_nid)
>>>         return false;
>>>
>>>     /* Always allow migrate on private faults */
>>>     if (cpupid_match_pid(p, last_cpupid))
>>>         return true;
>>> ......
>>>
>>> If the node of the CPU that accessed the mTHP last time is different
>>> from this time, which means there is some contention for that mTHP among
>>> threads. So it will not allow migration.
>>>
>>> If the contention for the mTHP among threads is light or the accessing
>>> is relatively stable, then we can allow migration?
>>>
>>>> For example, I think that we should avoid to migrate on the first fault
>>>> for mTHP in should_numa_migrate_memory().
>>>>
>>>> More thoughts? Can we add a field in struct folio for mTHP to count
>>>> hint page faults from the same node?
>>>
>>> IIUC, we do not need add a new field for folio, seems we can reuse
>>> ->_flags_2a field. But how to use it? If there are multiple consecutive
>>> NUMA faults from the same node, then allow migration?
>>
>> _flags_2a cannot be used. You could place something after _deferred_list
>
> Could you be more explicit? I didn't see _flags_2 currently being used,
> did I miss something?

Yes, that we use it implicitly via page->flags on subpages (for example,
some flags are still per-subpage and not per-folio).

--
Cheers,

David / dhildenb

2024-03-18 10:31:57

by Baolin Wang

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

On 2024/3/18 18:15, David Hildenbrand wrote:
> On 18.03.24 11:13, Baolin Wang wrote:
>>
>>
>> On 2024/3/18 17:48, David Hildenbrand wrote:
>>> On 18.03.24 10:42, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/3/18 14:16, Huang, Ying wrote:
>>>>> Baolin Wang <[email protected]> writes:
>>>>>
>>>>>> Now the anonymous page allocation already supports multi-size THP
>>>>>> (mTHP),
>>>>>> but the numa balancing still prohibits mTHP migration even though it
>>>>>> is an
>>>>>> exclusive mapping, which is unreasonable. Thus let's support the
>>>>>> exclusive
>>>>>> mTHP numa balancing firstly.
>>>>>>
>>>>>> Allow scanning mTHP:
>>>>>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data
>>>>>> section
>>>>>> pages") skips shared CoW pages' NUMA page migration to avoid shared
>>>>>> data
>>>>>> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't
>>>>>> try to
>>>>>> NUMA-migrate COW pages that have other uses") change to use
>>>>>> page_count()
>>>>>> to avoid GUP pages migration, that will also skip the mTHP numa
>>>>>> scaning.
>>>>>> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
>>>>>> issue, although there is still a GUP race, the issue seems to have
>>>>>> been
>>>>>> resolved by commit 80d47f5de5e3. Meanwhile, use the
>>>>>> folio_estimated_sharers()
>>>>>> to skip shared CoW pages though this is not a precise sharers
>>>>>> count. To
>>>>>> check if the folio is shared, ideally we want to make sure every
>>>>>> page is
>>>>>> mapped to the same process, but doing that seems expensive and using
>>>>>> the estimated mapcount seems can work when running autonuma
>>>>>> benchmark.
>>>>>>
>>>>>> Allow migrating mTHP:
>>>>>> As mentioned in the previous thread[1], large folios are more
>>>>>> susceptible
>>>>>> to false sharing issues, leading to pages ping-pong back and forth
>>>>>> during
>>>>>> numa balancing, which is currently hard to resolve. Therefore, as a
>>>>>> start to
>>>>>> support mTHP numa balancing, only exclusive mappings are allowed to
>>>>>> perform
>>>>>> numa migration to avoid the false sharing issues with large folios.
>>>>>> Similarly,
>>>>>> use the estimated mapcount to skip shared mappings, which seems can
>>>>>> work
>>>>>> in most cases (?), and we've used folio_estimated_sharers() to skip
>>>>>> shared
>>>>>> mappings in migrate_misplaced_folio() for numa balancing, seems no
>>>>>> real
>>>>>> complaints.
>>>>>
>>>>> IIUC, folio_estimated_sharers() cannot identify multi-thread
>>>>> applications. If some mTHP is shared by multiple threads in one
>>>>
>>>> Right.
>>>>
>>>
>>> Wasn't this "false sharing" previously raised/described by Mel in this
>>> context?
>>
>> Yes, I got confused with the process's false sharing.
>>
>>>>> process, how to deal with that?
>>>>
>>>> IMHO, seems the should_numa_migrate_memory() already did something to
>>>> help?
>>>>
>>>> ......
>>>>      if (!cpupid_pid_unset(last_cpupid) &&
>>>>                  cpupid_to_nid(last_cpupid) != dst_nid)
>>>>          return false;
>>>>
>>>>      /* Always allow migrate on private faults */
>>>>      if (cpupid_match_pid(p, last_cpupid))
>>>>          return true;
>>>> ......
>>>>
>>>> If the node of the CPU that accessed the mTHP last time is different
>>>> from this time, which means there is some contention for that mTHP
>>>> among
>>>> threads. So it will not allow migration.
>>>>
>>>> If the contention for the mTHP among threads is light or the accessing
>>>> is relatively stable, then we can allow migration?
>>>>
>>>>> For example, I think that we should avoid to migrate on the first
>>>>> fault
>>>>> for mTHP in should_numa_migrate_memory().
>>>>>
>>>>> More thoughts? Can we add a field in struct folio for mTHP to count
>>>>> hint page faults from the same node?
>>>>
>>>> IIUC, we do not need add a new field for folio, seems we can reuse
>>>> ->_flags_2a field. But how to use it? If there are multiple consecutive
>>>> NUMA faults from the same node, then allow migration?
>>>
>>> _flags_2a cannot be used. You could place something after _deferred_list
>>
>> Could you be more explicit? I didn't see _flags_2 currently being used,
>> did I miss something?
>
> Yes, that we use it implicitly via page->flags on subpages (for example,
> some flags are still per-subpage and not per-folio).

Yes, thanks for reminding:)

2024-03-19 07:29:17

by Huang, Ying

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

Baolin Wang <[email protected]> writes:

> On 2024/3/18 14:16, Huang, Ying wrote:
>> Baolin Wang <[email protected]> writes:
>>
>>> Now the anonymous page allocation already supports multi-size THP (mTHP),
>>> but the numa balancing still prohibits mTHP migration even though it is an
>>> exclusive mapping, which is unreasonable. Thus let's support the exclusive
>>> mTHP numa balancing firstly.
>>>
>>> Allow scanning mTHP:
>>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
>>> pages") skips shared CoW pages' NUMA page migration to avoid shared data
>>> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
>>> NUMA-migrate COW pages that have other uses") change to use page_count()
>>> to avoid GUP pages migration, that will also skip the mTHP numa scaning.
>>> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
>>> issue, although there is still a GUP race, the issue seems to have been
>>> resolved by commit 80d47f5de5e3. Meanwhile, use the folio_estimated_sharers()
>>> to skip shared CoW pages though this is not a precise sharers count. To
>>> check if the folio is shared, ideally we want to make sure every page is
>>> mapped to the same process, but doing that seems expensive and using
>>> the estimated mapcount seems can work when running autonuma benchmark.
>>>
>>> Allow migrating mTHP:
>>> As mentioned in the previous thread[1], large folios are more susceptible
>>> to false sharing issues, leading to pages ping-pong back and forth during
>>> numa balancing, which is currently hard to resolve. Therefore, as a start to
>>> support mTHP numa balancing, only exclusive mappings are allowed to perform
>>> numa migration to avoid the false sharing issues with large folios. Similarly,
>>> use the estimated mapcount to skip shared mappings, which seems can work
>>> in most cases (?), and we've used folio_estimated_sharers() to skip shared
>>> mappings in migrate_misplaced_folio() for numa balancing, seems no real
>>> complaints.
>> IIUC, folio_estimated_sharers() cannot identify multi-thread
>> applications. If some mTHP is shared by multiple threads in one
>
> Right.
>
>> process, how to deal with that?
>
> IMHO, seems the should_numa_migrate_memory() already did something to help?
>
> ......
> if (!cpupid_pid_unset(last_cpupid) &&
> cpupid_to_nid(last_cpupid) != dst_nid)
> return false;
>
> /* Always allow migrate on private faults */
> if (cpupid_match_pid(p, last_cpupid))
> return true;
> ......
>
> If the node of the CPU that accessed the mTHP last time is different
> from this time, which means there is some contention for that mTHP
> among threads. So it will not allow migration.

Yes. The two-stage filter in should_numa_migrate_memory() helps at some
degree.

But the situation is somewhat different after your change. Previously,
in one round of NUMA balancing page table scanning, the number of the
hint page fault for one process and one folio is 1. After your change,
the number may become folio_nr_pages(). So we need to evaluate the
original algorithm in the new situation and revise. For example, use a
N-stage filter for mTHP.

Anyway, the NUMA balancing algorithm adjustment needs to be based on
test results.

Another possibility is to emulate the original behavior as much as
possible to try to reuse the original algorithm. For example, we can
restore all PTE maps upon the first hint page fault of a folio. Then,
the behavior is almost same as the original PMD mapped THP. Personally,
I prefer to use this as the first step. Then, try to adjust the
algorithm to take advantage of more information available.

>
> If the contention for the mTHP among threads is light or the accessing
> is relatively stable, then we can allow migration?
>
>> For example, I think that we should avoid to migrate on the first fault
>> for mTHP in should_numa_migrate_memory().

I am referring to the following code in should_numa_migrate_memory().

/*
* Allow first faults or private faults to migrate immediately early in
* the lifetime of a task. The magic number 4 is based on waiting for
* two full passes of the "multi-stage node selection" test that is
* executed below.
*/
if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) &&
(cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
return true;

But, after thought about this again, I realized that the original PMD
mapped THP may be migrated on the first fault sometimes. So, this isn't
a new problem. We may "optimize" it. But it needn't to be part of this
series.

>> More thoughts? Can we add a field in struct folio for mTHP to count
>> hint page faults from the same node?
>
> IIUC, we do not need add a new field for folio, seems we can reuse
> ->_flags_2a field. But how to use it? If there are multiple
> consecutive NUMA faults from the same node, then allow migration?
>
>>> Performance data:
>>> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
>>> Base: 2024-3-15 mm-unstable branch
>>> Enable mTHP=64K to run autonuma-benchmark
>>>
>>> Base without the patch:
>>> numa01
>>> 222.97
>>> numa01_THREAD_ALLOC
>>> 115.78
>>> numa02
>>> 13.04
>>> numa02_SMT
>>> 14.69
>>>
>>> Base with the patch:
>>> numa01
>>> 125.36
>>> numa01_THREAD_ALLOC
>>> 44.58
>>> numa02
>>> 9.22
>>> numa02_SMT
>>> 7.46
>>>

[snip]

--
Best Regards,
Huang, Ying

2024-03-21 07:13:07

by Baolin Wang

[permalink] [raw]

Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

(sorry for late reply)

On 2024/3/19 15:26, Huang, Ying wrote:
> Baolin Wang <[email protected]> writes:
>
>> On 2024/3/18 14:16, Huang, Ying wrote:
>>> Baolin Wang <[email protected]> writes:
>>>
>>>> Now the anonymous page allocation already supports multi-size THP (mTHP),
>>>> but the numa balancing still prohibits mTHP migration even though it is an
>>>> exclusive mapping, which is unreasonable. Thus let's support the exclusive
>>>> mTHP numa balancing firstly.
>>>>
>>>> Allow scanning mTHP:
>>>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
>>>> pages") skips shared CoW pages' NUMA page migration to avoid shared data
>>>> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
>>>> NUMA-migrate COW pages that have other uses") change to use page_count()
>>>> to avoid GUP pages migration, that will also skip the mTHP numa scaning.
>>>> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
>>>> issue, although there is still a GUP race, the issue seems to have been
>>>> resolved by commit 80d47f5de5e3. Meanwhile, use the folio_estimated_sharers()
>>>> to skip shared CoW pages though this is not a precise sharers count. To
>>>> check if the folio is shared, ideally we want to make sure every page is
>>>> mapped to the same process, but doing that seems expensive and using
>>>> the estimated mapcount seems can work when running autonuma benchmark.
>>>>
>>>> Allow migrating mTHP:
>>>> As mentioned in the previous thread[1], large folios are more susceptible
>>>> to false sharing issues, leading to pages ping-pong back and forth during
>>>> numa balancing, which is currently hard to resolve. Therefore, as a start to
>>>> support mTHP numa balancing, only exclusive mappings are allowed to perform
>>>> numa migration to avoid the false sharing issues with large folios. Similarly,
>>>> use the estimated mapcount to skip shared mappings, which seems can work
>>>> in most cases (?), and we've used folio_estimated_sharers() to skip shared
>>>> mappings in migrate_misplaced_folio() for numa balancing, seems no real
>>>> complaints.
>>> IIUC, folio_estimated_sharers() cannot identify multi-thread
>>> applications. If some mTHP is shared by multiple threads in one
>>
>> Right.
>>
>>> process, how to deal with that?
>>
>> IMHO, seems the should_numa_migrate_memory() already did something to help?
>>
>> ......
>> if (!cpupid_pid_unset(last_cpupid) &&
>> cpupid_to_nid(last_cpupid) != dst_nid)
>> return false;
>>
>> /* Always allow migrate on private faults */
>> if (cpupid_match_pid(p, last_cpupid))
>> return true;
>> ......
>>
>> If the node of the CPU that accessed the mTHP last time is different
>> from this time, which means there is some contention for that mTHP
>> among threads. So it will not allow migration.
>
> Yes. The two-stage filter in should_numa_migrate_memory() helps at some
> degree.
>
> But the situation is somewhat different after your change. Previously,
> in one round of NUMA balancing page table scanning, the number of the
> hint page fault for one process and one folio is 1. After your change,
> the number may become folio_nr_pages(). So we need to evaluate the

Yes, this follows the same strategy as THP.

> original algorithm in the new situation and revise. For example, use a
> N-stage filter for mTHP.

Yes, let me try if N-stage filter for mTHP can helpful.

>
> Anyway, the NUMA balancing algorithm adjustment needs to be based on
> test results.
>
> Another possibility is to emulate the original behavior as much as
> possible to try to reuse the original algorithm. For example, we can
> restore all PTE maps upon the first hint page fault of a folio. Then,
> the behavior is almost same as the original PMD mapped THP. Personally,
> I prefer to use this as the first step. Then, try to adjust the
> algorithm to take advantage of more information available.

OK, sounds reasonable, I will try.

>>
>> If the contention for the mTHP among threads is light or the accessing
>> is relatively stable, then we can allow migration?
>>
>>> For example, I think that we should avoid to migrate on the first fault
>>> for mTHP in should_numa_migrate_memory().
>
> I am referring to the following code in should_numa_migrate_memory().
>
> /*
> * Allow first faults or private faults to migrate immediately early in
> * the lifetime of a task. The magic number 4 is based on waiting for
> * two full passes of the "multi-stage node selection" test that is
> * executed below.
> */
> if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) &&
> (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
> return true;
>
> But, after thought about this again, I realized that the original PMD
> mapped THP may be migrated on the first fault sometimes. So, this isn't
> a new problem. We may "optimize" it. But it needn't to be part of this
> series.

Make sense. Thanks for your input.