2022-03-17 08:58:07

by maobibo

Subject: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

Collapsing a huge page copies its contents from a set of ordinary
small pages. The destination node is chosen as the node holding the
most source pages, but the THP daemon is not necessarily scheduled on
that node. Performance may be poor because the huge page is then
copied across nodes and the destination node's cache is not used.
With this patch, the khugepaged daemon switches to the same NUMA node
as the huge page, which saves copying time and makes better use of
the local cache.

With this patch, specint 2006 base performance is improved by 6% on a
Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.

Signed-off-by: Bibo Mao <[email protected]>
---
changelog:
V2: remove node record for thp daemon
V3: remove unlikely statement
---
mm/khugepaged.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..b3cf0885f5a2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
struct vm_area_struct *vma;
struct mmu_notifier_range range;
gfp_t gfp;
+ const struct cpumask *cpumask;

VM_BUG_ON(address & ~HPAGE_PMD_MASK);

@@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
* that. We will recheck the vma after taking it again in write mode.
*/
mmap_read_unlock(mm);
+
+ /* sched to the specified node before huge page memory copy */
+ if (task_node(current) != node) {
+ cpumask = cpumask_of_node(node);
+ if (!cpumask_empty(cpumask))
+ set_cpus_allowed_ptr(current, cpumask);
+ }
new_page = khugepaged_alloc_page(hpage, gfp, node);
if (!new_page) {
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
--
2.31.1


2022-04-27 21:45:00

by Andrew Morton

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <[email protected]> wrote:

> collapse huge page will copy huge page from general small pages,
> dest node is calculated from most one of source pages, however
> THP daemon is not scheduled on dest node. The performance may be
> poor since huge page copying across nodes, also cache is not used
> for target node. With this patch, khugepaged daemon switches to
> the same numa node with huge page. It saves copying time and makes
> use of local cache better.
>
> With this patch, specint 2006 base performance is improved with 6%
> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>

Are there any acks for this one please?

> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> gfp_t gfp;
> + const struct cpumask *cpumask;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> * that. We will recheck the vma after taking it again in write mode.
> */
> mmap_read_unlock(mm);
> +
> + /* sched to specified node before huage page memory copy */
> + if (task_node(current) != node) {
> + cpumask = cpumask_of_node(node);
> + if (!cpumask_empty(cpumask))
> + set_cpus_allowed_ptr(current, cpumask);
> + }
> new_page = khugepaged_alloc_page(hpage, gfp, node);
> if (!new_page) {
> result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> --
> 2.31.1
>

2022-04-28 08:40:48

by Yang Shi

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On Wed, Apr 27, 2022 at 1:48 PM Andrew Morton <[email protected]> wrote:
>
> On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <[email protected]> wrote:
>
> > collapse huge page will copy huge page from general small pages,
> > dest node is calculated from most one of source pages, however
> > THP daemon is not scheduled on dest node. The performance may be
> > poor since huge page copying across nodes, also cache is not used
> > for target node. With this patch, khugepaged daemon switches to
> > the same numa node with huge page. It saves copying time and makes
> > use of local cache better.
> >
> > With this patch, specint 2006 base performance is improved with 6%
> > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> >
>
> Are there any acks for this one please?

TBH, I'm a little bit reluctant about this patch. I agree that running
khugepaged on the same node as the source and destination pages could
reduce cross-socket traffic and use the cache more efficiently. But I'm
not sure whether it is really worth it or not. For example, on a busy
system khugepaged may jump from cpu to cpu, which may interfere with
the scheduler, and khugepaged has to wait to run on the target cpu,
which may take an indefinite amount of time. In addition, the yield
also depends on the locality of the source pages (how many of them are
on the same node), how often khugepaged is woken up on a different
node, etc.

Even if it were proved to be worth it, I'd prefer that
set_cpus_allowed_ptr() be called between mmap_read_unlock() and
mmap_write_lock(), in order to avoid wasted effort on some error paths.
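Concretely, something like the below (an untested sketch of the code
flow from my reading of the current source; the elided checks are
approximate, so treat it as illustrative only):

```c
	mmap_read_unlock(mm);
	new_page = khugepaged_alloc_page(hpage, gfp, node);
	if (!new_page) {
		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
		goto out_nolock;
	}
	/* ... memcg charge and the other early checks that can fail ... */

	/*
	 * Only migrate once all the early error paths have passed, so a
	 * failed allocation or charge does not cost a task migration.
	 */
	if (task_node(current) != node) {
		const struct cpumask *cpumask = cpumask_of_node(node);

		if (!cpumask_empty(cpumask))
			set_cpus_allowed_ptr(current, cpumask);
	}

	mmap_write_lock(mm);
```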

>
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > struct vm_area_struct *vma;
> > struct mmu_notifier_range range;
> > gfp_t gfp;
> > + const struct cpumask *cpumask;
> >
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > * that. We will recheck the vma after taking it again in write mode.
> > */
> > mmap_read_unlock(mm);
> > +
> > + /* sched to specified node before huage page memory copy */
> > + if (task_node(current) != node) {
> > + cpumask = cpumask_of_node(node);
> > + if (!cpumask_empty(cpumask))
> > + set_cpus_allowed_ptr(current, cpumask);
> > + }
> > new_page = khugepaged_alloc_page(hpage, gfp, node);
> > if (!new_page) {
> > result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> > --
> > 2.31.1
> >

2022-04-28 15:18:39

by Peter Xu

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

Hi, Bibo,

On Thu, Mar 17, 2022 at 02:50:24AM -0400, Bibo Mao wrote:
> collapse huge page will copy huge page from general small pages,
> dest node is calculated from most one of source pages, however
> THP daemon is not scheduled on dest node. The performance may be
> poor since huge page copying across nodes, also cache is not used
> for target node. With this patch, khugepaged daemon switches to
> the same numa node with huge page. It saves copying time and makes
> use of local cache better.
>
> With this patch, specint 2006 base performance is improved with 6%
> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.

I'm totally unfamiliar with specint, so a pure question: will this
make a real difference in real-world workloads? I assume that in the
real world the memory affinity to the processors changes relatively
slowly on tuned systems, so even if khugepaged copies a bit more
slowly, it won't affect the real workload much after the movement
completes?

The other question is, if this makes sense, whether it's applicable
to file THPs too (collapse_file)?

Thanks,

>
> Signed-off-by: Bibo Mao <[email protected]>
> ---
> changelog:
> V2: remove node record for thp daemon
> V3: remove unlikely statement
> ---
> mm/khugepaged.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..b3cf0885f5a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> gfp_t gfp;
> + const struct cpumask *cpumask;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> * that. We will recheck the vma after taking it again in write mode.
> */
> mmap_read_unlock(mm);
> +
> + /* sched to specified node before huage page memory copy */
> + if (task_node(current) != node) {
> + cpumask = cpumask_of_node(node);
> + if (!cpumask_empty(cpumask))
> + set_cpus_allowed_ptr(current, cpumask);
> + }
> new_page = khugepaged_alloc_page(hpage, gfp, node);
> if (!new_page) {
> result = SCAN_ALLOC_HUGE_PAGE_FAIL;

--
Peter Xu

2022-04-28 16:19:43

by David Hildenbrand

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On 27.04.22 22:48, Andrew Morton wrote:
> On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <[email protected]> wrote:
>
>> collapse huge page will copy huge page from general small pages,
>> dest node is calculated from most one of source pages, however
>> THP daemon is not scheduled on dest node. The performance may be
>> poor since huge page copying across nodes, also cache is not used
>> for target node. With this patch, khugepaged daemon switches to
>> the same numa node with huge page. It saves copying time and makes
>> use of local cache better.
>>
>> With this patch, specint 2006 base performance is improved with 6%
>> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>>
>
> Are there any acks for this one please?

I'll have a look!


--
Thanks,

David / dhildenb

2022-04-28 21:25:42

by David Hildenbrand

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On 17.03.22 07:50, Bibo Mao wrote:
> collapse huge page will copy huge page from general small pages,
> dest node is calculated from most one of source pages, however
> THP daemon is not scheduled on dest node. The performance may be
> poor since huge page copying across nodes, also cache is not used
> for target node. With this patch, khugepaged daemon switches to
> the same numa node with huge page. It saves copying time and makes
> use of local cache better.
>
> With this patch, specint 2006 base performance is improved with 6%
> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.

If it helps, that's nice as long as it doesn't hurt other cases.

>
> Signed-off-by: Bibo Mao <[email protected]>
> ---
> changelog:
> V2: remove node record for thp daemon
> V3: remove unlikely statement
> ---
> mm/khugepaged.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..b3cf0885f5a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> gfp_t gfp;
> + const struct cpumask *cpumask;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> * that. We will recheck the vma after taking it again in write mode.
> */
> mmap_read_unlock(mm);
> +
> + /* sched to specified node before huage page memory copy */

huage? I assume "huge"

> + if (task_node(current) != node) {
> + cpumask = cpumask_of_node(node);
> + if (!cpumask_empty(cpumask))
> + set_cpus_allowed_ptr(current, cpumask);
> + }

I wonder if that will always be optimized out without NUMA and if we
want to check for IS_ENABLED(CONFIG_NUMA).


Regarding comments from others, I agree: I think what we'd actually
want is something like "try to reschedule to one of these CPUs
immediately; if they are all busy, just stay here".


Also, I do wonder if there could already be scenarios where someone
wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
when pinning VCPUs in a VM to physical CPUs). It might even degrade the
VM performance in that case if we schedule something unrelated on these
CPUs. (I don't know which interfaces we might already have to configure
housekeeping CPUs for kthreads).

I can spot in kernel/kthread.c:kthread()

set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));

Hmmmmm ...
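
FWIW, a completely untested sketch of what a more careful version
might look like, guarding on CONFIG_NUMA and only narrowing within the
cpumask the thread is already allowed to run on, so an admin's (or the
housekeeping code's) pinning of khugepaged is respected:

```c
/*
 * Untested sketch: best-effort migration toward the destination node.
 * Restrict to the node's CPUs only if that intersects with the mask
 * we were already given; otherwise stay where we are.
 */
static void khugepaged_try_sched_to_node(int node)
{
	cpumask_var_t mask;

	if (!IS_ENABLED(CONFIG_NUMA) || task_node(current) == node)
		return;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return;

	/* cpumask_and() returns false if the intersection is empty. */
	if (cpumask_and(mask, cpumask_of_node(node), current->cpus_ptr))
		set_cpus_allowed_ptr(current, mask);

	free_cpumask_var(mask);
}
```

Even then, the narrowed mask would have to be restored afterwards, and
there's still no obvious way to express "only move if a CPU over there
is idle".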


--
Thanks,

David / dhildenb

2022-05-03 00:05:55

by Peter Xu

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> On 17.03.22 07:50, Bibo Mao wrote:
> > collapse huge page will copy huge page from general small pages,
> > dest node is calculated from most one of source pages, however
> > THP daemon is not scheduled on dest node. The performance may be
> > poor since huge page copying across nodes, also cache is not used
> > for target node. With this patch, khugepaged daemon switches to
> > the same numa node with huge page. It saves copying time and makes
> > use of local cache better.
> >
> > With this patch, specint 2006 base performance is improved with 6%
> > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>
> If it helps, that's nice as long as it doesn't hurt other cases.
>
> >
> > Signed-off-by: Bibo Mao <[email protected]>
> > ---
> > changelog:
> > V2: remove node record for thp daemon
> > V3: remove unlikely statement
> > ---
> > mm/khugepaged.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 131492fd1148..b3cf0885f5a2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > struct vm_area_struct *vma;
> > struct mmu_notifier_range range;
> > gfp_t gfp;
> > + const struct cpumask *cpumask;
> >
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > * that. We will recheck the vma after taking it again in write mode.
> > */
> > mmap_read_unlock(mm);
> > +
> > + /* sched to specified node before huage page memory copy */
>
> huage? I assume "huge"
>
> > + if (task_node(current) != node) {
> > + cpumask = cpumask_of_node(node);
> > + if (!cpumask_empty(cpumask))
> > + set_cpus_allowed_ptr(current, cpumask);
> > + }
>
> I wonder if that will always be optimized out without NUMA and if we
> want to check for IS_ENABLED(CONFIG_NUMA).
>
>
> Regarding comments from others, I agree: I think what we'd actually want
> is something like "try to reschedule to one of these CPUs immediately.
> If they are all busy, just stay here.
>
>
> Also, I do wonder if there could already be scenarios where someone
> wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
> when pinning VCPUs in a VM to physical CPUs). It might even degrade the
> VM performance in that case if we schedule something unrelated on these
> CPUs. (I don't know which interfaces we might already have to configure
> housekeeping CPUs for kthreads).
>
> I can spot in kernel/kthread.c:kthread()
>
> set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
>
> Hmmmmm ...

Yes, that's a valid point: for RT, AFAIK many users tune the kernel
threads on demand by pinning them. So I'm not sure whether this new
algorithm could break some existing users, by either (1) pinning
khugepaged onto some isolated cores (which could cause spikes?), or
(2) messing up the admin's previous pin settings on the khugepaged
kthread.

The other thing is that khugepaged's movement across the cores seems
to be quite random, because the pages it scans can be unpredictably
stored on different numa nodes, so logically it can easily start
bouncing on some hosts, and that does sound questionable.. as I
raised in my earlier (pure) question on the 2nd point, irrespective
of the benchmark result.

--
Peter Xu

2022-05-13 05:46:06

by maobibo

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page



On 2022/5/13 08:36, Andrew Morton wrote:
> On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <[email protected]> wrote:
>
>> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
>>> On 17.03.22 07:50, Bibo Mao wrote:
>>>> collapse huge page will copy huge page from general small pages,
>>>> dest node is calculated from most one of source pages, however
>>>> THP daemon is not scheduled on dest node. The performance may be
>>>> poor since huge page copying across nodes, also cache is not used
>>>> for target node. With this patch, khugepaged daemon switches to
>>>> the same numa node with huge page. It saves copying time and makes
>>>> use of local cache better.
>>>>
>>>> With this patch, specint 2006 base performance is improved with 6%
>>>> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>>>
>>> If it helps, that's nice as long as it doesn't hurt other cases.
>>>
>
> Quite a bit of doubtful feedback and we have yet to hear from the
> author. I'll drop the patch.
>
> Bibo, please resend at a later time if you feel the patch remains
> desirable. Please attempt to address the feedback via code changes
> and/or changelogging.
Sorry for the late response; the mail was filtered and I did not
notice it. The result is not so obvious after the bandwidth between
nodes is improved: it is only about a 1% improvement for specint2006
with 32 cores on my box.

For now I do not see any negative effect from this patch, unless the
user wants to keep some cores separated from the khugepaged daemon
process.


regards
bibo,mao


2022-05-13 08:23:42

by maobibo

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page



On 2022/5/13 09:49, Andrew Morton wrote:
> On Fri, 13 May 2022 09:29:07 +0800 maobibo <[email protected]> wrote:
>
>>
>>
>>>> and/or changelogging.
>>> Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
>>>
>>> Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.
>>
>> Can we provide an extra parameter to let khugepaged daemon scheduling binded to node or freely? If can, I will provide updated patch.
>
> It has always surprised me that we have a single khugepaged thread. If
> we had a thread per node, you'd be all fixed up, yes?
Yes, a thread per node would solve this issue, and it could also
speed up huge page scanning. It should be useful for short-running
workloads like specint.
>
> Ditto ksmd.


2022-05-13 14:59:44

by Andrew Morton

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On Fri, 13 May 2022 09:29:07 +0800 maobibo <[email protected]> wrote:

>
>
> >> and/or changelogging.
> > Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
> >
> > Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.
>
> Can we provide an extra parameter to let khugepaged daemon scheduling binded to node or freely? If can, I will provide updated patch.

It has always surprised me that we have a single khugepaged thread. If
we had a thread per node, you'd be all fixed up, yes?

Ditto ksmd.
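
(Purely illustrative: the thread-spawning side of that idea might look
roughly like the untested sketch below; the actual work would be
making each thread scan node-local mm slots, not starting the threads.)

```c
/*
 * Untested sketch: one khugepaged thread per online node, created on
 * its node and bound to that node's CPUs.  khugepaged() is the
 * existing thread function; the per-node scan-list split is the part
 * this sketch leaves out.
 */
static int start_khugepaged_per_node(void)
{
	struct task_struct *tsk;
	int nid;

	for_each_online_node(nid) {
		tsk = kthread_create_on_node(khugepaged, NULL, nid,
					     "khugepaged/%d", nid);
		if (IS_ERR(tsk))
			return PTR_ERR(tsk);
		kthread_bind_mask(tsk, cpumask_of_node(nid));
		wake_up_process(tsk);
	}
	return 0;
}
```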


2022-05-14 01:40:16

by Andrew Morton

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <[email protected]> wrote:

> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> > On 17.03.22 07:50, Bibo Mao wrote:
> > > collapse huge page will copy huge page from general small pages,
> > > dest node is calculated from most one of source pages, however
> > > THP daemon is not scheduled on dest node. The performance may be
> > > poor since huge page copying across nodes, also cache is not used
> > > for target node. With this patch, khugepaged daemon switches to
> > > the same numa node with huge page. It saves copying time and makes
> > > use of local cache better.
> > >
> > > With this patch, specint 2006 base performance is improved with 6%
> > > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> >
> > If it helps, that's nice as long as it doesn't hurt other cases.
> >

Quite a bit of doubtful feedback and we have yet to hear from the
author. I'll drop the patch.

Bibo, please resend at a later time if you feel the patch remains
desirable. Please attempt to address the feedback via code changes
and/or changelogging.


2022-05-14 03:22:09

by maobibo

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page



On 2022/5/13 09:19, maobibo wrote:
>
>
> On 2022/5/13 08:36, Andrew Morton wrote:
>> On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <[email protected]> wrote:
>>
>>> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
>>>> On 17.03.22 07:50, Bibo Mao wrote:
>>>>> collapse huge page will copy huge page from general small pages,
>>>>> dest node is calculated from most one of source pages, however
>>>>> THP daemon is not scheduled on dest node. The performance may be
>>>>> poor since huge page copying across nodes, also cache is not used
>>>>> for target node. With this patch, khugepaged daemon switches to
>>>>> the same numa node with huge page. It saves copying time and makes
>>>>> use of local cache better.
>>>>>
>>>>> With this patch, specint 2006 base performance is improved with 6%
>>>>> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>>>>
>>>> If it helps, that's nice as long as it doesn't hurt other cases.
>>>>
>>
>> Quite a bit of doubtful feedback and we have yet to hear from the
>> author. I'll drop the patch.
>>
>> Bibo, please resend at a later time if you feel the patch remains
>> desirable. Please attempt to address the feedback via code changes
>> and/or changelogging.
> Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
>
> Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.

Can we provide an extra parameter to let the khugepaged daemon's
scheduling be bound to a node or left free? If so, I will provide an
updated patch.

>
>
> regards
> bibo,mao


2022-05-14 04:14:06

by Yang Shi

Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

On Thu, May 12, 2022 at 6:49 PM Andrew Morton <[email protected]> wrote:
>
> On Fri, 13 May 2022 09:29:07 +0800 maobibo <[email protected]> wrote:
>
> >
> >
> > >> and/or changelogging.
> > > Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
> > >
> > > Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.
> >
> > Can we provide an extra parameter to let khugepaged daemon scheduling binded to node or freely? If can, I will provide updated patch.
>
> It has always surprised me that we have a single khugepaged thread. If
> we had a thread per node, you'd be all fixed up, yes?

Actually I was thinking about this before, but TBH I didn't see too
much benefit to this approach. Khugepaged scans vmas, and the mapped
pages may be spread across all the nodes. It is not like kswapd, which
can scan the LRUs for a specific node.

>
> Ditto ksmd.
>