This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
hugepages using prctl. It is based on my original patch to add a
per-task_struct flag to disable THP:
v1 - https://lkml.org/lkml/2013/8/2/671
v2 - https://lkml.org/lkml/2013/8/2/703
After looking at alternate methods of modifying how THPs are handed out,
it sounds like people might be more in favor of this type of approach,
so I'm re-introducing the patch.
It seemed that everyone was in favor of moving this control over to the
mm_struct, if it is to be implemented. That's the only major change
here, aside from the added ability to both set and clear the flag from
prctl.
The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified and using a malloc hook with
madvise is not an option (e.g., statically allocated data). This patch
allows us to do just that, without affecting other jobs running on the
system.
Here are some results showing the improvement that my test case gets
when the MMF_THP_DISABLE flag is clear vs. set:
MMF_THP_DISABLE clear:
# perf stat -a -r 3 ./prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Performance counter stats for './prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):
267694862.049279 task-clock # 641.100 CPUs utilized ( +- 0.23% ) [100.00%]
908,846 context-switches # 0.000 M/sec ( +- 0.23% ) [100.00%]
874 CPU-migrations # 0.000 M/sec ( +- 4.01% ) [100.00%]
131,966 page-faults # 0.000 M/sec ( +- 2.75% )
351,127,909,744,906 cycles # 1.312 GHz ( +- 0.27% ) [100.00%]
523,537,415,562,692 stalled-cycles-frontend # 149.10% frontend cycles idle ( +- 0.26% ) [100.00%]
392,400,753,609,156 stalled-cycles-backend # 111.75% backend cycles idle ( +- 0.29% ) [100.00%]
147,467,956,557,895 instructions # 0.42 insns per cycle
# 3.55 stalled cycles per insn ( +- 0.09% ) [100.00%]
26,922,737,309,021 branches # 100.572 M/sec ( +- 0.24% ) [100.00%]
1,308,714,545 branch-misses # 0.00% of all branches ( +- 0.18% )
417.555688399 seconds time elapsed ( +- 0.23% )
MMF_THP_DISABLE set:
# perf stat -a -r 3 ./prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Performance counter stats for './prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):
141674994.160138 task-clock # 642.107 CPUs utilized ( +- 0.23% ) [100.00%]
1,190,415 context-switches # 0.000 M/sec ( +- 42.87% ) [100.00%]
688 CPU-migrations # 0.000 M/sec ( +- 2.47% ) [100.00%]
62,394,646 page-faults # 0.000 M/sec ( +- 0.00% )
156,748,225,096,919 cycles # 1.106 GHz ( +- 0.20% ) [100.00%]
211,440,354,290,433 stalled-cycles-frontend # 134.89% frontend cycles idle ( +- 0.40% ) [100.00%]
114,304,536,881,102 stalled-cycles-backend # 72.92% backend cycles idle ( +- 0.88% ) [100.00%]
179,939,084,230,732 instructions # 1.15 insns per cycle
# 1.18 stalled cycles per insn ( +- 0.26% ) [100.00%]
26,659,099,949,509 branches # 188.171 M/sec ( +- 0.72% ) [100.00%]
762,772,361 branch-misses # 0.00% of all branches ( +- 0.97% )
220.640905073 seconds time elapsed ( +- 0.23% )
As you can see, this particular test gets about a 2x performance boost
when THP is turned off. Here's a link to the test, along with the
wrapper that I used:
http://oss.sgi.com/projects/memtests/thp_pthread_mmprctl.tar.gz
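For reference, here's a simplified sketch of what such a wrapper can
look like. This is not the exact source from the tarball above, just an
illustration using the PR_* values introduced by this patch:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE   41
#define PR_CLEAR_THP_DISABLE 42
#define PR_GET_THP_DISABLE   43
#endif

int main(int argc, char **argv)
{
	int thp_disabled;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <0|1> <cmd> [args...]\n", argv[0]);
		return 1;
	}

	/* argv[1] selects whether THP is disabled for the exec'd job */
	if (prctl(atoi(argv[1]) ? PR_SET_THP_DISABLE
				: PR_CLEAR_THP_DISABLE, 0, 0, 0, 0)) {
		perror("prctl");
		return 1;
	}

	/* read the flag back, just to demonstrate the get side */
	if (!prctl(PR_GET_THP_DISABLE, (unsigned long)&thp_disabled, 0, 0, 0))
		fprintf(stderr, "THP disabled: %d\n", thp_disabled);

	/* MMF_THP_DISABLE is inherited across exec, so the job runs
	 * with the chosen setting */
	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}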
There are still a few things that might need to be tweaked here, but I wanted
to get the patch out there to get a discussion started. Two things I
noted from the old patch discussion that will likely need to be
addressed are:
* The patch doesn't currently account for get_user_pages or khugepaged
allocations. Easy enough to fix, but it's not there yet.
* Current behavior is to have fork()/exec()'d processes inherit the
flag. Andrew Morton pointed out some possible issues with this, so we
may need to rethink this behavior.
- If a parent process has THP disabled and forks off a child, the child
will also have THP disabled. This may not be the desired behavior.
Signed-off-by: Alex Thorlton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: [email protected]
---
include/linux/huge_mm.h | 6 ++++--
include/linux/sched.h | 8 ++++++++
include/uapi/linux/prctl.h | 4 ++++
kernel/fork.c | 1 +
kernel/sys.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 91672e2..475f59f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H
+#include <linux/sched.h>
+
extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
@@ -74,7 +76,8 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
((__vma)->vm_flags & VM_HUGEPAGE))) && \
!((__vma)->vm_flags & VM_NOHUGEPAGE) && \
- !is_vma_temporary_stack(__vma))
+ !is_vma_temporary_stack(__vma) && \
+ !test_bit(MMF_THP_DISABLE, &(__vma)->vm_mm->flags))
#define transparent_hugepage_defrag(__vma) \
((transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
@@ -227,7 +230,6 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
{
return 0;
}
-
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 53f97eb..70623e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -373,7 +373,15 @@ extern int get_dumpable(struct mm_struct *mm);
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MMF_THP_DISABLE 21 /* disable THP for this mm */
+#define MMF_THP_DISABLE_MASK (1 << MMF_THP_DISABLE)
+
+#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | MMF_THP_DISABLE_MASK)
+#else
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
+#endif
+
struct sighand_struct {
atomic_t count;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 289760f..a8b6ba5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -149,4 +149,8 @@
#define PR_GET_TID_ADDRESS 40
+#define PR_SET_THP_DISABLE 41
+#define PR_CLEAR_THP_DISABLE 42
+#define PR_GET_THP_DISABLE 43
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5721f0e..3337e85 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -818,6 +818,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
#endif
+
if (!mm_init(mm, tsk))
goto fail_nomem;
diff --git a/kernel/sys.c b/kernel/sys.c
index c723113..4864aaf 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1835,6 +1835,42 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
}
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int prctl_set_thp_disable(struct task_struct *me)
+{
+ set_bit(MMF_THP_DISABLE, &me->mm->flags);
+ return 0;
+}
+
+static int prctl_clear_thp_disable(struct task_struct *me)
+{
+ clear_bit(MMF_THP_DISABLE, &me->mm->flags);
+ return 0;
+}
+
+static int prctl_get_thp_disable(struct task_struct *me,
+ int __user *thp_disabled)
+{
+ return put_user(test_bit(MMF_THP_DISABLE, &me->mm->flags), thp_disabled);
+}
+#else
+static int prctl_set_thp_disable(struct task_struct *me)
+{
+ return -EINVAL;
+}
+
+static int prctl_clear_thp_disable(struct task_struct *me)
+{
+ return -EINVAL;
+}
+
+static int prctl_get_thp_disable(struct task_struct *me,
+ int __user *thp_disabled)
+{
+ return -EINVAL;
+}
+#endif
+
SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -1998,6 +2034,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
return current->no_new_privs ? 1 : 0;
+ case PR_SET_THP_DISABLE:
+ error = prctl_set_thp_disable(me);
+ break;
+ case PR_CLEAR_THP_DISABLE:
+ error = prctl_clear_thp_disable(me);
+ break;
+ case PR_GET_THP_DISABLE:
+ error = prctl_get_thp_disable(me, (int __user *) arg2);
+ break;
default:
error = -EINVAL;
break;
--
1.7.12.4
On Fri, Jan 10, 2014 at 01:55:18PM -0600, Alex Thorlton wrote:
> This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
> hugepages using prctl. It is based on my original patch to add a
> per-task_struct flag to disable THP:
>
> v1 - https://lkml.org/lkml/2013/8/2/671
> v2 - https://lkml.org/lkml/2013/8/2/703
>
> After looking at alternate methods of modifying how THPs are handed out,
> it sounds like people might be more in favor of this type of approach,
> so I'm re-introducing the patch.
>
> It seemed that everyone was in favor of moving this control over to the
> mm_struct, if it is to be implemented. That's the only major change
> here, aside from the added ability to both set and clear the flag from
> prctl.
>
> The main motivation behind this patch is to provide a way to disable THP
> for jobs where the code cannot be modified and using a malloc hook with
> madvise is not an option (e.g., statically allocated data). This patch
> allows us to do just that, without affecting other jobs running on the
> system.
>
> Here are some results showing the improvement that my test case gets
> when the MMF_THP_DISABLE flag is clear vs. set:
>
> MMF_THP_DISABLE clear:
>
> # perf stat -a -r 3 ./prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
>
> Performance counter stats for './prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):
>
> 267694862.049279 task-clock # 641.100 CPUs utilized ( +- 0.23% ) [100.00%]
> 908,846 context-switches # 0.000 M/sec ( +- 0.23% ) [100.00%]
> 874 CPU-migrations # 0.000 M/sec ( +- 4.01% ) [100.00%]
> 131,966 page-faults # 0.000 M/sec ( +- 2.75% )
> 351,127,909,744,906 cycles # 1.312 GHz ( +- 0.27% ) [100.00%]
> 523,537,415,562,692 stalled-cycles-frontend # 149.10% frontend cycles idle ( +- 0.26% ) [100.00%]
> 392,400,753,609,156 stalled-cycles-backend # 111.75% backend cycles idle ( +- 0.29% ) [100.00%]
> 147,467,956,557,895 instructions # 0.42 insns per cycle
> # 3.55 stalled cycles per insn ( +- 0.09% ) [100.00%]
> 26,922,737,309,021 branches # 100.572 M/sec ( +- 0.24% ) [100.00%]
> 1,308,714,545 branch-misses # 0.00% of all branches ( +- 0.18% )
>
> 417.555688399 seconds time elapsed ( +- 0.23% )
>
>
> MMF_THP_DISABLE set:
>
> # perf stat -a -r 3 ./prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
>
> Performance counter stats for './prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):
>
> 141674994.160138 task-clock # 642.107 CPUs utilized ( +- 0.23% ) [100.00%]
> 1,190,415 context-switches # 0.000 M/sec ( +- 42.87% ) [100.00%]
> 688 CPU-migrations # 0.000 M/sec ( +- 2.47% ) [100.00%]
> 62,394,646 page-faults # 0.000 M/sec ( +- 0.00% )
> 156,748,225,096,919 cycles # 1.106 GHz ( +- 0.20% ) [100.00%]
> 211,440,354,290,433 stalled-cycles-frontend # 134.89% frontend cycles idle ( +- 0.40% ) [100.00%]
> 114,304,536,881,102 stalled-cycles-backend # 72.92% backend cycles idle ( +- 0.88% ) [100.00%]
> 179,939,084,230,732 instructions # 1.15 insns per cycle
> # 1.18 stalled cycles per insn ( +- 0.26% ) [100.00%]
> 26,659,099,949,509 branches # 188.171 M/sec ( +- 0.72% ) [100.00%]
> 762,772,361 branch-misses # 0.00% of all branches ( +- 0.97% )
>
> 220.640905073 seconds time elapsed ( +- 0.23% )
>
> As you can see, this particular test gets about a 2x performance boost
> when THP is turned off.
Do you know what causes the difference? I prefer to fix THP instead of
adding a new knob to disable it.
--
Kirill A. Shutemov
On Fri, Jan 10, 2014 at 10:23:10PM +0200, Kirill A. Shutemov wrote:
> Do you know what causes the difference? I prefer to fix THP instead of
> adding a new knob to disable it.
The issue is that when you touch 1 byte of an untouched, contiguous 2MB
chunk, a THP will be handed out, and the THP will be stuck on whatever
node the chunk was originally referenced from. If many remote nodes
need to do work on that same chunk, they'll be making remote accesses.
With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory. I give a bit better description here:
https://lkml.org/lkml/2013/8/27/397
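To illustrate the pattern, here is a rough sketch (not the actual
thp_pthread test; sizes and thread counts are arbitrary):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define REGION   (1UL << 33)	/* 8GB shared region, for example */
#define NTHREADS 64

static char *region;

static void *worker(void *arg)
{
	long id = (long)arg;
	size_t chunk = REGION / NTHREADS;

	/* with THP, the 2MB pages backing this slice were already placed
	 * on the first-toucher's node; with 4K pages, most of the slice
	 * faults in here, local to this worker's node */
	memset(region + id * chunk, id, chunk);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	size_t off;
	long i;

	region = malloc(REGION);
	if (!region)
		return 1;

	/* first touch: faulting one byte per 2MB chunk can map in a full
	 * THP each time, all stuck on *this* thread's node */
	for (off = 0; off < REGION; off += 2UL << 20)
		region[off] = 0;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	return 0;
}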
I had been looking into better ways to handle this issue, but after
spinning through a few other ideas:
- Per cpuset flag to control THP:
https://lkml.org/lkml/2013/6/10/331
- Threshold to determine when to hand out THPs:
https://lkml.org/lkml/2013/12/12/394
We've arrived back here. Andrea seemed to think that this is an
acceptable approach to solve the problem, at least as a starting point:
https://lkml.org/lkml/2013/12/17/397
I agree that we should, ideally, come up with a way to appropriately
handle this problem in the kernel, but as of right now, it appears that
that might be a rather large undertaking.
- Alex
On Fri, Jan 10, 2014 at 04:01:55PM -0600, Alex Thorlton wrote:
> On Fri, Jan 10, 2014 at 10:23:10PM +0200, Kirill A. Shutemov wrote:
> > Do you know what causes the difference? I prefer to fix THP instead of
> > adding a new knob to disable it.
>
> The issue is that when you touch 1 byte of an untouched, contiguous 2MB
> chunk, a THP will be handed out, and the THP will be stuck on whatever
> node the chunk was originally referenced from. If many remote nodes
> need to do work on that same chunk, they'll be making remote accesses.
> With THP disabled, 4K pages can be handed out to separate nodes as
> they're needed, greatly reducing the amount of remote accesses to
> memory. I give a bit better description here:
>
> https://lkml.org/lkml/2013/8/27/397
>
> I had been looking into better ways to handle this issue, but after
> spinning through a few other ideas:
>
> - Per cpuset flag to control THP:
> https://lkml.org/lkml/2013/6/10/331
>
> - Threshold to determine when to hand out THPs:
> https://lkml.org/lkml/2013/12/12/394
>
> We've arrived back here. Andrea seemed to think that this is an
> acceptable approach to solve the problem, at least as a starting point:
>
> https://lkml.org/lkml/2013/12/17/397
>
> I agree that we should, ideally, come up with a way to appropriately
> handle this problem in the kernel, but as of right now, it appears that
> that might be a rather large undertaking.
We already have the information to determine if a page is shared across
nodes, Mel even had some prototype code to do splits under those
conditions.
On Fri, Jan 10, 2014 at 04:01:55PM -0600, Alex Thorlton wrote:
> On Fri, Jan 10, 2014 at 10:23:10PM +0200, Kirill A. Shutemov wrote:
> > Do you know what causes the difference? I prefer to fix THP instead of
> > adding a new knob to disable it.
>
> The issue is that when you touch 1 byte of an untouched, contiguous 2MB
> chunk, a THP will be handed out, and the THP will be stuck on whatever
> node the chunk was originally referenced from. If many remote nodes
> need to do work on that same chunk, they'll be making remote accesses.
> With THP disabled, 4K pages can be handed out to separate nodes as
> they're needed, greatly reducing the amount of remote accesses to
> memory.
I think this problem *potentially* could be fixed by the NUMA balancer.
(Although, I don't really know how the balancer works...)
If we see NUMA hint faults for addresses in different 4k pages inside a huge
page from more than one node, we could split the huge page.
Mel, is it possible? Do we collect enough info to make the decision?
--
Kirill A. Shutemov
On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> We already have the information to determine if a page is shared across
> nodes, Mel even had some prototype code to do splits under those
> conditions.
I'm aware that we can determine if pages are shared across nodes, but I
thought that Mel's code to split pages under these conditions had some
performance issues. I know I've seen the code that Mel wrote to do
this, but I can't seem to dig it up right now. Could you point me to
it?
- Alex
On 01/10, Alex Thorlton wrote:
>
> This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
> hugepages using prctl. It is based on my original patch to add a
> per-task_struct flag to disable THP:
I leave the "whether we need this feature" to other reviewers, although
personally I think it probably makes sense anyway.
But the patch doesn't look nice imho.
> @@ -373,7 +373,15 @@ extern int get_dumpable(struct mm_struct *mm);
> #define MMF_HAS_UPROBES 19 /* has uprobes */
> #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define MMF_THP_DISABLE 21 /* disable THP for this mm */
> +#define MMF_THP_DISABLE_MASK (1 << MMF_THP_DISABLE)
> +
> +#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | MMF_THP_DISABLE_MASK)
> +#else
> #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
> +#endif
It would be nice to lessen the number of ifdef's. Why can't we define
MMF_THP_DISABLE unconditionally and include it into MMF_INIT_MASK?
Or define it == 0 if !CONFIG_THP. But this is minor.
> +#define PR_SET_THP_DISABLE 41
> +#define PR_CLEAR_THP_DISABLE 42
> +#define PR_GET_THP_DISABLE 43
Why can't we add 2 PR_'s, set and get?
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -818,6 +818,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> mm->pmd_huge_pte = NULL;
> #endif
> +
> if (!mm_init(mm, tsk))
> goto fail_nomem;
Why? It looks like an accidental change.
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1835,6 +1835,42 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
> }
> #endif
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static int prctl_set_thp_disable(struct task_struct *me)
> +{
> + set_bit(MMF_THP_DISABLE, &me->mm->flags);
> + return 0;
> +}
> +
> +static int prctl_clear_thp_disable(struct task_struct *me)
> +{
> + clear_bit(MMF_THP_DISABLE, &me->mm->flags);
> + return 0;
> +}
> +
> +static int prctl_get_thp_disable(struct task_struct *me,
> + int __user *thp_disabled)
> +{
> + return put_user(test_bit(MMF_THP_DISABLE, &me->mm->flags), thp_disabled);
> +}
> +#else
> +static int prctl_set_thp_disable(struct task_struct *me)
> +{
> + return -EINVAL;
> +}
> +
> +static int prctl_clear_thp_disable(struct task_struct *me)
> +{
> + return -EINVAL;
> +}
> +
> +static int prctl_get_thp_disable(struct task_struct *me,
> + int __user *thp_disabled)
> +{
> + return -EINVAL;
> +}
> +#endif
> +
> SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> unsigned long, arg4, unsigned long, arg5)
> {
> @@ -1998,6 +2034,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> if (arg2 || arg3 || arg4 || arg5)
> return -EINVAL;
> return current->no_new_privs ? 1 : 0;
> + case PR_SET_THP_DISABLE:
> + error = prctl_set_thp_disable(me);
> + break;
> + case PR_CLEAR_THP_DISABLE:
> + error = prctl_clear_thp_disable(me);
> + break;
> + case PR_GET_THP_DISABLE:
> + error = prctl_get_thp_disable(me, (int __user *) arg2);
> + break;
> default:
> error = -EINVAL;
> break;
I simply can't understand this; it all looks like overkill. Can't you simply add
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
case GET:
error = test_bit(MMF_THP_DISABLE);
break;
case PUT:
if (arg2)
set_bit();
else
clear_bit();
break;
#endif
into sys_prctl() ?
Oleg.
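Fleshed out, Oleg's suggestion might look something like the fragment
below (a sketch only; folding set/clear into a single PR_SET_THP_DISABLE
that takes arg2, and the argument checks, are assumptions rather than
what the posted patch does):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	case PR_SET_THP_DISABLE:
		if (arg3 || arg4 || arg5)
			return -EINVAL;
		if (arg2)
			set_bit(MMF_THP_DISABLE, &me->mm->flags);
		else
			clear_bit(MMF_THP_DISABLE, &me->mm->flags);
		break;
	case PR_GET_THP_DISABLE:
		if (arg2 || arg3 || arg4 || arg5)
			return -EINVAL;
		error = test_bit(MMF_THP_DISABLE, &me->mm->flags);
		break;
#endif

This matches the style of the surrounding sys_prctl() cases and returns
the flag in the syscall's return value rather than through a user
pointer.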
On 01/10, Kirill A. Shutemov wrote:
>
> I prefer to fix THP instead of
> adding a new knob to disable it.
I agree. But if we have the per-vma MADV/VM_ flags, it looks
natural to also have the per-mm knob which affects all vmas.
Besides, this allows controlling the THP behaviour after exec.
Oleg.
On Sat, Jan 11, 2014 at 04:53:37PM +0100, Oleg Nesterov wrote:
> On 01/10, Alex Thorlton wrote:
> >
> > This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
> > hugepages using prctl. It is based on my original patch to add a
> > per-task_struct flag to disable THP:
>
> I leave the "whether we need this feature" to other reviewers, although
> personally I think it probably makes sense anyway.
>
> But the patch doesn't look nice imho.
>
> > @@ -373,7 +373,15 @@ extern int get_dumpable(struct mm_struct *mm);
> > #define MMF_HAS_UPROBES 19 /* has uprobes */
> > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#define MMF_THP_DISABLE 21 /* disable THP for this mm */
> > +#define MMF_THP_DISABLE_MASK (1 << MMF_THP_DISABLE)
> > +
> > +#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | MMF_THP_DISABLE_MASK)
> > +#else
> > #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
> > +#endif
>
> It would be nice to lessen the number of ifdef's. Why can't we define
> MMF_THP_DISABLE unconditionally and include it into MMF_INIT_MASK?
> Or define it == 0 if !CONFIG_THP. But this is minor.
That's a good idea. I guess I was thinking that we don't want to define
any THP-specific stuff when THP isn't configured, but I guess it doesn't
make much of a difference since the flag will never be set if THP isn't
configured.
> > +#define PR_SET_THP_DISABLE 41
> > +#define PR_CLEAR_THP_DISABLE 42
> > +#define PR_GET_THP_DISABLE 43
>
> Why can't we add 2 PR_'s, set and get?
See response below.
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -818,6 +818,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> > mm->pmd_huge_pte = NULL;
> > #endif
> > +
> > if (!mm_init(mm, tsk))
> > goto fail_nomem;
>
> Why? It looks like an accidental change.
Ah, yes. Didn't catch that when I looked over the patch. I'll fix
that.
>
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1835,6 +1835,42 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
> > }
> > #endif
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static int prctl_set_thp_disable(struct task_struct *me)
> > +{
> > + set_bit(MMF_THP_DISABLE, &me->mm->flags);
> > + return 0;
> > +}
> > +
> > +static int prctl_clear_thp_disable(struct task_struct *me)
> > +{
> > + clear_bit(MMF_THP_DISABLE, &me->mm->flags);
> > + return 0;
> > +}
> > +
> > +static int prctl_get_thp_disable(struct task_struct *me,
> > + int __user *thp_disabled)
> > +{
> > + return put_user(test_bit(MMF_THP_DISABLE, &me->mm->flags), thp_disabled);
> > +}
> > +#else
> > +static int prctl_set_thp_disable(struct task_struct *me)
> > +{
> > + return -EINVAL;
> > +}
> > +
> > +static int prctl_clear_thp_disable(struct task_struct *me)
> > +{
> > + return -EINVAL;
> > +}
> > +
> > +static int prctl_get_thp_disable(struct task_struct *me,
> > + int __user *thp_disabled)
> > +{
> > + return -EINVAL;
> > +}
> > +#endif
> > +
> > SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > unsigned long, arg4, unsigned long, arg5)
> > {
> > @@ -1998,6 +2034,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > if (arg2 || arg3 || arg4 || arg5)
> > return -EINVAL;
> > return current->no_new_privs ? 1 : 0;
> > + case PR_SET_THP_DISABLE:
> > + error = prctl_set_thp_disable(me);
> > + break;
> > + case PR_CLEAR_THP_DISABLE:
> > + error = prctl_clear_thp_disable(me);
> > + break;
> > + case PR_GET_THP_DISABLE:
> > + error = prctl_get_thp_disable(me, (int __user *) arg2);
> > + break;
> > default:
> > error = -EINVAL;
> > break;
>
> I simply can't understand this; it all looks like overkill. Can't you simply add
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> case GET:
> error = test_bit(MMF_THP_DISABLE);
> break;
> case PUT:
> if (arg2)
> set_bit();
> else
> clear_bit();
> break;
> #endif
>
> into sys_prctl() ?
That's probably a better solution. I wasn't sure whether or not it was
better to have two functions to handle this, or to have one function
handle both. If you think it's better to just handle both with one,
that's easy enough to change.
- Alex
On 01/11, Alex Thorlton wrote:
>
> On Sat, Jan 11, 2014 at 04:53:37PM +0100, Oleg Nesterov wrote:
>
> > I simply can't understand this; it all looks like overkill. Can't you simply add
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > case GET:
> > error = test_bit(MMF_THP_DISABLE);
> > break;
> > case PUT:
> > if (arg2)
> > set_bit();
> > else
> > clear_bit();
> > break;
> > #endif
> >
> > into sys_prctl() ?
>
> That's probably a better solution. I wasn't sure whether or not it was
> better to have two functions to handle this, or to have one function
> handle both. If you think it's better to just handle both with one,
> that's easy enough to change.
Personally I think sys_prctl() can handle this itself, without a helper.
But of course I won't argue, this is up to you.
My only point is, the kernel is already huge ;) Imho it makes sense to
try to lessen the code size, when the logic is simple.
Oleg.
On Sun, Jan 12, 2014 at 02:56:00PM +0100, Oleg Nesterov wrote:
> On 01/11, Alex Thorlton wrote:
> >
> > On Sat, Jan 11, 2014 at 04:53:37PM +0100, Oleg Nesterov wrote:
> >
> > > I simply can't understand this; it all looks like overkill. Can't you simply add
> > >
> > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > case GET:
> > > error = test_bit(MMF_THP_DISABLE);
> > > break;
> > > case PUT:
> > > if (arg2)
> > > set_bit();
> > > else
> > > clear_bit();
> > > break;
> > > #endif
> > >
> > > into sys_prctl() ?
> >
> > That's probably a better solution. I wasn't sure whether or not it was
> > better to have two functions to handle this, or to have one function
> > handle both. If you think it's better to just handle both with one,
> > that's easy enough to change.
>
> Personally I think sys_prctl() can handle this itself, without a helper.
> But of course I won't argue, this is up to you.
>
> My only point is, the kernel is already huge ;) Imho it makes sense to
> try to lessen the code size, when the logic is simple.
I agree with you here as well. There was a mixed bag of PRCTLs using
helpers vs. ones that put the code right into sys_prctl. I just
arbitrarily chose to use a helper here. I'll switch that over for v2.
- Alex
On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote:
> On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> > We already have the information to determine if a page is shared across
> > nodes, Mel even had some prototype code to do splits under those
> > conditions.
>
> I'm aware that we can determine if pages are shared across nodes, but I
> thought that Mel's code to split pages under these conditions had some
> performance issues. I know I've seen the code that Mel wrote to do
> this, but I can't seem to dig it up right now. Could you point me to
> it?
>
It was a lot of revisions ago! The git branches no longer exist but the
diff from the monolithic patches is below. The baseline was v3.10 and
this will no longer apply but you'll see the two places where I added a
split_huge_page and prevented khugepaged collapsing them again. At the
time, the performance with it applied was much worse but it was a 10
minute patch as a distraction. There was a range of basic problems that
had to be tackled before there was any point looking at splitting THP due
to locality. I did not pursue it further and have not revisited it since.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c8b25a8..2b80abe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1317,6 +1317,23 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
last_nidpid = page_nidpid_last(page);
target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1) {
+ int last_pid = nidpid_to_pid(last_nidpid);
+
+ /*
+ * If the fault failed to pass the two-stage filter but is on
+ * a remote node then it could be due to false sharing of the
+ * THP page. Remote accesses are more expensive than base
+ * page TLB accesses so split the huge page and return to
+ * retry the fault.
+ */
+ if (!nidpid_nid_unset(last_nidpid) &&
+ src_nid != page_to_nid(page) &&
+ last_pid != (current->pid & LAST__PID_MASK)) {
+ spin_unlock(&mm->page_table_lock);
+ split_huge_page(page);
+ put_page(page);
+ return 0;
+ }
put_page(page);
goto clear_pmdnuma;
}
@@ -2398,6 +2415,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
unsigned long _address;
spinlock_t *ptl;
int node = NUMA_NO_NODE;
+ int hint_node = NUMA_NO_NODE;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2427,8 +2445,20 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
* be more sophisticated and look at more pages,
* but isn't for now.
*/
- if (node == NUMA_NO_NODE)
+ if (node == NUMA_NO_NODE) {
+ int nidpid = page_nidpid_last(page);
node = page_to_nid(page);
+ hint_node = nidpid_to_nid(nidpid);
+ }
+
+ /*
+ * If the range is receiving hinting faults from CPUs on
+ * different nodes then prioritise locality over TLB
+ * misses
+ */
+ if (nidpid_to_nid(page_nidpid_last(page)) != hint_node)
+ goto out_unmap;
+
VM_BUG_ON(PageCompound(page));
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
goto out_unmap;
--
Mel Gorman
SUSE Labs
On Sat, Jan 11, 2014 at 12:23:15AM +0200, Kirill A. Shutemov wrote:
> On Fri, Jan 10, 2014 at 04:01:55PM -0600, Alex Thorlton wrote:
> > On Fri, Jan 10, 2014 at 10:23:10PM +0200, Kirill A. Shutemov wrote:
> > > Do you know what causes the difference? I prefer to fix THP instead of
> > > adding a new knob to disable it.
> >
> > The issue is that when you touch 1 byte of an untouched, contiguous 2MB
> > chunk, a THP will be handed out, and the THP will be stuck on whatever
> > node the chunk was originally referenced from. If many remote nodes
> > need to do work on that same chunk, they'll be making remote accesses.
> > With THP disabled, 4K pages can be handed out to separate nodes as
> > they're needed, greatly reducing the amount of remote accesses to
> > memory.
>
> I think this problem *potentially* could be fixed by the NUMA balancer.
> (Although, I don't really know how the balancer works...)
>
> If we see NUMA hint faults for addresses in different 4k pages inside a huge
> page from more than one node, we could split the huge page.
>
> Mel, is it possible? Do we collect enough info to make the decision?
>
Potentially the hinting faults can be used to decide whether to split or
not but currently there is only limited information. You can detect if
the last faulting process was on the same node but not if the faults were
in different parts of the THP that would justify a split.
--
Mel Gorman
SUSE Labs
On Tue, Jan 14, 2014 at 03:44:57PM +0000, Mel Gorman wrote:
> On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote:
> > On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> > > We already have the information to determine if a page is shared across
> > > nodes, Mel even had some prototype code to do splits under those
> > > conditions.
> >
> > I'm aware that we can determine if pages are shared across nodes, but I
> > thought that Mel's code to split pages under these conditions had some
> > performance issues. I know I've seen the code that Mel wrote to do
> > this, but I can't seem to dig it up right now. Could you point me to
> > it?
> >
>
> It was a lot of revisions ago! The git branches no longer exist but the
> diff from the monolithic patches is below. The baseline was v3.10 and
> this will no longer apply but you'll see the two places where I added a
> split_huge_page and prevented khugepaged collapsing them again.
Thanks, Mel. I remember seeing this code a while back when we were
discussing THP/locality issues.
> At the time, the performance with it applied was much worse but it was a
> 10 minute patch as a distraction. There was a range of basic problems that
> had to be tackled before there was any point looking at splitting THP due
> to locality. I did not pursue it further and have not revisited it since.
So, in your opinion, is this something we should look into further
before moving towards the per-mm switch that I propose here? I
personally think that it will be tough to get this to perform as well as
a method that totally disables THP when requested, or a method that
tries to prevent THPs from being handed out in certain situations, since
we'll be doing the work of both making and splitting a THP in the case
where remote accesses are made to the page.
I also think there could be some issues with over-zealously splitting
pages, since it sounds like we can only determine if an access is from a
remote node. We don't have a good way of determining how many accesses
are remote vs. local, or how many separate nodes are accessing a page.
For example, I can see this being a problem if we have a large
multi-node system, where only two nodes are accessing a THP. We might
end up splitting that THP, but if relatively few remote nodes are
accessing it, it may not be worth the time. The split only seems
worthwhile to me if the majority of accesses are remote, which sounds
like it would be hard to determine.
One thing we could possibly do would be to add some accounting
structures to the mm_struct or some other appropriate location; then we
could keep track of how many distinct remote nodes are accessing a THP
and decide to split based on that. However, there's still the overhead
of creating/splitting the THP, and the extra
space/time needed to do the proper accounting work may be
counterproductive (if this is even possible, I'm just thinking out loud
here).
- Alex
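For what it's worth, the accounting being imagined above might look
roughly like this (entirely hypothetical; nothing like it exists in the
patch, and the thresholds are made up for illustration):

#include <linux/nodemask.h>
#include <linux/types.h>

/* hypothetical per-THP access accounting */
struct thp_access_stats {
	nodemask_t accessors;	/* nodes that have touched this THP */
	unsigned long local;	/* hinting faults from the home node */
	unsigned long remote;	/* hinting faults from other nodes */
};

/*
 * Record a hinting fault and decide whether a split looks worthwhile:
 * only when several distinct nodes are involved and remote faults
 * clearly dominate.  Both thresholds are invented.
 */
static bool thp_split_worthwhile(struct thp_access_stats *s,
				 int nid, int home_nid)
{
	node_set(nid, s->accessors);
	if (nid == home_nid)
		s->local++;
	else
		s->remote++;

	return nodes_weight(s->accessors) > 2 && s->remote > 4 * s->local;
}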
On Tue, Jan 14, 2014 at 01:38:01PM -0600, Alex Thorlton wrote:
> On Tue, Jan 14, 2014 at 03:44:57PM +0000, Mel Gorman wrote:
> > On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote:
> > > On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> > > > We already have the information to determine if a page is shared across
> > > > nodes, Mel even had some prototype code to do splits under those
> > > > conditions.
> > >
> > > I'm aware that we can determine if pages are shared across nodes, but I
> > > thought that Mel's code to split pages under these conditions had some
> > > performance issues. I know I've seen the code that Mel wrote to do
> > > this, but I can't seem to dig it up right now. Could you point me to
> > > it?
> > >
> >
> > It was a lot of revisions ago! The git branches no longer exist but the
> > diff from the monolithic patches is below. The baseline was v3.10 and
> > this will no longer apply but you'll see the two places where I added a
> > split_huge_page and prevented khugepaged collapsing them again.
>
> Thanks, Mel. I remember seeing this code a while back when we were
> discussing THP/locality issues.
>
> > At the time, the performance with it applied was much worse but it was a
> > 10 minute patch as a distraction. There was a range of basic problems that
> > had to be tackled before there was any point looking at splitting THP due
> > to locality. I did not pursue it further and have not revisited it since.
>
> So, in your opinion, is this something we should look into further
> before moving towards the per-mm switch that I propose here?
No because they have different purposes. Any potential split of THP from
automatic NUMA balancing context is due to it detecting that threads running
on CPUs on different nodes are accessing a THP. You are proposing to have
a per-mm flag that prevents THP being allocated in the first place. They
are two separate problems with decisions that are made at completely
different times.
> I
> personally think that it will be tough to get this to perform as well as
> a method that totally disables THP when requested, or a method that
> tries to prevent THPs from being handed out in certain situations, since
> we'll be doing the work of both making and splitting a THP in the case
> where remote accesses are made to the page.
>
I would expect that the alternative solution to a per-mm switch is to
reserve the naturally aligned pages for a THP promotion. Have a threshold
of pages that must be faulted before the full THP's worth of pages
is allocated, zero'd and a huge pmd established. That would defer the
THP setup costs until it was detected that it was necessary.
The per-mm THP switch is a massive hammer but not necessarily a bad one.
> I also think there could be some issues with over-zealously splitting
> pages, since it sounds like we can only determine if an access is from a
> remote node. We don't have a good way of determining how many accesses
> are remote vs. local, or how many separate nodes are accessing a page.
>
Indeed not, but it's a different problem. We also do not know if the
remote accesses are to a single page, in which case splitting it would
have zero benefit anyway.
--
Mel Gorman
SUSE Labs
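To make the promotion-threshold idea concrete, a rough sketch of the
bookkeeping might look like this (names, types and the threshold value
are all invented for illustration; the real fault-path plumbing is
elided):

#include <stdbool.h>

#define PAGES_PER_HUGE_PAGE 512	/* 2MB / 4KB */
#define PROMOTE_THRESHOLD   64	/* hypothetical tunable */

struct huge_region {
	unsigned short nr_faulted;	/* base pages faulted so far */
	bool promoted;			/* huge pmd established? */
};

/*
 * Called for each base-page fault within a naturally aligned 2MB
 * region.  Returns true once enough pages have been faulted to justify
 * allocating and zeroing the full THP, copying the already-populated
 * base pages, and establishing a huge pmd.
 */
static bool should_promote(struct huge_region *r)
{
	if (r->promoted || ++r->nr_faulted < PROMOTE_THRESHOLD)
		return false;
	r->promoted = true;
	return true;
}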
On Wed, Jan 22, 2014 at 10:26:21AM +0000, Mel Gorman wrote:
> On Tue, Jan 14, 2014 at 01:38:01PM -0600, Alex Thorlton wrote:
> > On Tue, Jan 14, 2014 at 03:44:57PM +0000, Mel Gorman wrote:
> > > On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote:
> > > > On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> > > > > We already have the information to determine if a page is shared across
> > > > > nodes, Mel even had some prototype code to do splits under those
> > > > > conditions.
> > > >
> > > > I'm aware that we can determine if pages are shared across nodes, but I
> > > > thought that Mel's code to split pages under these conditions had some
> > > > performance issues. I know I've seen the code that Mel wrote to do
> > > > this, but I can't seem to dig it up right now. Could you point me to
> > > > it?
> > > >
> > >
> > > It was a lot of revisions ago! The git branches no longer exist but the
> > > diff from the monolithic patches is below. The baseline was v3.10 and
> > > this will no longer apply but you'll see the two places where I added a
> > > split_huge_page and prevented khugepaged collapsing them again.
> >
> > Thanks, Mel. I remember seeing this code a while back when we were
> > discussing THP/locality issues.
> >
> > > At the time, the performance with it applied was much worse but it was a
> > > 10 minute patch as a distraction. There was a range of basic problems that
> > > had to be tackled before there was any point looking at splitting THP due
> > > to locality. I did not pursue it further and have not revisited it since.
> >
> > So, in your opinion, is this something we should look into further
> > before moving towards the per-mm switch that I propose here?
>
> No because they have different purposes. Any potential split of THP from
> automatic NUMA balancing context is due to it detecting that threads running
> on CPUs on different nodes are accessing a THP. You are proposing to have
> a per-mm flag that prevents THP being allocated in the first place. They
> are two separate problems with decisions that are made at completely
> different times.
Makes sense. That discussion doesn't really fit here.
> > I
> > personally think that it will be tough to get this to perform as well as
> > a method that totally disables THP when requested, or a method that
> > tries to prevent THPs from being handed out in certain situations, since
> > we'll be doing the work of both making and splitting a THP in the case
> > where remote accesses are made to the page.
> >
>
> I would expect that the alternative solution to a per-mm switch is to
> reserve the naturally aligned pages for a THP promotion. Have a threshold
> of pages that must be faulted before the full THP's worth of pages
> is allocated, zero'd and a huge pmd established. That would defer the
> THP setup costs until it was detected that it was necessary.
I have some half-finished patches that I was working on a month or so
ago, to do exactly this (I think you were involved with some of the
discussion, maybe? I'd have to dig up the e-mails to be sure). After
cycling through numerous other methods of handling this problem, I still
like that idea, but I think it's going to require a decent amount of
effort to get finished.
> The per-mm THP switch is a massive hammer but not necessarily a bad one.
I agree that it's a big hammer, but I suppose it's a bit more of a
surgical strike than using the system-wide switch, and can be especially
useful in some cases I've discussed, where madvise isn't really an
option.
> > I also think there could be some issues with over-zealously splitting
> > pages, since it sounds like we can only determine if an access is from a
> > remote node. We don't have a good way of determining how many accesses
> > are remote vs. local, or how many separate nodes are accessing a page.
> >
>
> Indeed not, but it's a different problem. We also do not know if the
> remote accesses are to a single page, in which case splitting it would
> have zero benefit anyway.
Agreed. There are several possible problems with that approach, but, as
you've stated, that's a different problem, so no reason to discuss it
here :)
Thanks for the input, Mel!
- Alex
On Wed, 22 Jan 2014, Alex Thorlton wrote:
> > I would expect that the alternative solution to a per-mm switch is to
> > reserve the naturally aligned pages for a THP promotion. Have a threshold
> > of pages that must be faulted before the full THP's worth of pages
> > is allocated, zero'd and a huge pmd established. That would defer the
> > THP setup costs until it was detected that it was necessary.
>
> I have some half-finished patches that I was working on a month or so
> ago, to do exactly this (I think you were involved with some of the
> discussion, maybe? I'd have to dig up the e-mails to be sure). After
> cycling through numerous other methods of handling this problem, I still
> like that idea, but I think it's going to require a decent amount of
> effort to get finished.
>
If you're going to go this route, I think a sane value would be
max_ptes_none, which controls when khugepaged would re-collapse a split
hugepage into another hugepage after something does madvise(MADV_DONTNEED)
to a region, for example. Otherwise it could collapse that memory into a
hugepage even though it wasn't faulted as huge.