This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
hugepages using prctl.
Changes for v2:
* Pulled code for prctl helper functions into prctl to make things more
concise.
* Changed PRCTL_SET_THP_DISABLE to accept an argument to set/clear the
THP_DISABLE bit, instead of having two separate PRCTLs for this.
* Removed ifdef in prctl.h that defined MMF_THP_DISABLE based on whether
or not CONFIG_TRANSPARENT_HUGEPAGE was set.
* Added code to get khugepaged to ignore mm_structs with THP disabled.
The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified and using a malloc hook with
madvise is not an option (i.e. statically allocated data). This patch
allows us to do just that, without affecting other jobs running on the
system.
We need to do this sort of thing for jobs where THP hurts performance,
due to the possibility of increased remote memory accesses that can be
created by situations such as the following:
When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
be handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from. If many remote nodes need to do work on that
same chunk, they'll be making remote accesses.
With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to memory.
Here are some results showing the improvement that my test case gets
when the MMF_THP_DISABLE flag is clear vs. set:
MMF_THP_DISABLE clear:
# perf stat -a -r 3 ./prctl_wrapper_mmv2 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Performance counter stats for './prctl_wrapper_mmv2 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):
267537198.932548 task-clock # 641.115 CPUs utilized ( +- 0.03% ) [100.00%]
909,086 context-switches # 0.000 M/sec ( +- 0.07% ) [100.00%]
1,004 CPU-migrations # 0.000 M/sec ( +- 1.49% ) [100.00%]
137,942 page-faults # 0.000 M/sec ( +- 1.70% )
350,607,742,932,846 cycles # 1.311 GHz ( +- 0.03% ) [100.00%]
523,280,989,487,579 stalled-cycles-frontend # 149.25% frontend cycles idle ( +- 0.04% ) [100.00%]
395,143,659,263,350 stalled-cycles-backend # 112.70% backend cycles idle ( +- 0.24% ) [100.00%]
147,359,655,811,699 instructions # 0.42 insns per cycle
# 3.55 stalled cycles per insn ( +- 0.05% ) [100.00%]
26,897,860,986,646 branches # 100.539 M/sec ( +- 0.10% ) [100.00%]
1,264,232,340 branch-misses # 0.00% of all branches ( +- 0.65% )
417.299580464 seconds time elapsed ( +- 0.03% )
MMF_THP_DISABLE set:
# perf stat -a -r 3 ./prctl_wrapper_mmv2 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Performance counter stats for './prctl_wrapper_mmv2 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):
142442476.218751 task-clock # 642.085 CPUs utilized ( +- 0.74% ) [100.00%]
520,084 context-switches # 0.000 M/sec ( +- 0.79% ) [100.00%]
853 CPU-migrations # 0.000 M/sec ( +- 14.53% ) [100.00%]
62,396,741 page-faults # 0.000 M/sec ( +- 0.01% )
155,509,431,078,100 cycles # 1.092 GHz ( +- 0.75% ) [100.00%]
213,552,817,573,474 stalled-cycles-frontend # 137.32% frontend cycles idle ( +- 1.23% ) [100.00%]
117,337,842,556,506 stalled-cycles-backend # 75.45% backend cycles idle ( +- 2.09% ) [100.00%]
178,809,541,860,114 instructions # 1.15 insns per cycle
# 1.19 stalled cycles per insn ( +- 0.18% ) [100.00%]
26,295,305,012,722 branches # 184.603 M/sec ( +- 0.42% ) [100.00%]
754,391,541 branch-misses # 0.00% of all branches ( +- 0.50% )
221.843813599 seconds time elapsed ( +- 0.75% )
As you can see, this particular test gets about a 2x performance boost
when THP is turned off. Here's a link to the test, along with the
wrapper that I used:
http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv2.tar.gz
There are still a few things that might need tweaked here, but I wanted
to get the patch out there to get a discussion started. Two things I
noted from the old patch discussion that will likely need to be
addressed are:
* Patch doesn't currently account for get_user_pages allocations. I'm
actually not sure if this needs to be addressed. From what I know, get
user pages calls down to handle_mm_fault, which should prevent THPs
from being handed out where necessary. If anybody can confirm that,
it would be appreciated.
* Current behavior is to have fork()/exec()'d processes inherit the
flag. Andrew Morton pointed out some possible issues with this, so we
may need to rethink this behavior.
- If parent process has THP disabled, and forks off a child, the child
will also have THP disabled. This may not be the desired behavior.
Signed-off-by: Alex Thorlton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: [email protected]
---
include/linux/huge_mm.h | 6 ++++--
include/linux/sched.h | 6 +++++-
include/uapi/linux/prctl.h | 3 +++
kernel/sys.c | 11 +++++++++++
4 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 91672e2..475f59f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H
+#include <linux/sched.h>
+
extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
@@ -74,7 +76,8 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
((__vma)->vm_flags & VM_HUGEPAGE))) && \
!((__vma)->vm_flags & VM_NOHUGEPAGE) && \
- !is_vma_temporary_stack(__vma))
+ !is_vma_temporary_stack(__vma) && \
+ !test_bit(MMF_THP_DISABLE, &(__vma)->vm_mm->flags))
#define transparent_hugepage_defrag(__vma) \
((transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
@@ -227,7 +230,6 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
{
return 0;
}
-
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 53f97eb..0ff0c74 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -373,7 +373,11 @@ extern int get_dumpable(struct mm_struct *mm);
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
-#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
+#define MMF_THP_DISABLE 21 /* disable THP for this mm */
+#define MMF_THP_DISABLE_MASK (1 << MMF_THP_DISABLE)
+
+#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | MMF_THP_DISABLE_MASK)
+
struct sighand_struct {
atomic_t count;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 289760f..58afc04 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -149,4 +149,7 @@
#define PR_GET_TID_ADDRESS 40
+#define PR_SET_THP_DISABLE 41
+#define PR_GET_THP_DISABLE 42
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c723113..097bfaa 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1998,6 +1998,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
return current->no_new_privs ? 1 : 0;
+ case PR_SET_THP_DISABLE:
+ if (arg2)
+ set_bit(MMF_THP_DISABLE, &me->mm->flags);
+ else
+ clear_bit(MMF_THP_DISABLE, &me->mm->flags);
+ break;
+ case PR_GET_THP_DISABLE:
+ error = put_user(test_bit(MMF_THP_DISABLE,
+ &me->mm->flags),
+ (int __user *) arg2);
+ break;
default:
error = -EINVAL;
break;
--
1.7.12.4
This just adds a simple check to get khugepaged to behave
appropriately when MMF_THP_DISABLE is set.
Signed-off-by: Alex Thorlton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: [email protected]
---
mm/huge_memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c0b172..3cfe6b4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2049,7 +2049,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
static inline int khugepaged_test_exit(struct mm_struct *mm)
{
- return atomic_read(&mm->mm_users) == 0;
+ return atomic_read(&mm->mm_users) == 0 ||
+ (mm->flags & MMF_THP_DISABLE_MASK);
}
int __khugepaged_enter(struct mm_struct *mm)
--
1.7.12.4
On Thu, Jan 16, 2014 at 1:01 PM, Alex Thorlton <[email protected]> wrote:
> This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
> hugepages using prctl.
>
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1998,6 +1998,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> if (arg2 || arg3 || arg4 || arg5)
> return -EINVAL;
> return current->no_new_privs ? 1 : 0;
> + case PR_SET_THP_DISABLE:
> + if (arg2)
> + set_bit(MMF_THP_DISABLE, &me->mm->flags);
> + else
> + clear_bit(MMF_THP_DISABLE, &me->mm->flags);
> + break;
> + case PR_GET_THP_DISABLE:
> + error = put_user(test_bit(MMF_THP_DISABLE,
> + &me->mm->flags),
> + (int __user *) arg2);
> + break;
Usual nit: please return -EINVAL if unused args are nonzero. This
makes it possible to extend these APIs in the future.
--Andy
On 01/16, Alex Thorlton wrote:
>
> + case PR_GET_THP_DISABLE:
> + error = put_user(test_bit(MMF_THP_DISABLE,
> + &me->mm->flags),
> + (int __user *) arg2);
> + break;
Do we really want put_user?
error = test_bit(MMF_THP_DISABLE);
can work too and looks simpler (and to me simpler to use in userspace).
PR_GET_DUMPABLE works this way.
Oleg.
On 01/16, Alex Thorlton wrote:
>
> static inline int khugepaged_test_exit(struct mm_struct *mm)
> {
> - return atomic_read(&mm->mm_users) == 0;
> + return atomic_read(&mm->mm_users) == 0 ||
> + (mm->flags & MMF_THP_DISABLE_MASK);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
test_bit(MMF_THP_DISABLE) ?
And I am not sure this and another check in transparent_hugepage_enabled
is actually right...
I think that MMF_THP_DISABLE_MASK should not disable thp if this
vma has VM_HUGEPAGE set, iow perhaps madvise() should work even
after PR_SET_THP_DISABLE?
IOW, MMF_THP_DISABLE should act as khugepaged_req_madv().
But again, I won't argue.
Oleg.
On Fri, Jan 17, 2014 at 09:09:23PM +0100, Oleg Nesterov wrote:
> On 01/16, Alex Thorlton wrote:
> >
> > + case PR_GET_THP_DISABLE:
> > + error = put_user(test_bit(MMF_THP_DISABLE,
> > + &me->mm->flags),
> > + (int __user *) arg2);
> > + break;
>
> Do we really want put_user?
>
> error = test_bit(MMF_THP_DISABLE);
>
> can work too and looks simpler (and to me simpler to use in userspace).
That makes sense. I'll change that; definitely easier to use from
userspace that way.
- Alex
On Fri, Jan 17, 2014 at 11:54:09AM -0800, Andy Lutomirski wrote:
> Usual nit: please return -EINVAL if unused args are nonzero. This
> makes it possible to extend these APIs in the future.
Got it. Thanks, Andy!
- Alex
On Fri, Jan 17, 2014 at 09:34:44PM +0100, Oleg Nesterov wrote:
> On 01/16, Alex Thorlton wrote:
> >
> > static inline int khugepaged_test_exit(struct mm_struct *mm)
> > {
> > - return atomic_read(&mm->mm_users) == 0;
> > + return atomic_read(&mm->mm_users) == 0 ||
> > + (mm->flags & MMF_THP_DISABLE_MASK);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> test_bit(MMF_THP_DISABLE) ?
Probably should just use the bitop here, good call.
> And I am not sure this and another check in transparent_hugepage_enabled
> is actually right...
>
> I think that MMF_THP_DISABLE_MASK should not disable thp if this
> vma has VM_HUGEPAGE set, iow perhaps madvise() should work even
> after PR_SET_THP_DISABLE?
>
> IOW, MMF_THP_DISABLE should act as khugepaged_req_madv().
I hadn't thought of this, but maybe that's a good idea. That way we can
turn off THP in general for an mm, but the places in code that
*specifically* request THP will still get it. I don't see why that
would be a problem, as long as we go with the assumption that, if
somebody is explicitly requesting THPs, they probably have a good reason
for doing so.
Thanks for the input!
- Alex
On Thu, Jan 16, 2014 at 03:01:43PM -0600, Alex Thorlton wrote:
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 91672e2..475f59f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -1,6 +1,8 @@
> #ifndef _LINUX_HUGE_MM_H
> #define _LINUX_HUGE_MM_H
>
> +#include <linux/sched.h>
> +
Hm, now <linux/mm.h> depends on <linux/sched.h>. It doesn't look as a good
idea.
Why do we have MMF_* defines in <linux/sched.h>?
Wouldn't it more appropriate to put them in <linux/mm_types.h>?
> extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> struct vm_area_struct *vma,
> unsigned long address, pmd_t *pmd,
> @@ -74,7 +76,8 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
> (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> ((__vma)->vm_flags & VM_HUGEPAGE))) && \
> !((__vma)->vm_flags & VM_NOHUGEPAGE) && \
> - !is_vma_temporary_stack(__vma))
> + !is_vma_temporary_stack(__vma) && \
> + !test_bit(MMF_THP_DISABLE, &(__vma)->vm_mm->flags))
> #define transparent_hugepage_defrag(__vma) \
> ((transparent_hugepage_flags & \
> (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
> @@ -227,7 +230,6 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
> {
> return 0;
> }
> -
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #endif /* _LINUX_HUGE_MM_H */
Why?
--
Kirill A. Shutemov
On Thu, Jan 16, 2014 at 03:01:44PM -0600, Alex Thorlton wrote:
> This just adds a simple check to get khugepaged to behave
> appropriately when MMF_THP_DISABLE is set.
>
> Signed-off-by: Alex Thorlton <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: [email protected]
>
> ---
> mm/huge_memory.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c0b172..3cfe6b4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2049,7 +2049,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
>
> static inline int khugepaged_test_exit(struct mm_struct *mm)
> {
> - return atomic_read(&mm->mm_users) == 0;
> + return atomic_read(&mm->mm_users) == 0 ||
> + (mm->flags & MMF_THP_DISABLE_MASK);
__khugepaged_enter() has VM_BUG_ON(khugepaged_test_exit(mm)).
Do we really want to crash there if MMF_THP_DISABLE is set?
--
Kirill A. Shutemov
On Sun, Jan 19, 2014 at 01:41:34AM +0200, Kirill A. Shutemov wrote:
> On Thu, Jan 16, 2014 at 03:01:43PM -0600, Alex Thorlton wrote:
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 91672e2..475f59f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -1,6 +1,8 @@
> > #ifndef _LINUX_HUGE_MM_H
> > #define _LINUX_HUGE_MM_H
> >
> > +#include <linux/sched.h>
> > +
>
> Hm, now <linux/mm.h> depends on <linux/sched.h>. It doesn't look as a good
> idea.
>
> Why do we have MMF_* defines in <linux/sched.h>?
> Wouldn't it more appropriate to put them in <linux/mm_types.h>?
I'm not sure about that. I just placed my definitions where everything
else was defined. It might make more sense to have these in
mm_types.h, though that's a separate effort. For now, I don't think I
have a choice but to include sched.h here, but it does seem to make
sense that I should be able to see the defined mm flags from huge_mm.h,
without having to include sched.h. I'm sure there are other places where
this applies as well.
> > extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> > struct vm_area_struct *vma,
> > unsigned long address, pmd_t *pmd,
> > @@ -74,7 +76,8 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
> > (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> > ((__vma)->vm_flags & VM_HUGEPAGE))) && \
> > !((__vma)->vm_flags & VM_NOHUGEPAGE) && \
> > - !is_vma_temporary_stack(__vma))
> > + !is_vma_temporary_stack(__vma) && \
> > + !test_bit(MMF_THP_DISABLE, &(__vma)->vm_mm->flags))
> > #define transparent_hugepage_defrag(__vma) \
> > ((transparent_hugepage_flags & \
> > (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
> > @@ -227,7 +230,6 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
> > {
> > return 0;
> > }
> > -
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > #endif /* _LINUX_HUGE_MM_H */
>
> Why?
Are you asking about the line I removed there? That was probably an
accident. I'll fix that
- Alex
On Sun, Jan 19, 2014 at 01:49:57AM +0200, Kirill A. Shutemov wrote:
> __khugepaged_enter() has VM_BUG_ON(khugepaged_test_exit(mm)).
> Do we really want to crash there if MMF_THP_DISABLE is set?
No, definitely not. Upon review, khugepaged_test_exit is the wrong
place to do this check. I think I need to move it up to
khugepaged_scan_mm_slot for this to work correctly. Thanks for pointing
that out, Kirill.
- Alex
On 01/20, Alex Thorlton wrote:
>
> No, definitely not. Upon review, khugepaged_test_exit is the wrong
> place to do this check. I think I need to move it up to
> khugepaged_scan_mm_slot for this to work correctly.
Why? unless a MMF_THP_DISABLE task does madvise(MADV_HUGEPAGE)
khugepaged_scan_mm_slot() should never see its ->mm ?
Although I got lost a bit, and probably misunderstood... but it
seems to me that whatever you do this patch should not touch
khugepaged_scan_mm_slot.
Oleg.
On Mon, Jan 20, 2014 at 09:15:25PM +0100, Oleg Nesterov wrote:
> On 01/20, Alex Thorlton wrote:
> >
> > No, definitely not. Upon review, khugepaged_test_exit is the wrong
> > place to do this check. I think I need to move it up to
> > khugepaged_scan_mm_slot for this to work correctly.
>
> Why? unless a MMF_THP_DISABLE task does madvise(MADV_HUGEPAGE)
> khugepaged_scan_mm_slot() should never see its ->mm ?
>
> Although I got lost a bit, and probably misunderstood... but it
> seems to me that whatever you do this patch should not touch
> khugepaged_scan_mm_slot.
Maybe I've gotten myself confused as well :) After looking through the
code some more, my understanding is that khugepaged_test_exit is used to
make sure that __khugepaged_exit isn't running from underneath at certain
times, so to have khugepaged_test_exit return true when __khugepaged_exit
is not necessarily running, seems incorrect to me.
I think the check for MMF_THP_DISABLE should occur at the same time as
we check khugepaged_test_exit, but should occur separately, since I
don't really believe the two checks are related. Something like this in
khugepaged_scan_mm_slot:
mm = mm_slot->mm;
down_read(&mm->mmap_sem);
- if (unlikely(khugepaged_test_exit(mm)))
+ if (unlikely(khugepaged_test_exit(mm) || check_mmf_thp_disable(mm)))
vma = NULL;
else
vma = find_vma(mm, khugepaged_scan.address);
progress++;
I think this makes more sense, but I may not be looking at this
correctly. Thoughts?
- Alex
Alex, Andrew, I think this simple series makes sense in any case,
but _perhaps_ it can also help THP_DISABLE.
On 01/20, Alex Thorlton wrote:
>
> On Mon, Jan 20, 2014 at 09:15:25PM +0100, Oleg Nesterov wrote:
> >
> > Although I got lost a bit, and probably misunderstood... but it
> > seems to me that whatever you do this patch should not touch
> > khugepaged_scan_mm_slot.
>
> Maybe I've gotten myself confused as well :) After looking through the
> code some more, my understanding is that khugepaged_test_exit is used to
> make sure that __khugepaged_exit isn't running from underneath at certain
> times, so to have khugepaged_test_exit return true when __khugepaged_exit
> is not necessarily running, seems incorrect to me.
Still can't understand... probably I need to see v3.
But you know, I have another idea. Not sure you will like it, and probably
I missed something.
Can't we simply add VM_NOHUGEPAGE into ->def_flags? See the (untested)
patch below, on top of this series.
What do you think?
Oleg.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cedd00..bc1dd9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -167,6 +167,8 @@ extern unsigned int kobjsize(const void *objp);
*/
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
+#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
+
/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 289760f..58afc04 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -149,4 +149,7 @@
#define PR_GET_TID_ADDRESS 40
+#define PR_SET_THP_DISABLE 41
+#define PR_GET_THP_DISABLE 42
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index b84bef7..f6d020b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
- mm->flags = (current->mm) ?
- (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
atomic_long_set(&mm->nr_ptes, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
@@ -538,8 +536,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm_init_aio(mm);
mm_init_owner(mm, p);
- if (likely(!mm_alloc_pgd(mm))) {
+ if (current->mm) {
+ mm->flags = current->mm->flags & MMF_INIT_MASK;
+ mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
+ } else {
+ mm->flags = default_dump_filter;
mm->def_flags = 0;
+ }
+
+ if (likely(!mm_alloc_pgd(mm))) {
mmu_notifier_mm_init(mm);
return mm;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index ac1842e..eb8b0fc 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2029,6 +2029,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
return current->no_new_privs ? 1 : 0;
+ case PR_SET_THP_DISABLE:
+ case PR_GET_THP_DISABLE:
+ down_write(&me->mm->mmap_sem);
+ if (option == PR_SET_THP_DISABLE) {
+ if (arg2)
+ me->mm->def_flags |= VM_NOHUGEPAGE;
+ else
+ me->mm->def_flags &= ~VM_NOHUGEPAGE;
+ } else {
+ error = !!(me->mm->flags && VM_NOHUGEPAGE);
+ }
+ up_write(&me->mm->mmap_sem);
+ break;
default:
error = -EINVAL;
break;
load_elf_binary() sets current->mm->def_flags = def_flags and
def_flags is always zero. Not only this looks strange, this is
unnecessary because mm_init() has already set ->def_flags = 0.
Signed-off-by: Oleg Nesterov <[email protected]>
---
fs/binfmt_elf.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 571a423..19f569c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -582,7 +582,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
unsigned long start_code, end_code, start_data, end_data;
unsigned long reloc_func_desc __maybe_unused = 0;
int executable_stack = EXSTACK_DEFAULT;
- unsigned long def_flags = 0;
struct pt_regs *regs = current_pt_regs();
struct {
struct elfhdr elf_ex;
@@ -722,9 +721,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (retval)
goto out_free_dentry;
- /* OK, This is the point of no return */
- current->mm->def_flags = def_flags;
-
/* Do this immediately, since STACK_TOP as used in setup_arg_pages
may depend on the personality. */
SET_PERSONALITY(loc->elf_ex);
--
1.5.5.1
hugepage_madvise() checks "mm->def_flags & VM_NOHUGEPAGE" but
this can be never true, currently mm->def_flags can only have
VM_LOCKED.
And otoh we might want to add VM_NOHUGEPAGE into ->def_flags
but override it in vma->vm_flags via madvise(MADV_HUGEPAGE).
Signed-off-by: Oleg Nesterov <[email protected]>
---
mm/huge_memory.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bccd5a6..e8b656c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1941,8 +1941,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
*/
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
- if (mm->def_flags & VM_NOHUGEPAGE)
- return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
--
1.5.5.1
On 01/22, Oleg Nesterov wrote:
>
> Can't we simply add VM_NOHUGEPAGE into ->def_flags? See the (untested)
> patch below, on top of this series.
And perhaps the patch below makes sense as a separate change, I dunno.
------------------------------------------------------------------------------
Subject: [PATCH] mm: introduce VM_INIT_DEF_MASK
mm_init() always sets new_mm->def_flags = 0. We might want to
inherit some bits like we do with mm->flags.
Add the new VM_INIT_DEF_MASK (currently zero) which acts like
MMF_INIT_MASK for ->def_flags.
---
include/linux/mm.h | 2 ++
kernel/fork.c | 11 ++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cedd00..0a340c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -167,6 +167,8 @@ extern unsigned int kobjsize(const void *objp);
*/
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
+#define VM_INIT_DEF_MASK 0
+
/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
diff --git a/kernel/fork.c b/kernel/fork.c
index b84bef7..f6d020b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
- mm->flags = (current->mm) ?
- (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
atomic_long_set(&mm->nr_ptes, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
@@ -538,8 +536,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm_init_aio(mm);
mm_init_owner(mm, p);
- if (likely(!mm_alloc_pgd(mm))) {
+ if (current->mm) {
+ mm->flags = current->mm->flags & MMF_INIT_MASK;
+ mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
+ } else {
+ mm->flags = default_dump_filter;
mm->def_flags = 0;
+ }
+
+ if (likely(!mm_alloc_pgd(mm))) {
mmu_notifier_mm_init(mm);
return mm;
}
--
1.5.5.1
On Wed, Jan 22, 2014 at 06:45:53PM +0100, Oleg Nesterov wrote:
> Alex, Andrew, I think this simple series makes sense in any case,
> but _perhaps_ it can also help THP_DISABLE.
>
> On 01/20, Alex Thorlton wrote:
> >
> > On Mon, Jan 20, 2014 at 09:15:25PM +0100, Oleg Nesterov wrote:
> > >
> > > Although I got lost a bit, and probably misunderstood... but it
> > > seems to me that whatever you do this patch should not touch
> > > khugepaged_scan_mm_slot.
> >
> > Maybe I've gotten myself confused as well :) After looking through the
> > code some more, my understanding is that khugepaged_test_exit is used to
> > make sure that __khugepaged_exit isn't running from underneath at certain
> > times, so to have khugepaged_test_exit return true when __khugepaged_exit
> > is not necessarily running, seems incorrect to me.
>
> Still can't understand... probably I need to see v3.
I think the appropriate place to check this is actually in
hugepage_vma_check, so you're correct that we don't need to directly
touch khugepaged_scan_mm_slot, we just need to change a different
function underneath it.
We'll table that for now, though. I'll put out a v3 later today, so you
can see what I'm talking about, but I think your idea looks like it will
be better in the end.
> But you know, I have another idea. Not sure you will like it, and probably
> I missed something.
>
> Can't we simply add VM_NOHUGEPAGE into ->def_flags? See the (untested)
> patch below, on top of this series.
>
> What do you think?
At a glance, without testing, it looks like a good idea to me. By
using def_flags, we leverage functionality that's already in place to
achieve the same result. We don't need to add any new checks into the
fault path or into khugepaged, since we're just leveraging the
VM_HUGEPAGE/NOHUGEPAGE flag, which we already check for. We also get
the behavior that you suggested (madvise is still respected, even with
the new THP disable prctl set), for free with this method.
I like the idea, but I think that it should probably be a separate
change from the other few cleanups that you proposed along with it, since
they're somewhat unrelated to this particular issue. Do you agree?
Also, one small note on the code:
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ac1842e..eb8b0fc 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2029,6 +2029,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> if (arg2 || arg3 || arg4 || arg5)
> return -EINVAL;
> return current->no_new_privs ? 1 : 0;
> + case PR_SET_THP_DISABLE:
> + case PR_GET_THP_DISABLE:
> + down_write(&me->mm->mmap_sem);
> + if (option == PR_SET_THP_DISABLE) {
> + if (arg2)
> + me->mm->def_flags |= VM_NOHUGEPAGE;
> + else
> + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> + } else {
> + error = !!(me->mm->flags && VM_NOHUGEPAGE);
Should be:
error = !!(me->mm->def_flags && VM_NOHUGEPAGE);
Correct?
- Alex
On 01/22, Alex Thorlton wrote:
>
> At a glance, without testing, it looks like a good idea to me. By
> using def_flags, we leverage functionality that's already in place to
> achieve the same result. We don't need to add any new checks into the
> fault path or into khugepaged, since we're just leveraging the
> VM_HUGEPAGE/NOHUGEPAGE flag, which we already check for. We also get
> the behavior that you suggested (madvise is still respected, even with
> the new THP disable prctl set), for free with this method.
Yes, exactly.
> I like the idea, but I think that it should probably be a separate
> change from the other few cleanups that you proposed along with it,
Yes, sure, that is why I sent them separately,
> since
> they're somewhat unrelated to this particular issue. Do you agree?
Not really. Note that without 1/2 VM_NOHUGEPAGE won't survive after
exec. And without 2/2 madvise(MADV_HUGEPAGE) won't work after
PR_SET_THP_DISABLE.
But again, I think that these 2 simple cleanups make sense even without
PR_SET_THP_DISABLE.
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index ac1842e..eb8b0fc 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2029,6 +2029,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > if (arg2 || arg3 || arg4 || arg5)
> > return -EINVAL;
> > return current->no_new_privs ? 1 : 0;
> > + case PR_SET_THP_DISABLE:
> > + case PR_GET_THP_DISABLE:
> > + down_write(&me->mm->mmap_sem);
> > + if (option == PR_SET_THP_DISABLE) {
> > + if (arg2)
> > + me->mm->def_flags |= VM_NOHUGEPAGE;
> > + else
> > + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > + } else {
> > + error = !!(me->mm->flags && VM_NOHUGEPAGE);
>
> Should be:
>
> error = !!(me->mm->def_flags && VM_NOHUGEPAGE);
No, we need to return 1 if this bit is set ;)
Oleg.
On 01/22, Oleg Nesterov wrote:
>
> On 01/22, Alex Thorlton wrote:
> > > + case PR_SET_THP_DISABLE:
> > > + case PR_GET_THP_DISABLE:
> > > + down_write(&me->mm->mmap_sem);
> > > + if (option == PR_SET_THP_DISABLE) {
> > > + if (arg2)
> > > + me->mm->def_flags |= VM_NOHUGEPAGE;
> > > + else
> > > + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > > + } else {
> > > + error = !!(me->mm->flags && VM_NOHUGEPAGE);
> >
> > Should be:
> >
> > error = !!(me->mm->def_flags && VM_NOHUGEPAGE);
>
> No, we need to return 1 if this bit is set ;)
Damn, you are right of course, we need "&". I didn't notice "&&"
in the patch I sent and misunderstood your "&&" above ;) Sorry.
Oleg.
On Wed, Jan 22, 2014 at 08:43:27PM +0100, Oleg Nesterov wrote:
> On 01/22, Oleg Nesterov wrote:
> >
> > On 01/22, Alex Thorlton wrote:
> > > > + case PR_SET_THP_DISABLE:
> > > > + case PR_GET_THP_DISABLE:
> > > > + down_write(&me->mm->mmap_sem);
> > > > + if (option == PR_SET_THP_DISABLE) {
> > > > + if (arg2)
> > > > + me->mm->def_flags |= VM_NOHUGEPAGE;
> > > > + else
> > > > + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > > > + } else {
> > > > + error = !!(me->mm->flags && VM_NOHUGEPAGE);
> > >
> > > Should be:
> > >
> > > error = !!(me->mm->def_flags && VM_NOHUGEPAGE);
> >
> > No, we need to return 1 if this bit is set ;)
>
> Damn, you are right of course, we need "&". I didn't notice "&&"
> in the patch I sent and misunderstood your "&&" above ;) Sorry.
Actually, I didn't catch that either! Looking at it, though, we
definitely do want bitwise AND here, not logical.
However, what I was originally referring to is: Shouldn't we be
checking mm->***def_flags*** for the VM_NOHUGEPAGE bit, as opposed
to mm->flags? i.e. I think we want this:
error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
As opposed to:
error = !!(me->mm->flags && VM_NOHUGEPAGE);
The way I understand it, the VM_NOHUGEPAGE bit is defined for
mm->vma->flags, but not for mm->flags. Am I wrong here?
- Alex
On Wed, 22 Jan 2014, Oleg Nesterov wrote:
> hugepage_madvise() checks "mm->def_flags & VM_NOHUGEPAGE" but
> this can be never true, currently mm->def_flags can only have
> VM_LOCKED.
But line 1087 of arch/s390/mm/pgtable.c says
mm->def_flags |= VM_NOHUGEPAGE;
from 3eabaee998c787e7e1565574821652548f7fc003
"KVM: s390: allow sie enablement for multi-threaded programs".
Hugh
>
> And otoh we might want to add VM_NOHUGEPAGE into ->def_flags
> but override it in vma->vm_flags via madvise(MADV_HUGEPAGE).
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> mm/huge_memory.c | 2 --
> 1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bccd5a6..e8b656c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1941,8 +1941,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
> */
> if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> return -EINVAL;
> - if (mm->def_flags & VM_NOHUGEPAGE)
> - return -EINVAL;
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> /*
> --
> 1.5.5.1
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On 01/22, Hugh Dickins wrote:
>
> On Wed, 22 Jan 2014, Oleg Nesterov wrote:
>
> > hugepage_madvise() checks "mm->def_flags & VM_NOHUGEPAGE" but
> > this can be never true, currently mm->def_flags can only have
> > VM_LOCKED.
>
> But line 1087 of arch/s390/mm/pgtable.c says
> mm->def_flags |= VM_NOHUGEPAGE;
> from 3eabaee998c787e7e1565574821652548f7fc003
> "KVM: s390: allow sie enablement for multi-threaded programs".
Argh. Thanks Hugh!
Another case when I forgot about /bin/grep. So the patch is wrong,
at least the changelog is certainly is. I am stupid.
But, perhaps, this all still can work? Looks like, s390 already
implements PR_SET_THP_DISABLE using the same idea, it would be
nice to avoid another hack.
Gerald, any chance we can revert 8e72033f2a489 "thp: make MADV_HUGEPAGE
check for mm->def_flags" ? The changelog says "In order to also prevent
MADV_HUGEPAGE on such an mm", is it really important? I mean, if the
application calls madvise(MADV_HUGEPAGE) it should probably know what
it does and, this can be useful after if PR_SET_THP_DISABLE or
KVM_S390_ENABLE_SIE. Of course I do not understand this code, perhaps
MADV_HUGEPAGE is simply impossible.
Another question, can't fork/exec preserve VM_NOHUGEPAGE in ->def_flags?
This is what PR_SET_THP_DISABLE wants.
I'll send you mbox with the previous discussion privately.
Thanks,
Oleg.
On 01/22, Alex Thorlton wrote:
>
> On Wed, Jan 22, 2014 at 08:43:27PM +0100, Oleg Nesterov wrote:
> > On 01/22, Oleg Nesterov wrote:
> > >
> > > On 01/22, Alex Thorlton wrote:
> > > > > + case PR_SET_THP_DISABLE:
> > > > > + case PR_GET_THP_DISABLE:
> > > > > + down_write(&me->mm->mmap_sem);
> > > > > + if (option == PR_SET_THP_DISABLE) {
> > > > > + if (arg2)
> > > > > + me->mm->def_flags |= VM_NOHUGEPAGE;
> > > > > + else
> > > > > + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > > > > + } else {
> > > > > + error = !!(me->mm->flags && VM_NOHUGEPAGE);
> > > >
> > > > Should be:
> > > >
> > > > error = !!(me->mm->def_flags && VM_NOHUGEPAGE);
> > >
> > > No, we need to return 1 if this bit is set ;)
> >
> > Damn, you are right of course, we need "&". I didn't notice "&&"
> > in the patch I sent and misunderstood your "&&" above ;) Sorry.
>
> Actually, I didn't catch that either! Looking at it, though, we
> definitely do want bitwise AND here, not logical.
>
> However, what I was originally referring to is: Shouldn't we be
> checking mm->***def_flags*** for the VM_NOHUGEPAGE bit, as opposed
> to mm->flags? i.e. I think we want this:
>
> error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
Damn, of course you are right. I misunderstood you twice.
But so far I'm afraid this idea can't work anyway, although lets wait
for reply from s390 maintainers.
Oleg.
On Thu, 23 Jan 2014 17:43:34 +0100
Oleg Nesterov <[email protected]> wrote:
> On 01/22, Hugh Dickins wrote:
> >
> > On Wed, 22 Jan 2014, Oleg Nesterov wrote:
> >
> > > hugepage_madvise() checks "mm->def_flags & VM_NOHUGEPAGE" but
> > > this can be never true, currently mm->def_flags can only have
> > > VM_LOCKED.
> >
> > But line 1087 of arch/s390/mm/pgtable.c says
> > mm->def_flags |= VM_NOHUGEPAGE;
> > from 3eabaee998c787e7e1565574821652548f7fc003
> > "KVM: s390: allow sie enablement for multi-threaded programs".
>
> Argh. Thanks Hugh!
>
> Another case when I forgot about /bin/grep. So the patch is wrong,
> at least the changelog is certainly is. I am stupid.
>
> But, perhaps, this all still can work? Looks like, s390 already
> implements PR_SET_THP_DISABLE using the same idea, it would be
> nice to avoid another hack.
>
> Gerald, any chance we can revert 8e72033f2a489 "thp: make
> MADV_HUGEPAGE check for mm->def_flags" ? The changelog says "In order
> to also prevent MADV_HUGEPAGE on such an mm", is it really important?
> I mean, if the application calls madvise(MADV_HUGEPAGE) it should
> probably know what it does and, this can be useful after if
> PR_SET_THP_DISABLE or KVM_S390_ENABLE_SIE. Of course I do not
> understand this code, perhaps MADV_HUGEPAGE is simply impossible.
Yes, after discussion with Martin, I think that commit 8e72033f2a489 can
be reverted if we add a small add-on patch to the s390 gmap code instead,
like this:
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 3584ed9..a87cdb4 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
if (!pmd_present(*pmd) &&
__pte_alloc(mm, vma, pmd, vmaddr))
return -ENOMEM;
+ /* large pmds cannot yet be handled */
+ if (pmd_large(*pmd))
+ return -EFAULT;
/* pmd now points to a valid segment table entry. */
rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
if (!rmap)