2017-06-02 15:03:53

by Mike Rapoport

[permalink] [raw]
Subject: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

From: Michal Hocko <[email protected]>

PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
existing mapping because it only updated mm->def_flags which is a template
for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
VM_NOHUGEPAGE flag set. This can be quite surprising for all those
applications which do not do prctl(); fork() & exec() and want to control
their own THP behavior.

Another usecase when the immediate semantic of the prctl might be useful is
a combination of pre- and post-copy migration of containers with CRIU. In
this case CRIU populates a part of a memory region with data that was saved
during the pre-copy stage. Afterwards, the region is registered with
userfaultfd and CRIU expects to get page faults for the parts of the region
that were not yet populated. However, khugepaged collapses the pages and
the expected page faults do not occur.

In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.

Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
tested when decision whether to use huge pages is taken either during page
fault of at the time of THP collapse.

It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case previously.

Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
include/linux/huge_mm.h | 1 +
include/linux/khugepaged.h | 3 ++-
include/linux/sched/coredump.h | 5 ++++-
kernel/sys.c | 6 +++---
mm/khugepaged.c | 3 ++-
mm/shmem.c | 8 +++++---
6 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d4..9da053c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,6 +92,7 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
((__vma)->vm_flags & VM_HUGEPAGE))) && \
!((__vma)->vm_flags & VM_NOHUGEPAGE) && \
+ !test_bit(MMF_DISABLE_THP, &(__vma)->vm_mm->flags) && \
!is_vma_temporary_stack(__vma))
#define transparent_hugepage_use_zero_page() \
(transparent_hugepage_flags & \
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 5d9a400..f0d7335 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -48,7 +48,8 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
if ((khugepaged_always() ||
(khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
- !(vm_flags & VM_NOHUGEPAGE))
+ !(vm_flags & VM_NOHUGEPAGE) &&
+ !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
if (__khugepaged_enter(vma->vm_mm))
return -ENOMEM;
return 0;
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 69eedce..98ae0d0 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -68,7 +68,10 @@ static inline int get_dumpable(struct mm_struct *mm)
#define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */
#define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */
#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */
+#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */
+#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)

-#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
+#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
+ MMF_DISABLE_THP_MASK)

#endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 8a94b4e..e48f063 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2266,7 +2266,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_THP_DISABLE:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
- error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
+ error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
break;
case PR_SET_THP_DISABLE:
if (arg3 || arg4 || arg5)
@@ -2274,9 +2274,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (down_write_killable(&me->mm->mmap_sem))
return -EINTR;
if (arg2)
- me->mm->def_flags |= VM_NOHUGEPAGE;
+ set_bit(MMF_DISABLE_THP, &me->mm->flags);
else
- me->mm->def_flags &= ~VM_NOHUGEPAGE;
+ clear_bit(MMF_DISABLE_THP, &me->mm->flags);
up_write(&me->mm->mmap_sem);
break;
case PR_MPX_ENABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 32c66e7..b444b9b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -824,7 +824,8 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
static bool hugepage_vma_check(struct vm_area_struct *vma)
{
if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
- (vma->vm_flags & VM_NOHUGEPAGE))
+ (vma->vm_flags & VM_NOHUGEPAGE) ||
+ test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
return false;
if (shmem_file(vma->vm_file)) {
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba..33b908b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1977,10 +1977,12 @@ static int shmem_fault(struct vm_fault *vmf)
}

sgp = SGP_CACHE;
- if (vma->vm_flags & VM_HUGEPAGE)
- sgp = SGP_HUGE;
- else if (vma->vm_flags & VM_NOHUGEPAGE)
+
+ if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+ test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
sgp = SGP_NOHUGE;
+ else if (vma->vm_flags & VM_HUGEPAGE)
+ sgp = SGP_HUGE;

error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
gfp, vma, vmf, &ret);
--
2.7.4


2017-06-02 19:51:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On Fri, 2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <[email protected]> wrote:

> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
> existing mapping because it only updated mm->def_flags which is a template
> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
> VM_NOHUGEPAGE flag set. This can be quite surprising for all those
> applications which do not do prctl(); fork() & exec() and want to control
> their own THP behavior.
>
> Another usecase when the immediate semantic of the prctl might be useful is
> a combination of pre- and post-copy migration of containers with CRIU. In
> this case CRIU populates a part of a memory region with data that was saved
> during the pre-copy stage. Afterwards, the region is registered with
> userfaultfd and CRIU expects to get page faults for the parts of the region
> that were not yet populated. However, khugepaged collapses the pages and
> the expected page faults do not occur.
>
> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
> temporary mechanism for enabling/disabling THP process wide.
>
> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
> tested when decision whether to use huge pages is taken either during page
> fault of at the time of THP collapse.
>
> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
> master override to any per-VMA setting, which was not the case previously.
>
> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")

"Fixes" is a bit strong. I'd say "alters". And significantly altering
the runtime behaviour of a three-year-old interface is rather a worry,
no?

Perhaps we should be adding new prctl modes to select this new
behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?

2017-06-02 20:32:49

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On 06/02/2017 09:50 PM, Andrew Morton wrote:
> On Fri, 2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <[email protected]> wrote:
>
>> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
>> existing mapping because it only updated mm->def_flags which is a template
>> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
>> VM_NOHUGEPAGE flag set. This can be quite surprising for all those
>> applications which do not do prctl(); fork() & exec() and want to control
>> their own THP behavior.
>>
>> Another usecase when the immediate semantic of the prctl might be useful is
>> a combination of pre- and post-copy migration of containers with CRIU. In
>> this case CRIU populates a part of a memory region with data that was saved
>> during the pre-copy stage. Afterwards, the region is registered with
>> userfaultfd and CRIU expects to get page faults for the parts of the region
>> that were not yet populated. However, khugepaged collapses the pages and
>> the expected page faults do not occur.
>>
>> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
>> temporary mechanism for enabling/disabling THP process wide.
>>
>> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
>> tested when decision whether to use huge pages is taken either during page
>> fault of at the time of THP collapse.
>>
>> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
>> master override to any per-VMA setting, which was not the case previously.
>>
>> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
>
> "Fixes" is a bit strong. I'd say "alters". And significantly altering
> the runtime behaviour of a three-year-old interface is rather a worry,
> no?
>
> Perhaps we should be adding new prctl modes to select this new
> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?

I think we can reasonably assume that most users of the prctl do just
the fork() & exec() thing, so they will be unaffected. And as usual, if
somebody does complain in the end, we revert and try the other way?

2017-06-02 20:40:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <[email protected]> wrote:

> On 06/02/2017 09:50 PM, Andrew Morton wrote:
> > On Fri, 2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <[email protected]> wrote:
> >
> >> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
> >> existing mapping because it only updated mm->def_flags which is a template
> >> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
> >> VM_NOHUGEPAGE flag set. This can be quite surprising for all those
> >> applications which do not do prctl(); fork() & exec() and want to control
> >> their own THP behavior.
> >>
> >> Another usecase when the immediate semantic of the prctl might be useful is
> >> a combination of pre- and post-copy migration of containers with CRIU. In
> >> this case CRIU populates a part of a memory region with data that was saved
> >> during the pre-copy stage. Afterwards, the region is registered with
> >> userfaultfd and CRIU expects to get page faults for the parts of the region
> >> that were not yet populated. However, khugepaged collapses the pages and
> >> the expected page faults do not occur.
> >>
> >> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
> >> temporary mechanism for enabling/disabling THP process wide.
> >>
> >> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
> >> tested when decision whether to use huge pages is taken either during page
> >> fault of at the time of THP collapse.
> >>
> >> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
> >> master override to any per-VMA setting, which was not the case previously.
> >>
> >> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
> >
> > "Fixes" is a bit strong. I'd say "alters". And significantly altering
> > the runtime behaviour of a three-year-old interface is rather a worry,
> > no?
> >
> > Perhaps we should be adding new prctl modes to select this new
> > behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
>
> I think we can reasonably assume that most users of the prctl do just
> the fork() & exec() thing, so they will be unaffected.

That sounds optimistic. Perhaps people are using the current behaviour
to set on particular mapping to MMF_DISABLE_THP, with

prctl(PR_SET_THP_DISABLE)
mmap()
prctl(PR_CLR_THP_DISABLE)

?

Seems a reasonable thing to do. But who knows - people do all sorts of
inventive things.

> And as usual, if
> somebody does complain in the end, we revert and try the other way?

But by then it's too late - the new behaviour will be out in the field.

2017-06-02 20:55:52

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On 06/02/2017 10:40 PM, Andrew Morton wrote:
> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <[email protected]> wrote:
>>> Perhaps we should be adding new prctl modes to select this new
>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
>>
>> I think we can reasonably assume that most users of the prctl do just
>> the fork() & exec() thing, so they will be unaffected.
>
> That sounds optimistic. Perhaps people are using the current behaviour
> to set on particular mapping to MMF_DISABLE_THP, with
>
> prctl(PR_SET_THP_DISABLE)
> mmap()
> prctl(PR_CLR_THP_DISABLE)
>
> ?
>
> Seems a reasonable thing to do.

Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
effect. And it's older (2.6.38).

> But who knows - people do all sorts of
> inventive things.

Yeah :( but we can hope they don't even know that the prctl currently
behaves they way it does - man page doesn't suggest it would, and most
of us in this thread found it surprising.

>> And as usual, if
>> somebody does complain in the end, we revert and try the other way?
>
> But by then it's too late - the new behaviour will be out in the field.

Revert in stable then?
But I don't think this patch should go to stable. I understand right
that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
prctl change/new madvise anymore?

2017-06-02 21:10:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On Fri, 2 Jun 2017 22:55:12 +0200 Vlastimil Babka <[email protected]> wrote:

> On 06/02/2017 10:40 PM, Andrew Morton wrote:
> > On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <[email protected]> wrote:
> >>> Perhaps we should be adding new prctl modes to select this new
> >>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
> >>
> >> I think we can reasonably assume that most users of the prctl do just
> >> the fork() & exec() thing, so they will be unaffected.
> >
> > That sounds optimistic. Perhaps people are using the current behaviour
> > to set on particular mapping to MMF_DISABLE_THP, with
> >
> > prctl(PR_SET_THP_DISABLE)
> > mmap()
> > prctl(PR_CLR_THP_DISABLE)
> >
> > ?
> >
> > Seems a reasonable thing to do.
>
> Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
> effect. And it's older (2.6.38).
>
> > But who knows - people do all sorts of
> > inventive things.
>
> Yeah :( but we can hope they don't even know that the prctl currently
> behaves they way it does - man page doesn't suggest it would, and most
> of us in this thread found it surprising.

Well. There might be such people and sometimes we do make people
unhappy. it partly depends on how traumatic it would be to leave the
current behaviour as-is. Have you evaluated such a patch?


> >> And as usual, if
> >> somebody does complain in the end, we revert and try the other way?
> >
> > But by then it's too late - the new behaviour will be out in the field.
>
> Revert in stable then?
> But I don't think this patch should go to stable. I understand right
> that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
> prctl change/new madvise anymore?

What I mean is that the new behaviour will go out in 4.12 and it may
be many months before we find out that we broke someone. By then, we
can't go back because others may be assuming the new behaviour.

2017-06-03 07:40:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On Fri 02-06-17 13:40:38, Andrew Morton wrote:
> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <[email protected]> wrote:
>
> > On 06/02/2017 09:50 PM, Andrew Morton wrote:
> > > On Fri, 2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <[email protected]> wrote:
> > >
> > >> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
> > >> existing mapping because it only updated mm->def_flags which is a template
> > >> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
> > >> VM_NOHUGEPAGE flag set. This can be quite surprising for all those
> > >> applications which do not do prctl(); fork() & exec() and want to control
> > >> their own THP behavior.
> > >>
> > >> Another usecase when the immediate semantic of the prctl might be useful is
> > >> a combination of pre- and post-copy migration of containers with CRIU. In
> > >> this case CRIU populates a part of a memory region with data that was saved
> > >> during the pre-copy stage. Afterwards, the region is registered with
> > >> userfaultfd and CRIU expects to get page faults for the parts of the region
> > >> that were not yet populated. However, khugepaged collapses the pages and
> > >> the expected page faults do not occur.
> > >>
> > >> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
> > >> temporary mechanism for enabling/disabling THP process wide.
> > >>
> > >> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
> > >> tested when decision whether to use huge pages is taken either during page
> > >> fault of at the time of THP collapse.
> > >>
> > >> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
> > >> master override to any per-VMA setting, which was not the case previously.
> > >>
> > >> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
> > >
> > > "Fixes" is a bit strong. I'd say "alters". And significantly altering
> > > the runtime behaviour of a three-year-old interface is rather a worry,
> > > no?
> > >
> > > Perhaps we should be adding new prctl modes to select this new
> > > behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
> >
> > I think we can reasonably assume that most users of the prctl do just
> > the fork() & exec() thing, so they will be unaffected.
>
> That sounds optimistic. Perhaps people are using the current behaviour
> to set on particular mapping to MMF_DISABLE_THP, with
>
> prctl(PR_SET_THP_DISABLE)
> mmap()
> prctl(PR_CLR_THP_DISABLE)
>
> ?
>
> Seems a reasonable thing to do.

Is it? The documentation is not very specific but it is clear about the
scope being thread (I would argue process would be more approapriate
but whatever) "Set the state of the "THP disable" flag for the calling
thread." So the above seems like an incorrect usage to me.

> But who knows - people do all sorts of inventive things.

well yes.

> > And as usual, if
> > somebody does complain in the end, we revert and try the other way?
>
> But by then it's too late - the new behaviour will be out in the field.

Well, the interface is currently broken for anything other than prctl
& exec. And those will work properly even with the patch. So I am not
really sure whether keeping the current status quo is reasonable.

--
Michal Hocko
SUSE Labs

2017-06-03 10:35:07

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active



On June 2, 2017 11:55:12 PM GMT+03:00, Vlastimil Babka <[email protected]> wrote:
>On 06/02/2017 10:40 PM, Andrew Morton wrote:
>> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <[email protected]>
>wrote:
>>>> Perhaps we should be adding new prctl modes to select this new
>>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour
>as-is?
>>>
>>> I think we can reasonably assume that most users of the prctl do
>just
>>> the fork() & exec() thing, so they will be unaffected.
>>
>> That sounds optimistic. Perhaps people are using the current
>behaviour
>> to set on particular mapping to MMF_DISABLE_THP, with
>>
>> prctl(PR_SET_THP_DISABLE)
>> mmap()
>> prctl(PR_CLR_THP_DISABLE)
>>
>> ?
>>
>> Seems a reasonable thing to do.
>
>Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
>effect. And it's older (2.6.38).
>
>> But who knows - people do all sorts of
>> inventive things.
>
>Yeah :( but we can hope they don't even know that the prctl currently
>behaves they way it does - man page doesn't suggest it would, and most
>of us in this thread found it surprising.
>
>>> And as usual, if
>>> somebody does complain in the end, we revert and try the other way?
>>
>> But by then it's too late - the new behaviour will be out in the
>field.
>
>Revert in stable then?
>But I don't think this patch should go to stable. I understand right
>that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
>prctl change/new madvise anymore?

Yes, we are going to use UFFDIO_COPY. We still might want to have control over THP in the future without changing per-VMA flags, though.

--
Sincerely yours,
Mike.

2017-06-03 10:40:33

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active



On June 2, 2017 10:50:59 PM GMT+03:00, Andrew Morton <[email protected]> wrote:
>On Fri, 2 Jun 2017 18:03:22 +0300 "Mike Rapoport"
><[email protected]> wrote:
>
>> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect
>any
>> existing mapping because it only updated mm->def_flags which is a
>template
>> for new mappings. The mappings created after
>prctl(PR_SET_THP_DISABLE) have
>> VM_NOHUGEPAGE flag set. This can be quite surprising for all those
>> applications which do not do prctl(); fork() & exec() and want to
>control
>> their own THP behavior.
>>
>> Another usecase when the immediate semantic of the prctl might be
>useful is
>> a combination of pre- and post-copy migration of containers with
>CRIU. In
>> this case CRIU populates a part of a memory region with data that was
>saved
>> during the pre-copy stage. Afterwards, the region is registered with
>> userfaultfd and CRIU expects to get page faults for the parts of the
>region
>> that were not yet populated. However, khugepaged collapses the pages
>and
>> the expected page faults do not occur.
>>
>> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as
>a
>> temporary mechanism for enabling/disabling THP process wide.
>>
>> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag
>is
>> tested when decision whether to use huge pages is taken either during
>page
>> fault of at the time of THP collapse.
>>
>> It should be noted, that the new implementation makes
>PR_SET_THP_DISABLE
>> master override to any per-VMA setting, which was not the case
>previously.
>>
>> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and
>PRCTL_THP_DISABLE")
>
>"Fixes" is a bit strong. I'd say "alters". And significantly altering
>the runtime behaviour of a three-year-old interface is rather a worry,
>no?

Well, there are people that consider current behavior as bug :)
One can argue we alter the implementation​details and users should not rely on that...

>Perhaps we should be adding new prctl modes to select this new
>behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?



--
Sincerely yours,
Mike.

2017-06-05 07:05:26

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On Sat, Jun 03, 2017 at 01:34:52PM +0300, Mike Rapoprt wrote:
>
>
> On June 2, 2017 11:55:12 PM GMT+03:00, Vlastimil Babka <[email protected]> wrote:
> >On 06/02/2017 10:40 PM, Andrew Morton wrote:
> >> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <[email protected]>
> >wrote:
> >>>> Perhaps we should be adding new prctl modes to select this new
> >>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour
> >as-is?
> >>>
> >>> I think we can reasonably assume that most users of the prctl do
> >just
> >>> the fork() & exec() thing, so they will be unaffected.
> >>
> >> That sounds optimistic. Perhaps people are using the current
> >behaviour
> >> to set on particular mapping to MMF_DISABLE_THP, with
> >>
> >> prctl(PR_SET_THP_DISABLE)
> >> mmap()
> >> prctl(PR_CLR_THP_DISABLE)
> >>
> >> ?
> >>
> >> Seems a reasonable thing to do.
> >
> >Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
> >effect. And it's older (2.6.38).
> >
> >> But who knows - people do all sorts of
> >> inventive things.
> >
> >Yeah :( but we can hope they don't even know that the prctl currently
> >behaves they way it does - man page doesn't suggest it would, and most
> >of us in this thread found it surprising.
> >
> >>> And as usual, if
> >>> somebody does complain in the end, we revert and try the other way?
> >>
> >> But by then it's too late - the new behaviour will be out in the
> >field.
> >
> >Revert in stable then?
> >But I don't think this patch should go to stable. I understand right
> >that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
> >prctl change/new madvise anymore?
>
> Yes, we are going to use UFFDIO_COPY. We still might want to have control
> over THP in the future without changing per-VMA flags, though.

Unfortunately, I was over optimistic about ability of CRIU to use
UFFDIO_COPY for pre-copy part :(
I was too concentrated on the simplified flow and overlooked some important
details. After I've spent some time trying to actually implement usage of
UFFDIO_COPY, I realized that registering memory with userfault at that
point of the restore flow quite contradicts CRIU architecture :(

That said, we would really want to have the interface this patch proposes.

--
Sincerely yours,
Mike.

2017-06-05 09:27:27

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active

On 06/02/2017 11:10 PM, Andrew Morton wrote:
> On Fri, 2 Jun 2017 22:55:12 +0200 Vlastimil Babka <[email protected]> wrote:
>
>> On 06/02/2017 10:40 PM, Andrew Morton wrote:
>>> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <[email protected]> wrote:
>>>>> Perhaps we should be adding new prctl modes to select this new
>>>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
>>>>
>>>> I think we can reasonably assume that most users of the prctl do just
>>>> the fork() & exec() thing, so they will be unaffected.
>>>
>>> That sounds optimistic. Perhaps people are using the current behaviour
>>> to set on particular mapping to MMF_DISABLE_THP, with
>>>
>>> prctl(PR_SET_THP_DISABLE)
>>> mmap()
>>> prctl(PR_CLR_THP_DISABLE)
>>>
>>> ?
>>>
>>> Seems a reasonable thing to do.
>>
>> Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
>> effect. And it's older (2.6.38).
>>
>>> But who knows - people do all sorts of
>>> inventive things.
>>
>> Yeah :( but we can hope they don't even know that the prctl currently
>> behaves they way it does - man page doesn't suggest it would, and most
>> of us in this thread found it surprising.
>
> Well. There might be such people and sometimes we do make people
> unhappy. it partly depends on how traumatic it would be to leave the
> current behaviour as-is. Have you evaluated such a patch?

You mean introducing a new prctl instead of changing the existing one? I
can evaluate that as being ugly :)
Well, maybe we could use arg3, because currently we have:
case PR_SET_THP_DISABLE:
if (arg3 || arg4 || arg5)
return -EINVAL;

We could make non-zero arg3 (or specific value of arg3) set the new
"immediate" behavior. This would also take care of the discovery of
kernels that support the fixed/altered behavior, without having to check
uname etc - just check if we got -EINVAL.

I'm just not sure how to implement PR_GET_THP_DISABLE properly in such
scenario. Or what happens when somebody calls SET with arg3==0 and then
arg3==1 (or vice versa). But we would have to think about it even when
we introduced a newly named option. Reminds me of the MLOCK_ONFAULT
discussions...

>>>> And as usual, if
>>>> somebody does complain in the end, we revert and try the other way?
>>>
>>> But by then it's too late - the new behaviour will be out in the field.
>>
>> Revert in stable then?
>> But I don't think this patch should go to stable. I understand right
>> that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
>> prctl change/new madvise anymore?
>
> What I mean is that the new behaviour will go out in 4.12 and it may
> be many months before we find out that we broke someone. By then, we
> can't go back because others may be assuming the new behaviour.
>