On 02/26, Christian Borntraeger wrote:
>
> On 26/02/14 00:53, [email protected] wrote:
> > Subject: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
> > To: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]
> > From: [email protected]
> > Date: Tue, 25 Feb 2014 15:53:13 -0800
> >
> >
> > The patch titled
> > Subject: mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
> > has been added to the -mm tree. Its filename is
> > mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
> >
> > This patch should soon appear at
> > http://ozlabs.org/~akpm/mmots/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
> > and later at
> > http://ozlabs.org/~akpm/mmotm/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
>
>
> NAK.
>
> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages.
> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this
> breaks any recent kvm guest on s390.
Well, I can't really discuss the changes in arch/s390.
But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
Otherwise I'd suggest the change below.
Oleg.
--- x/mm/huge_memory.c
+++ x/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
- struct mm_struct *mm = vma->vm_mm;
-
switch (advice) {
case MADV_HUGEPAGE:
/*
@@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
*/
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
- if (mm->def_flags & VM_NOHUGEPAGE)
+
+/*
+ * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
+ * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
+ */
+#ifdef CONFIG_S390
+ if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
return -EINVAL;
+#endif
+
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
On 26/02/14 15:50, Oleg Nesterov wrote:
[...]
>> NAK.
>>
>> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages.
>> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this
>> breaks any recent kvm guest on s390.
>
> Well, I can't really discuss the changes in arch/s390.
>
> But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> Otherwise I'd suggest the change below.
>
> Oleg.
>
>
> --- x/mm/huge_memory.c
> +++ x/mm/huge_memory.c
> @@ -1968,8 +1968,6 @@ out:
> int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice)
> {
> - struct mm_struct *mm = vma->vm_mm;
> -
> switch (advice) {
> case MADV_HUGEPAGE:
> /*
> @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
> */
> if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> return -EINVAL;
> - if (mm->def_flags & VM_NOHUGEPAGE)
> +
> +/*
> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> + */
> +#ifdef CONFIG_S390
> + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
> return -EINVAL;
> +#endif
> +
Ifdefs are ugly but might be the only way of not breaking existing userspace.
If we come up with a solution for THP in KVM host processes on s390, we can
then remove that wart. We could even limit that hack to KVM only processes
to retain Alex' prctl capability by checking mm_has_pgste (defined in
arch/s390/include/asm/pgtable.h should be included via linux/mm.h)
> +
> +/*
> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> + */
> +#ifdef CONFIG_S390
> + if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm))
> return -EINVAL;
> +#endif
> +
Christian Borntraeger wrote:
> On 26/02/14 15:50, Oleg Nesterov wrote:
> [...]
>
> >> NAK.
> >>
> >> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages.
> >> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this
> >> breaks any recent kvm guest on s390.
> >
> > Well, I can't really discuss the changes in arch/s390.
> >
> > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> > Otherwise I'd suggest the change below.
> >
> > Oleg.
> >
> >
> > --- x/mm/huge_memory.c
> > +++ x/mm/huge_memory.c
> > @@ -1968,8 +1968,6 @@ out:
> > int hugepage_madvise(struct vm_area_struct *vma,
> > unsigned long *vm_flags, int advice)
> > {
> > - struct mm_struct *mm = vma->vm_mm;
> > -
> > switch (advice) {
> > case MADV_HUGEPAGE:
> > /*
> > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
> > */
> > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> > return -EINVAL;
> > - if (mm->def_flags & VM_NOHUGEPAGE)
> > +
> > +/*
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
> > return -EINVAL;
> > +#endif
> > +
>
> Ifdefs are ugly
IS_ENABLED(CONFIG_S390) should help with this.
--
Kirill A. Shutemov
On 02/26, Christian Borntraeger wrote:
>
> On 26/02/14 15:50, Oleg Nesterov wrote:
> >
> > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> > Otherwise I'd suggest the change below.
> >
> > Oleg.
> >
> >
> > --- x/mm/huge_memory.c
> > +++ x/mm/huge_memory.c
> > @@ -1968,8 +1968,6 @@ out:
> > int hugepage_madvise(struct vm_area_struct *vma,
> > unsigned long *vm_flags, int advice)
> > {
> > - struct mm_struct *mm = vma->vm_mm;
> > -
> > switch (advice) {
> > case MADV_HUGEPAGE:
> > /*
> > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
> > */
> > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> > return -EINVAL;
> > - if (mm->def_flags & VM_NOHUGEPAGE)
> > +
> > +/*
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
> > return -EINVAL;
> > +#endif
> > +
>
> Ifdefs are ugly but might be the only way of not breaking existing userspace.
Yes, agreed. but note that this code is already s390-specific, nobody
else puts VM_NOHUGEPAGE into ->def_flags (until this series of course).
> If we come up with a solution for THP in KVM host processes on s390, we can
> then remove that wart. We could even limit that hack to KVM only processes
> to retain Alex' prctl capability by checking mm_has_pgste
Yes, yes, I looked at mm->context.has_pgste too ;) But I wasn't sure
we can rely on it. Thanks.
> > +/*
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > + if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm))
> > return -EINVAL;
Hmm... but why do we need to check VM_NOHUGEPAGE then? Can't the
change below work?
Oleg.
--- x/arch/s390/mm/pgtable.c
+++ x/arch/s390/mm/pgtable.c
@@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m
vma->vm_flags &= ~VM_HUGEPAGE;
vma->vm_flags |= VM_NOHUGEPAGE;
}
- mm->def_flags |= VM_NOHUGEPAGE;
}
#else
static inline void thp_split_mm(struct mm_struct *mm)
--- x/mm/huge_memory.c
+++ x/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
- struct mm_struct *mm = vma->vm_mm;
-
switch (advice) {
case MADV_HUGEPAGE:
/*
@@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
*/
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
- if (mm->def_flags & VM_NOHUGEPAGE)
+
+#ifdef CONFIG_S390
+ if (mm_has_pgste(vma->vm_mm))
return -EINVAL;
+#endif
+
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
On Wed, 26 Feb 2014 16:31:44 +0100
Oleg Nesterov <[email protected]> wrote:
> On 02/26, Christian Borntraeger wrote:
> >
> > On 26/02/14 15:50, Oleg Nesterov wrote:
> > >
> > > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> > > Otherwise I'd suggest the change below.
> > >
> > > Oleg.
> > >
> > >
> > > --- x/mm/huge_memory.c
> > > +++ x/mm/huge_memory.c
> > > @@ -1968,8 +1968,6 @@ out:
> > > int hugepage_madvise(struct vm_area_struct *vma,
> > > unsigned long *vm_flags, int advice)
> > > {
> > > - struct mm_struct *mm = vma->vm_mm;
> > > -
> > > switch (advice) {
> > > case MADV_HUGEPAGE:
> > > /*
> > > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
> > > */
> > > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> > > return -EINVAL;
> > > - if (mm->def_flags & VM_NOHUGEPAGE)
> > > +
> > > +/*
> > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > > + */
> > > +#ifdef CONFIG_S390
> > > + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
> > > return -EINVAL;
> > > +#endif
> > > +
> >
> > Ifdefs are ugly but might be the only way of not breaking existing userspace.
>
> Yes, agreed. but note that this code is already s390-specific, nobody
> else puts VM_NOHUGEPAGE into ->def_flags (until this series of course).
>
> > If we come up with a solution for THP in KVM host processes on s390, we can
> > then remove that wart. We could even limit that hack to KVM only processes
> > to retain Alex' prctl capability by checking mm_has_pgste
>
> Yes, yes, I looked at mm->context.has_pgste too ;) But I wasn't sure
> we can rely on it. Thanks.
>
> > > +/*
> > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > > + */
> > > +#ifdef CONFIG_S390
> > > + if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm))
> > > return -EINVAL;
>
> Hmm... but why do we need to check VM_NOHUGEPAGE then? Can't the
> change below work?
Yes, good point, that should do the trick.
>
> Oleg.
>
> --- x/arch/s390/mm/pgtable.c
> +++ x/arch/s390/mm/pgtable.c
> @@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m
> vma->vm_flags &= ~VM_HUGEPAGE;
> vma->vm_flags |= VM_NOHUGEPAGE;
> }
> - mm->def_flags |= VM_NOHUGEPAGE;
> }
> #else
> static inline void thp_split_mm(struct mm_struct *mm)
> --- x/mm/huge_memory.c
> +++ x/mm/huge_memory.c
> @@ -1968,8 +1968,6 @@ out:
> int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice)
> {
> - struct mm_struct *mm = vma->vm_mm;
> -
> switch (advice) {
> case MADV_HUGEPAGE:
> /*
> @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
> */
> if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> return -EINVAL;
> - if (mm->def_flags & VM_NOHUGEPAGE)
> +
> +#ifdef CONFIG_S390
> + if (mm_has_pgste(vma->vm_mm))
> return -EINVAL;
> +#endif
> +
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> /*
>
On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> --- x/arch/s390/mm/pgtable.c
> +++ x/arch/s390/mm/pgtable.c
> @@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m
> vma->vm_flags &= ~VM_HUGEPAGE;
> vma->vm_flags |= VM_NOHUGEPAGE;
> }
> - mm->def_flags |= VM_NOHUGEPAGE;
> }
> #else
> static inline void thp_split_mm(struct mm_struct *mm)
> --- x/mm/huge_memory.c
> +++ x/mm/huge_memory.c
> @@ -1968,8 +1968,6 @@ out:
> int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice)
> {
> - struct mm_struct *mm = vma->vm_mm;
> -
> switch (advice) {
> case MADV_HUGEPAGE:
> /*
> @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
> */
> if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> return -EINVAL;
> - if (mm->def_flags & VM_NOHUGEPAGE)
> +
> +#ifdef CONFIG_S390
Do we want a comment here, explaining why s390 is special again?
> + if (mm_has_pgste(vma->vm_mm))
> return -EINVAL;
> +#endif
> +
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> /*
>
On Wed, Feb 26, 2014 at 05:57:59PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> Do we want a comment here, explaining why s390 is special again?
Here's what I've got, with everybody's suggestions spun together:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..7f01491 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
- struct mm_struct *mm = vma->vm_mm;
-
switch (advice) {
case MADV_HUGEPAGE:
/*
@@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_struct *vma,
*/
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
- if (mm->def_flags & VM_NOHUGEPAGE)
+
+/*
+ * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
+ * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
+ */
+#ifdef CONFIG_S390
+ if (mm_has_pgste(vma->vm_mm))
return -EINVAL;
+#endif
+
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
I've compiled and tested and it works ok on my machines (I don't have an
s390 to test on though :). Is everybody okay with this solution?
BTW, Kirill, I looked at using IS_ENABLED to clean up the ifdef, but it
won't work here, since mm_has_pgste isn't defined unless CONFIG_S390
is defined. i.e.: This line:
if (IS_ENABLED(CONFIG_S390) && mm_has_pgste(vma->vm_mm))
won't compile.
On 02/26, Alex Thorlton wrote:
>
> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> + */
> +#ifdef CONFIG_S390
> + if (mm_has_pgste(vma->vm_mm))
> return -EINVAL;
> +#endif
The comment is not really right...
And personally I think that
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
if (!pmd_present(*pmd) &&
__pte_alloc(mm, vma, pmd, vmaddr))
return -ENOMEM;
+ /* large pmds cannot yet be handled */
+ if (pmd_large(*pmd))
+ return -EFAULT;
change still makes sense, so that we can simply revert this s390-
specific hack in hugepage_madvise().
I'd suggest the patch below on top of your changes, but I won't argue.
It would be nice to also change thp_split_mm() to not not play with
mm->def_flags, but I am not sure if we can do this.
Oleg.
---
Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
Reported-by: Christian Borntraeger <[email protected]>
Suggested-by: Christian Borntraeger <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4310a5..0e08d92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
{
switch (advice) {
case MADV_HUGEPAGE:
+#ifdef CONFIG_S390
+ /*
+ * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
+ * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
+ * and expects it must fail on s390. Avoid a possible SIGSEGV
+ * until qemu is changed.
+ */
+ if (mm_has_pgste(vma->vm_mm))
+ return -EINVAL;
+#endif
/*
* Be somewhat over-protective like KSM for now!
*/
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
+
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
On 02/26, Peter Zijlstra wrote:
>
> On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> > /*
> > @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
> > */
> > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> > return -EINVAL;
> > - if (mm->def_flags & VM_NOHUGEPAGE)
> > +
> > +#ifdef CONFIG_S390
>
> Do we want a comment here, explaining why s390 is special again?
Yes, yes, sure. Especially because this is (hopefully) the temporary
hack.
Oleg.
On Wed, 26 Feb 2014 19:06:03 +0100
Oleg Nesterov <[email protected]> wrote:
> On 02/26, Alex Thorlton wrote:
> >
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > + if (mm_has_pgste(vma->vm_mm))
> > return -EINVAL;
> > +#endif
>
> The comment is not really right...
>
> And personally I think that
>
> @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> if (!pmd_present(*pmd) &&
> __pte_alloc(mm, vma, pmd, vmaddr))
> return -ENOMEM;
> + /* large pmds cannot yet be handled */
> + if (pmd_large(*pmd))
> + return -EFAULT;
>
> change still makes sense, so that we can simply revert this s390-
> specific hack in hugepage_madvise().
Yes, agreed.
>
> I'd suggest the patch below on top of your changes, but I won't argue.
>
> It would be nice to also change thp_split_mm() to not not play with
> mm->def_flags, but I am not sure if we can do this.
Hmm, I'm also wondering about this. Basically, we only need VM_NOHUGEPAGE
in vma->vm_flags, which is done for all existing vmas in thp_split_mm().
But if there should be new vmas created afterwards, it would still be
necessary to also have VM_NOHUGEPAGE in mm->def_flags, because the
vm_flags for new vmas will be set via OR of mm->def_flags, e.g. in
do_brk() and do_mmap_pgoff().
I guess the question is if new vmas can be created for the qemu/kvm host
process?
Anyway, this would then have to be a separate patch, to keep the
"revertability" of this hack.
>
> Oleg.
> ---
>
> Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
>
> As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
>
> Reported-by: Christian Borntraeger <[email protected]>
> Suggested-by: Christian Borntraeger <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a4310a5..0e08d92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
> {
> switch (advice) {
> case MADV_HUGEPAGE:
> +#ifdef CONFIG_S390
> + /*
> + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> + * and expects it must fail on s390. Avoid a possible SIGSEGV
> + * until qemu is changed.
> + */
> + if (mm_has_pgste(vma->vm_mm))
> + return -EINVAL;
> +#endif
> /*
> * Be somewhat over-protective like KSM for now!
> */
> if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> return -EINVAL;
> +
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> /*
>
On 26/02/14 19:06, Oleg Nesterov wrote:
> On 02/26, Alex Thorlton wrote:
>>
>> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
>> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
>> + */
>> +#ifdef CONFIG_S390
>> + if (mm_has_pgste(vma->vm_mm))
>> return -EINVAL;
>> +#endif
>
> The comment is not really right...
>
> And personally I think that
>
> @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> if (!pmd_present(*pmd) &&
> __pte_alloc(mm, vma, pmd, vmaddr))
> return -ENOMEM;
> + /* large pmds cannot yet be handled */
> + if (pmd_large(*pmd))
> + return -EFAULT;
>
> change still makes sense, so that we can simply revert this s390-
> specific hack in hugepage_madvise().
Yes, it still makes sense to cover existing THPs here.
> I'd suggest the patch below on top of your changes, but I won't argue.
>
> It would be nice to also change thp_split_mm() to not not play with
> mm->def_flags, but I am not sure if we can do this.
>
> Oleg.
> ---
>
> Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
>
> As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
>
> Reported-by: Christian Borntraeger <[email protected]>
> Suggested-by: Christian Borntraeger <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a4310a5..0e08d92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
> {
> switch (advice) {
> case MADV_HUGEPAGE:
> +#ifdef CONFIG_S390
> + /*
> + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> + * and expects it must fail on s390. Avoid a possible SIGSEGV
> + * until qemu is changed.
I prefer:
* until kvm/s390 can handle large pages in the host.
Otherwise qemu has to be changed again, if we get THP working for kvm.
> + */
> + if (mm_has_pgste(vma->vm_mm))
> + return -EINVAL;
> +#endif
> /*
> * Be somewhat over-protective like KSM for now!
> */
> if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> return -EINVAL;
> +
Unrelated white space?
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> /*
>
With the comment and white space change:
Tested-by: Christian Borntraeger <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Thanks for the quick patch
On Wed, Feb 26, 2014 at 08:27:36PM +0100, Christian Borntraeger wrote:
> On 26/02/14 19:06, Oleg Nesterov wrote:
> > On 02/26, Alex Thorlton wrote:
> >>
> >> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> >> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> >> + */
> >> +#ifdef CONFIG_S390
> >> + if (mm_has_pgste(vma->vm_mm))
> >> return -EINVAL;
> >> +#endif
> >
> > The comment is not really right...
> >
> > And personally I think that
> >
> > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> > if (!pmd_present(*pmd) &&
> > __pte_alloc(mm, vma, pmd, vmaddr))
> > return -ENOMEM;
> > + /* large pmds cannot yet be handled */
> > + if (pmd_large(*pmd))
> > + return -EFAULT;
> >
> > change still makes sense, so that we can simply revert this s390-
> > specific hack in hugepage_madvise().
>
> Yes, it still makes sense to cover existing THPs here.
Yes, it does. I just snipped the chunk of the original patch that I
actually changed in my last e-mail. I didn't intend to actually remove
the above portion in the final patch - sorry for the confusion!
>
>
> > I'd suggest the patch below on top of your changes, but I won't argue.
I like this approach, and your updated comment as well. This should
probably go in as [PATCH 2/4] in my series. Do I need to spin a v5
with this new patch included in the series, or will Andrew pull this in?
> >
> > It would be nice to also change thp_split_mm() to not not play with
> > mm->def_flags, but I am not sure if we can do this.
> >
> > Oleg.
> > ---
> >
> > Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
> >
> > As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> > check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> > all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
> >
> > Reported-by: Christian Borntraeger <[email protected]>
> > Suggested-by: Christian Borntraeger <[email protected]>
> > Signed-off-by: Oleg Nesterov <[email protected]>
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a4310a5..0e08d92 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
> > {
> > switch (advice) {
> > case MADV_HUGEPAGE:
> > +#ifdef CONFIG_S390
> > + /*
> > + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> > + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> > + * and expects it must fail on s390. Avoid a possible SIGSEGV
> > + * until qemu is changed.
>
> I prefer:
> * until kvm/s390 can handle large pages in the host.
>
> Otherwise qemu has to be changed again, if we get THP working for kvm.
>
> > + */
> > + if (mm_has_pgste(vma->vm_mm))
> > + return -EINVAL;
> > +#endif
> > /*
> > * Be somewhat over-protective like KSM for now!
> > */
> > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> > return -EINVAL;
> > +
>
> Unrelated white space?
> > *vm_flags &= ~VM_NOHUGEPAGE;
> > *vm_flags |= VM_HUGEPAGE;
> > /*
> >
>
>
>
> With the comment and white space change:
>
> Tested-by: Christian Borntraeger <[email protected]>
> Reviewed-by: Christian Borntraeger <[email protected]>
>
> Thanks for the quick patch
Yes, thank you all for the quick responses, and for looking over these
patches!
> +#ifdef CONFIG_S390
> + /*
> + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> + * and expects it must fail on s390. Avoid a possible SIGSEGV
> + * until qemu is changed.
> + */
> + if (mm_has_pgste(vma->vm_mm))
> + return -EINVAL;
> +#endif
The comment is not quite true. QEMU doesn't expect that the madvise fails.
It simply expects that the madvise doesn't cause SIGSEGVs or later
malfunctioning, because (quoting tha man page) madvise "does not influence
the semantics of the application".
There's nothing to fix in QEMU, in fact I believe this should return 0
rather than -EINVAL. It's perfectly legal for the kernel to ignore an madvise,
and this is what should happen here.
Paolo
On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <[email protected]> wrote:
> >
> >
> > > I'd suggest the patch below on top of your changes, but I won't argue.
>
> I like this approach, and your updated comment as well. This should
> probably go in as [PATCH 2/4] in my series. Do I need to spin a v5
> with this new patch included in the series, or will Andrew pull this in?
Paolo's comments on Oleg's patch remain unaddressed, so please take a
look at that and send out the final version?
An incremental patch would be preferred, please. I'll later fold that
into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to
avoid any bisection holes.
On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <[email protected]> wrote:
>
> > >
> > >
> > > > I'd suggest the patch below on top of your changes, but I won't argue.
> >
> > I like this approach, and your updated comment as well. This should
> > probably go in as [PATCH 2/4] in my series. Do I need to spin a v5
> > with this new patch included in the series, or will Andrew pull this in?
>
> Paolo's comments on Oleg's patch remain unaddressed, so please take a
> look at that and send out the final version?
Got it. I'll get that to you tonight/tomorrow morning.
> An incremental patch would be preferred, please. I'll later fold that
> into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to
> avoid any bisection holes.
Understood. I'll keep them separate.
Thanks, Andrew!
- Alex
On 02/26, Paolo Bonzini wrote:
>
> > +#ifdef CONFIG_S390
> > + /*
> > + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> > + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> > + * and expects it must fail on s390. Avoid a possible SIGSEGV
> > + * until qemu is changed.
> > + */
> > + if (mm_has_pgste(vma->vm_mm))
> > + return -EINVAL;
> > +#endif
>
> The comment is not quite true. QEMU doesn't expect that the madvise fails.
Yes, sorry. I didn't mean "it expects -EINVAL".
> It simply expects that the madvise doesn't cause SIGSEGVs or later
> malfunctioning, because (quoting tha man page) madvise "does not influence
> the semantics of the application".
Yes, I understand. But currently this means "MADV_HUGEPAGE should not
actually work", this is what I tried to say.
> There's nothing to fix in QEMU,
I was going to argue, but this is probably true too.
In short: I agree with any comment ;)
Oleg.
On 02/26, Gerald Schaefer wrote:
>
> On Wed, 26 Feb 2014 19:06:03 +0100
> Oleg Nesterov <[email protected]> wrote:
>
> > It would be nice to also change thp_split_mm() to not not play with
> > mm->def_flags, but I am not sure if we can do this.
>
> Hmm, I'm also wondering about this. Basically, we only need VM_NOHUGEPAGE
> in vma->vm_flags, which is done for all existing vmas in thp_split_mm().
> But if there should be new vmas created afterwards, it would still be
> necessary to also have VM_NOHUGEPAGE in mm->def_flags, because the
> vm_flags for new vmas will be set via OR of mm->def_flags, e.g. in
> do_brk() and do_mmap_pgoff().
Yes, exactly, this was my concern.
And while I know nothing about s390, it seems to me that huge pages should
be forbidden for any vma if ->has_pgste was set.
> Anyway, this would then have to be a separate patch, to keep the
> "revertability" of this hack.
Agreed. Thanks!
Oleg.
On Wed, Feb 26, 2014 at 06:01:53PM -0600, Alex Thorlton wrote:
> On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote:
> > On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <[email protected]> wrote:
> >
> > > >
> > > >
> > > > > I'd suggest the patch below on top of your changes, but I won't argue.
> > >
> > > I like this approach, and your updated comment as well. This should
> > > probably go in as [PATCH 2/4] in my series. Do I need to spin a v5
> > > with this new patch included in the series, or will Andrew pull this in?
> >
> > Paolo's comments on Oleg's patch remain unaddressed, so please take a
> > look at that and send out the final version?
>
> Got it. I'll get that to you tonight/tomorrow morning.
Just kicked out another version of the patch that should cover all
comments/suggestions. Let me know if you need anything else from me!
- Alex