2014-01-31 18:23:58

by Alex Thorlton

Subject: [PATCHv3 0/3] Add mm flag to control THP

This patch set is based on some of my work combined with some
suggestions/patches given by Oleg Nesterov. The main goal here is to
add a prctl switch that allows us to disable THP on a per-mm_struct
basis.

Changes for v3:

* Pulled in Oleg's idea to use mm->def_flags and the VM_NOHUGEPAGE flag,
which will get copied down to each vma, instead of adding in a whole
new MMF_THP_DISABLE flag to mm->flags. This also creates a
VM_INIT_DEF_MASK which allows the VM_NOHUGEPAGE flag to get carried
down from def_flags.
- Main benefit of implementing the flag this way is that, if a
user specifically requests THP via madvise, that request can
still be respected in vmas where necessary; however, for all
other vmas we can have THP turned off.
- This also prevents us from having to check for a new flag in
multiple locations, since the VM_NOHUGEPAGE flag is already
respected wherever necessary.
* Made some adjustments to the way that the prctl call returns
information, and made sure to return -EINVAL when unnecessary arguments
are passed for PR_GET/SET_THP_DISABLE.
* Reverted/added some code for s390 arch that was needed to get the
VM_INIT_DEF_MASK idea working.
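
To illustrate the first point in the list above, here is a minimal
user-space sketch of a task opting a single mapping back in to THP. It
assumes the semantics proposed in this series (VM_NOHUGEPAGE inherited
from mm->def_flags can be overridden per vma by madvise); the helper
name map_with_thp_hint() is hypothetical and not part of the patches.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not part of the series): map `len` bytes of
 * anonymous memory and ask for THP on just that range. Returns the
 * mapping, or NULL if mmap() fails. *madvise_rc receives the return
 * value of madvise() so the caller can tell whether the hint was
 * accepted by the running kernel.
 */
static void *map_with_thp_hint(size_t len, int *madvise_rc)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Only this vma is affected; other vmas keep the mm default. */
	*madvise_rc = madvise(p, len, MADV_HUGEPAGE);
	return p;
}
```

Whether the hint is honored depends on the kernel configuration, so a
caller should treat a failing madvise() as non-fatal.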

The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified, and using a malloc hook with
madvise is not an option (e.g. statically allocated data). This patch
allows us to do just that, without affecting other jobs running on the
system.

We need to do this sort of thing for jobs where THP hurts performance,
due to the possibility of increased remote memory accesses that can be
created by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
be handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from. If many remote nodes need to do work on
that same chunk, they'll be making remote accesses.

With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the number of remote accesses to
memory.
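
The granularity difference can be put in numbers with a small sketch.
The helper name pages_to_cover() is hypothetical, and the page sizes are
the 4K and 2MB values mentioned above.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE (4ULL << 10)	/* 4K base page */
#define HUGE_PAGE_SIZE (2ULL << 20)	/* 2MB transparent huge page */

/* Number of pages of `page_size` bytes needed to back `bytes` bytes. */
static uint64_t pages_to_cover(uint64_t bytes, uint64_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}
```

pages_to_cover(HUGE_PAGE_SIZE, BASE_PAGE_SIZE) is 512, so one 2MB chunk
that would be placed on a single node as a THP can instead be placed
page by page in up to 512 separate decisions. Assuming "-b 256g" in the
runs below means a 256 GiB buffer, covering it takes 67,108,864 4K pages
versus 131,072 huge pages.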

First with the flag unset:

# perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 0
Set thp_disabled state to 0
Process pid = 18027

                                                                                                   PF/
               MAX    MIN                        TOTCPU/      TOT_PF/               TOT_PF/      WSEC/
TYPE:  CPUS   WALL   WALL    SYS   USER  TOTCPU      CPU     WALL_SEC               SYS_SEC        CPU  NODES
        512  1.120  0.060  0.000  0.110   0.110    0.000  28571428864  -9223372036854775808   55803572     23

Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

273719072.841402 task-clock # 641.026 CPUs utilized [100.00%]
1,008,986 context-switches # 0.000 M/sec [100.00%]
7,717 CPU-migrations # 0.000 M/sec [100.00%]
1,698,932 page-faults # 0.000 M/sec
355,222,544,890,379 cycles # 1.298 GHz [100.00%]
536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%]
409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%]
148,286,797,266,411 instructions # 0.42 insns per cycle
# 3.62 stalled cycles per insn [100.00%]
27,061,793,159,503 branches # 98.867 M/sec [100.00%]
1,188,655,196 branch-misses # 0.00% of all branches

427.001706337 seconds time elapsed

Now with the flag set:

# perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 1
Set thp_disabled state to 1
Process pid = 144957

                                                                                                   PF/
               MAX    MIN                        TOTCPU/      TOT_PF/               TOT_PF/      WSEC/
TYPE:  CPUS   WALL   WALL    SYS   USER  TOTCPU      CPU     WALL_SEC               SYS_SEC        CPU  NODES
        512  0.620  0.260  0.250  0.320   0.570    0.001  51612901376          128000000000  100806448     23

Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

138789390.540183 task-clock # 641.959 CPUs utilized [100.00%]
534,205 context-switches # 0.000 M/sec [100.00%]
4,595 CPU-migrations # 0.000 M/sec [100.00%]
63,133,119 page-faults # 0.000 M/sec
147,977,747,269,768 cycles # 1.066 GHz [100.00%]
200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%]
105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%]
180,916,213,503,160 instructions # 1.22 insns per cycle
# 1.11 stalled cycles per insn [100.00%]
26,999,511,005,868 branches # 194.536 M/sec [100.00%]
714,066,351 branch-misses # 0.00% of all branches

216.196778807 seconds time elapsed

As with previous versions of the patch, we're getting about a 2x
performance increase here. Here's a link to the test case I used, along
with the little wrapper to activate the flag:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

Let me know if anybody has any further suggestions here. Thanks!

Cc: Alexander Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Alex Thorlton (3):
Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE
exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

arch/s390/mm/pgtable.c | 3 +++
fs/binfmt_elf.c | 4 ----
include/linux/mm.h | 2 ++
include/uapi/linux/prctl.h | 3 +++
kernel/fork.c | 11 ++++++++---
kernel/sys.c | 17 +++++++++++++++++
mm/huge_memory.c | 4 ----
7 files changed, 33 insertions(+), 11 deletions(-)

--
1.7.12.4


2014-01-31 18:24:03

by Alex Thorlton

Subject: [PATCH 1/3] Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85cm, and adds
in code to fix up any issues caused by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL
when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
patch set.

Signed-off-by: Alex Thorlton <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

---
arch/s390/mm/pgtable.c | 3 +++
mm/huge_memory.c | 4 ----
2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 3584ed9..a87cdb4 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
 	if (!pmd_present(*pmd) &&
 	    __pte_alloc(mm, vma, pmd, vmaddr))
 		return -ENOMEM;
+	/* large pmds cannot yet be handled */
+	if (pmd_large(*pmd))
+		return -EFAULT;
 	/* pmd now points to a valid segment table entry. */
 	rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
 	if (!rmap)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..a4310a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
-	struct mm_struct *mm = vma->vm_mm;
-
 	switch (advice) {
 	case MADV_HUGEPAGE:
 		/*
@@ -1977,8 +1975,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 */
 		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
 			return -EINVAL;
-		if (mm->def_flags & VM_NOHUGEPAGE)
-			return -EINVAL;
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*
--
1.7.12.4

2014-01-31 18:25:21

by Alex Thorlton

Subject: [PATCH 3/3] exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

load_elf_binary() sets current->mm->def_flags = def_flags, and
def_flags is always zero. Not only does this look strange, it is also
unnecessary, because mm_init() has already set ->def_flags = 0.

Signed-off-by: Alex Thorlton <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

---
fs/binfmt_elf.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 67be295..d09bd9c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -579,7 +579,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long reloc_func_desc __maybe_unused = 0;
 	int executable_stack = EXSTACK_DEFAULT;
-	unsigned long def_flags = 0;
 	struct pt_regs *regs = current_pt_regs();
 	struct {
 		struct elfhdr elf_ex;
@@ -719,9 +718,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (retval)
 		goto out_free_dentry;

-	/* OK, This is the point of no return */
-	current->mm->def_flags = def_flags;
-
 	/* Do this immediately, since STACK_TOP as used in setup_arg_pages
 	   may depend on the personality. */
 	SET_PERSONALITY(loc->elf_ex);
--
1.7.12.4

2014-01-31 18:25:53

by Alex Thorlton

Subject: Re: [PATCHv3 0/3] Add mm flag to control THP

Ugh. Screwed up the git send-email somehow. Sorry for the duplicates
in the thread. I'll get it right one of these days...

- Alex

2014-01-31 18:26:37

by Alex Thorlton

Subject: [PATCH 2/3] Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE

This patch adds a VM_INIT_DEF_MASK, to allow us to set the default flags
for VMAs. It also adds a prctl control which allows us to set the THP
disable bit in mm->def_flags so that VMAs will pick up the setting as
they are created.
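
The inheritance described above can be modelled in user space. This is
only a sketch: the MODEL_* bit values and the model_* names are made up
for illustration, and the real kernel structures carry far more state.

```c
#include <stddef.h>

/* Illustrative bit values for this model only; not the kernel's. */
#define MODEL_VM_NOHUGEPAGE	0x2UL
#define MODEL_VM_INIT_DEF_MASK	MODEL_VM_NOHUGEPAGE

struct model_mm {
	unsigned long def_flags;
};

/* Mirrors the new mm_init() branch: a new mm keeps only the bits of
 * the current mm's def_flags that are in the init mask. */
static void model_mm_init(struct model_mm *mm, const struct model_mm *cur)
{
	mm->def_flags = cur ? (cur->def_flags & MODEL_VM_INIT_DEF_MASK) : 0;
}

/* A new vma starts from the requested flags plus the mm defaults. */
static unsigned long model_new_vma_flags(const struct model_mm *mm,
					 unsigned long requested)
{
	return requested | mm->def_flags;
}
```

With the parent's def_flags containing MODEL_VM_NOHUGEPAGE, every vma
created in the child starts with that bit set, while any default bit
outside the mask is dropped at mm_init() time.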

Signed-off-by: Alex Thorlton <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: liguang <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
include/linux/mm.h | 2 ++
include/uapi/linux/prctl.h | 3 +++
kernel/fork.c | 11 ++++++++---
kernel/sys.c | 17 +++++++++++++++++
4 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..c0a94ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -177,6 +177,8 @@ extern unsigned int kobjsize(const void *objp);
*/
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)

+#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
+
/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 289760f..58afc04 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -149,4 +149,7 @@

#define PR_GET_TID_ADDRESS 40

+#define PR_SET_THP_DISABLE 41
+#define PR_GET_THP_DISABLE 42
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..9fc0a30 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
-	mm->flags = (current->mm) ?
-		(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
 	mm->core_state = NULL;
 	atomic_long_set(&mm->nr_ptes, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
@@ -539,8 +537,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_owner(mm, p);
 	clear_tlb_flush_pending(mm);

-	if (likely(!mm_alloc_pgd(mm))) {
+	if (current->mm) {
+		mm->flags = current->mm->flags & MMF_INIT_MASK;
+		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
+	} else {
+		mm->flags = default_dump_filter;
 		mm->def_flags = 0;
+	}
+
+	if (likely(!mm_alloc_pgd(mm))) {
 		mmu_notifier_mm_init(mm);
 		return mm;
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index c0a58be..d59524a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1996,6 +1996,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
 		return current->no_new_privs ? 1 : 0;
+	case PR_GET_THP_DISABLE:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+	case PR_SET_THP_DISABLE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		down_write(&me->mm->mmap_sem);
+		if (option == PR_SET_THP_DISABLE) {
+			if (arg2)
+				me->mm->def_flags |= VM_NOHUGEPAGE;
+			else
+				me->mm->def_flags &= ~VM_NOHUGEPAGE;
+		} else {
+			error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
+		}
+		up_write(&me->mm->mmap_sem);
+		break;
 	default:
 		error = -EINVAL;
 		break;
--
1.7.12.4

2014-01-31 22:52:27

by Andrew Morton

Subject: Re: [PATCH 1/3] Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

On Fri, 31 Jan 2014 12:23:43 -0600 Alex Thorlton <[email protected]> wrote:

> This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85cm, and adds

'm' is not a hex digit ;)

> in code to fix up any issues caused by the revert.
>
> The revert is necessary because hugepage_madvise would return -EINVAL
> when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
> patch set.

This is a bit skimpy. Why doesn't the patch re-break kvm-on-s390?

It would be nice to have a lot more detail here, please: what was the
intent of 8e72033f2a48, how this patch retains 8e72033f2a48's behavior,
etc.

> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> if (!pmd_present(*pmd) &&
> __pte_alloc(mm, vma, pmd, vmaddr))
> return -ENOMEM;
> + /* large pmds cannot yet be handled */
> + if (pmd_large(*pmd))
> + return -EFAULT;

This bit wasn't in 8e72033f2a48.

2014-01-31 23:01:35

by Andrew Morton

Subject: Re: [PATCH 2/3] Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE

On Fri, 31 Jan 2014 12:23:45 -0600 Alex Thorlton <[email protected]> wrote:

> This patch adds a VM_INIT_DEF_MASK, to allow us to set the default flags
> for VMAs. It also adds a prctl control which allows us to set the THP
> disable bit in mm->def_flags so that VMAs will pick up the setting as
> they are created.
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -177,6 +177,8 @@ extern unsigned int kobjsize(const void *objp);
> */
> #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
>
> +#define VM_INIT_DEF_MASK VM_NOHUGEPAGE

Document this here?

> /*
> * mapping from the currently active vm_flags protection bits (the
> * low four bits) to a page protection mask..
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 289760f..58afc04 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -149,4 +149,7 @@
>
> #define PR_GET_TID_ADDRESS 40
>
> +#define PR_SET_THP_DISABLE 41
> +#define PR_GET_THP_DISABLE 42
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a17621c..9fc0a30 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> atomic_set(&mm->mm_count, 1);
> init_rwsem(&mm->mmap_sem);
> INIT_LIST_HEAD(&mm->mmlist);
> - mm->flags = (current->mm) ?
> - (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
> mm->core_state = NULL;
> atomic_long_set(&mm->nr_ptes, 0);
> memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> @@ -539,8 +537,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> mm_init_owner(mm, p);
> clear_tlb_flush_pending(mm);
>
> - if (likely(!mm_alloc_pgd(mm))) {
> + if (current->mm) {
> + mm->flags = current->mm->flags & MMF_INIT_MASK;
> + mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;

So VM_INIT_DEF_MASK defines which vm flags a process may inherit from
its parent?

> + } else {
> + mm->flags = default_dump_filter;
> mm->def_flags = 0;
> + }
> +
> + if (likely(!mm_alloc_pgd(mm))) {
> mmu_notifier_mm_init(mm);
> return mm;
> }
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c0a58be..d59524a 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1996,6 +1996,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> if (arg2 || arg3 || arg4 || arg5)
> return -EINVAL;
> return current->no_new_privs ? 1 : 0;
> + case PR_GET_THP_DISABLE:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;

Please add

/* fall through */

here. So people don't think you added a bug. Also, iirc there's a
static checking tool which will complain about this and there was talk
about using the /* fall through */ to suppress the warning.

> + case PR_SET_THP_DISABLE:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + down_write(&me->mm->mmap_sem);
> + if (option == PR_SET_THP_DISABLE) {
> + if (arg2)
> + me->mm->def_flags |= VM_NOHUGEPAGE;
> + else
> + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> + } else {
> + error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
> + }
> + up_write(&me->mm->mmap_sem);
> + break;

I suspect it would be simpler to not try to combine the set and get
code in the same lump.

The prctl() extension should be added to user-facing documentation.
Please work with Michael Kerrisk <[email protected]> on that.
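
One way the set and get paths might be kept apart, as suggested above,
is sketched here as a user-space model rather than as kernel code. The
MODEL_* constants and model_prctl_thp() are hypothetical names, the
def_flags word is passed in by the caller, and the mmap_sem locking of
the real patch is left out.

```c
#include <errno.h>

/* Illustrative values for this model only. */
#define MODEL_VM_NOHUGEPAGE		0x2UL
#define MODEL_PR_SET_THP_DISABLE	41
#define MODEL_PR_GET_THP_DISABLE	42

/* Each option validates its own arguments and returns directly, so
 * no case falls through into the next one. */
static long model_prctl_thp(int option, unsigned long arg2,
			    unsigned long arg3, unsigned long arg4,
			    unsigned long arg5, unsigned long *def_flags)
{
	switch (option) {
	case MODEL_PR_GET_THP_DISABLE:
		if (arg2 || arg3 || arg4 || arg5)
			return -EINVAL;
		return !!(*def_flags & MODEL_VM_NOHUGEPAGE);
	case MODEL_PR_SET_THP_DISABLE:
		if (arg3 || arg4 || arg5)
			return -EINVAL;
		if (arg2)
			*def_flags |= MODEL_VM_NOHUGEPAGE;
		else
			*def_flags &= ~MODEL_VM_NOHUGEPAGE;
		return 0;
	default:
		return -EINVAL;
	}
}
```

The argument checks match the ones in the quoted hunk; only the control
flow differs.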

2014-02-03 13:53:36

by Gerald Schaefer

Subject: Re: [PATCH 1/3] Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

On Fri, 31 Jan 2014 14:52:24 -0800
Andrew Morton <[email protected]> wrote:

> On Fri, 31 Jan 2014 12:23:43 -0600 Alex Thorlton <[email protected]> wrote:
>
> > This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85cm, and adds
>
> 'm' is not a hex digit ;)
>
> > in code to fix up any issues caused by the revert.
> >
> > The revert is necessary because hugepage_madvise would return -EINVAL
> > when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
> > patch set.
>
> This is a bit skimpy. Why doesn't the patch re-break kvm-on-s390?
>
> it would be nice to have a lot more detail here, please. What was the
> intent of 8e72033f2a48, how this patch retains 8e72033f2a48's behavior,
> etc.

The intent of 8e72033f2a48 was to guard against any future programming
errors that may result in an madvise(MADV_HUGEPAGE) on guest mappings,
which would crash the kernel.

Martin suggested adding the bit to arch/s390/mm/pgtable.c if 8e72033f2a48
was to be reverted, because that check will also prevent a kernel crash
in the case described above; it will now send a SIGSEGV instead.

This would now also allow madvise to be used on other parts, if needed,
so it is a more flexible approach. One could also say that it would have
been better to do it this way right from the beginning...

> > --- a/arch/s390/mm/pgtable.c
> > +++ b/arch/s390/mm/pgtable.c
> > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> > if (!pmd_present(*pmd) &&
> > __pte_alloc(mm, vma, pmd, vmaddr))
> > return -ENOMEM;
> > + /* large pmds cannot yet be handled */
> > + if (pmd_large(*pmd))
> > + return -EFAULT;
>
> This bit wasn't in 8e72033f2a48.

Yes, in order to be on the safe side regarding potential distribution
backports, it would be good to have the revert and the "replacement"
in the same patch.


2014-02-03 17:14:17

by Alex Thorlton

Subject: Re: [PATCH 1/3] Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

On Fri, Jan 31, 2014 at 02:52:24PM -0800, Andrew Morton wrote:
> On Fri, 31 Jan 2014 12:23:43 -0600 Alex Thorlton <[email protected]> wrote:
>
> > This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85cm, and adds
>
> 'm' is not a hex digit ;)

My mistake! Sorry about that.

> > in code to fix up any issues caused by the revert.
> >
> > The revert is necessary because hugepage_madvise would return -EINVAL
> > when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
> > patch set.
>
> This is a bit skimpy. Why doesn't the patch re-break kvm-on-s390?
>
> it would be nice to have a lot more detail here, please. What was the
> intent of 8e72033f2a48, how this patch retains 8e72033f2a48's behavior,
> etc.

I'm actually not too sure about this, offhand. I just know that we
couldn't have it in there because of the check for VM_NOHUGEPAGE. The
s390 guys approved the revert, as long as we added in the following
piece:

> > --- a/arch/s390/mm/pgtable.c
> > +++ b/arch/s390/mm/pgtable.c
> > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> > if (!pmd_present(*pmd) &&
> > __pte_alloc(mm, vma, pmd, vmaddr))
> > return -ENOMEM;
> > + /* large pmds cannot yet be handled */
> > + if (pmd_large(*pmd))
> > + return -EFAULT;
>
> This bit wasn't in 8e72033f2a48.

I added the fix-up code in with the revert, so that it would all be in
one place; wasn't sure what the standard was for this sort of thing. If
it's preferable to see this code in a separate patch, that's easy enough
to do.

I'll look into exactly what the original commit was intended to do, and
get a better description of what's going on here. Let me know if I
should split the two changes into separate patches.

- Alex

2014-02-03 17:22:45

by Alex Thorlton

Subject: Re: [PATCH 2/3] Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE

On Fri, Jan 31, 2014 at 03:00:58PM -0800, Andrew Morton wrote:
> On Fri, 31 Jan 2014 12:23:45 -0600 Alex Thorlton <[email protected]> wrote:
>
> > This patch adds a VM_INIT_DEF_MASK, to allow us to set the default flags
> > for VMs. It also adds a prctl control which allows us to set the THP
> > disable bit in mm->def_flags so that VMs will pick up the setting as
> > they are created.
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -177,6 +177,8 @@ extern unsigned int kobjsize(const void *objp);
> > */
> > #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
> >
> > +#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
>
> Document this here?

Can do. I suppose it's not exactly self-explanatory :)

> > /*
> > * mapping from the currently active vm_flags protection bits (the
> > * low four bits) to a page protection mask..
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 289760f..58afc04 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -149,4 +149,7 @@
> >
> > #define PR_GET_TID_ADDRESS 40
> >
> > +#define PR_SET_THP_DISABLE 41
> > +#define PR_GET_THP_DISABLE 42
> > +
> > #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index a17621c..9fc0a30 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> > atomic_set(&mm->mm_count, 1);
> > init_rwsem(&mm->mmap_sem);
> > INIT_LIST_HEAD(&mm->mmlist);
> > - mm->flags = (current->mm) ?
> > - (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
> > mm->core_state = NULL;
> > atomic_long_set(&mm->nr_ptes, 0);
> > memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> > @@ -539,8 +537,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> > mm_init_owner(mm, p);
> > clear_tlb_flush_pending(mm);
> >
> > - if (likely(!mm_alloc_pgd(mm))) {
> > + if (current->mm) {
> > + mm->flags = current->mm->flags & MMF_INIT_MASK;
> > + mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>
> So VM_INIT_DEF_MASK defines which vm flags a process may inherit from
> its parent?

Yep. It behaves pretty much the same way as MMF_INIT_MASK.
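
For illustration only, here is a rough userspace model of that inheritance
rule. The DEMO_ names and bit values are made up for this sketch; the real
flag values live in include/linux/mm.h, and this is not the kernel code
itself:

```c
/* Hypothetical bit values, chosen only for this userspace model. */
#define DEMO_VM_NOHUGEPAGE    0x1UL
#define DEMO_VM_OTHER         0x2UL  /* stands in for any flag outside the mask */
#define DEMO_VM_INIT_DEF_MASK DEMO_VM_NOHUGEPAGE

/*
 * Models the def_flags half of the mm_init() hunk: a new mm keeps only
 * the parent's def_flags bits named in the mask. With no parent mm
 * (null pointer here), it starts from zero.
 */
static unsigned long demo_inherit_def_flags(const unsigned long *parent_def_flags)
{
	if (parent_def_flags)
		return *parent_def_flags & DEMO_VM_INIT_DEF_MASK;
	return 0;
}
```

So a parent with both demo bits set would hand down only DEMO_VM_NOHUGEPAGE,
which is the same filtering MMF_INIT_MASK does for mm->flags.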

> > + } else {
> > + mm->flags = default_dump_filter;
> > mm->def_flags = 0;
> > + }
> > +
> > + if (likely(!mm_alloc_pgd(mm))) {
> > mmu_notifier_mm_init(mm);
> > return mm;
> > }
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index c0a58be..d59524a 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1996,6 +1996,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > if (arg2 || arg3 || arg4 || arg5)
> > return -EINVAL;
> > return current->no_new_privs ? 1 : 0;
> > + case PR_GET_THP_DISABLE:
> > + if (arg2 || arg3 || arg4 || arg5)
> > + return -EINVAL;
>
> Please add
>
> /* fall through */
>
> here. So people don't think you added a bug. Also, iirc there's a
> static checking tool which will complain about this and there was talk
> about using the /* fall through */ to suppress the warning.

Understood. More comments below.

> > + case PR_SET_THP_DISABLE:
> > + if (arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + down_write(&me->mm->mmap_sem);
> > + if (option == PR_SET_THP_DISABLE) {
> > + if (arg2)
> > + me->mm->def_flags |= VM_NOHUGEPAGE;
> > + else
> > + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > + } else {
> > + error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
> > + }
> > + up_write(&me->mm->mmap_sem);
> > + break;
>
> I suspect it would be simpler to not try to combine the set and get
> code in the same lump.

I think you're right here. This is what we originally came up with for
this piece, but I think it will look simpler to do each check
separately. In that case, we won't need the /* fall through */ either,
so that will take care of both issues.
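
Roughly, the split version might look like the following. This is only a
userspace model to show the shape of the two cases, not the next revision
of the patch: struct demo_mm, demo_prctl() and the DEMO_ constants are
stand-ins invented here, and the mmap_sem locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_PR_SET_THP_DISABLE 41    /* same numbers as in the patch */
#define DEMO_PR_GET_THP_DISABLE 42
#define DEMO_VM_NOHUGEPAGE      0x1UL /* hypothetical bit value */

/* Stand-in for the one mm_struct field this sketch touches. */
struct demo_mm {
	unsigned long def_flags;
};

static long demo_prctl(struct demo_mm *mm, int option, unsigned long arg2,
		       unsigned long arg3, unsigned long arg4,
		       unsigned long arg5)
{
	long error = 0;

	assert(mm != NULL);

	switch (option) {
	case DEMO_PR_GET_THP_DISABLE:
		/* GET takes no arguments; report the current setting. */
		if (arg2 || arg3 || arg4 || arg5)
			return -EINVAL;
		error = !!(mm->def_flags & DEMO_VM_NOHUGEPAGE);
		break;
	case DEMO_PR_SET_THP_DISABLE:
		/* SET uses only arg2; the kernel would hold mmap_sem here. */
		if (arg3 || arg4 || arg5)
			return -EINVAL;
		if (arg2)
			mm->def_flags |= DEMO_VM_NOHUGEPAGE;
		else
			mm->def_flags &= ~DEMO_VM_NOHUGEPAGE;
		break;
	default:
		error = -EINVAL;
		break;
	}
	return error;
}
```

With each case ending in its own break, there is nothing left to fall
through to, and the option == PR_SET_THP_DISABLE test inside the body
goes away.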

> The prctl() extension should be added to user-facing documentation.
> Please work with Michael Kerrisk <[email protected]> on that.

Got it. I'll make sure that gets in on the next pass.
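
As a starting point for that documentation, usage from userspace would
presumably look something like the sketch below. It assumes the option
numbers from this patch (41 and 42) when the installed headers do not
define them, and demo_disable_thp() is just an illustrative helper name.
A kernel without this patch is expected to reject the call with EINVAL.

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Fall back to the values proposed in this patch if the headers lack them. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/*
 * Ask the kernel to disable THP for the calling process, then read the
 * setting back. Returns the value reported by PR_GET_THP_DISABLE, or -1
 * if the running kernel rejects either request.
 */
static int demo_disable_thp(void)
{
	if (prctl(PR_SET_THP_DISABLE, 1UL, 0UL, 0UL, 0UL) != 0) {
		perror("prctl(PR_SET_THP_DISABLE)");
		return -1;
	}
	return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

Since the setting is carried across fork/exec via VM_INIT_DEF_MASK, a small
wrapper could call this and then exec the unmodified job.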

Thanks for the input, Andrew!

- Alex

2014-02-03 17:47:21

by Oleg Nesterov

Subject: Re: [PATCH 2/3] Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE

On 01/31, Alex Thorlton wrote:
>
> This patch adds a VM_INIT_DEF_MASK,

Perhaps it makes sense to tell a bit more... We add this mask to preserve
VM_NOHUGEPAGE after fork/exec. And this obviously affects s390; for example,
the result of KVM_S390_ENABLE_SIE will be preserved.

I hope this is fine, but it should be documented and it would be nice to have
the acks from Gerald.


> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1996,6 +1996,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> if (arg2 || arg3 || arg4 || arg5)
> return -EINVAL;
> return current->no_new_privs ? 1 : 0;
> + case PR_GET_THP_DISABLE:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;

Cosmetic, but PR_GET_THP_DISABLE only needs to check arg2.

OTOH,

> + case PR_SET_THP_DISABLE:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + down_write(&me->mm->mmap_sem);
> + if (option == PR_SET_THP_DISABLE) {
> + if (arg2)
> + me->mm->def_flags |= VM_NOHUGEPAGE;
> + else
> + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> + } else {
> + error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
> + }
> + up_write(&me->mm->mmap_sem);
> + break;

Perhaps _GET_ doesn't even need ->mmap_sem; I do not see how the lockless
"&" can get an inconsistent result. But I am fine either way.

Oleg.