2018-03-20 17:28:34

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

When running the sampler detailed below, the kernel, if built with the VM
debug option turned on (as many distro do), is panicing with the following
message :
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
8<--8<--8<--8< snip 8<--8<--8<--8<
CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E
4.15.0-10-generic #11-Ubuntu
NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E
(4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER:
20040000
CFAR: c00000000036ee44 SOFTE: 1
GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40
GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000
GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000
GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554
GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40
GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950
NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760
LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50
Call Trace:
[c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable)
[c000003fbcdcfb50] [c00000000036ee48]
__unmap_hugepage_range_final+0x28/0x50
[c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190
[c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140
[c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0
[c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0
[c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80
[c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100
[c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30
[c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c
Instruction dump:
552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010
7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468
---[ end trace ee88f958a1c62605 ]---

The panic is due to a VMA pointing to a hugetlb area while the
vma->vm_start or vma->vm_end field are not aligned to the huge page
boundaries. The sampler is just unmapping a part of the hugetlb area,
leading to 2 VMAs which are not well aligned. The same could be achieved
by calling madvise() situation, as it is when running:
stress-ng --shm-sysv 1

The hugetlb code is assuming that the VMA will be well aligned when it is
unmapped, so we must prevent such a VMA to be split or shrink to a
misaligned address.

This patch is preventing this by checking the new VMA's boundaries when a
VMA is modified by calling vma_adjust().

If this patch is applied, stable should be Cced.

--- Sampler used to hit the panic
nclude <sys/ipc.h>

unsigned long page_size;

int main(void)
{
int shmid, ret=1;
void *addr;

setbuf(stdout, NULL);
page_size = getpagesize();

shmid = shmget(0x1410, LENGTH, IPC_CREAT | SHM_HUGETLB | SHM_R |
SHM_W);
if (shmid < 0) {
perror("shmget");
exit(1);
}

printf("shmid: %d\n", shmid);

addr = shmat(shmid, NULL, 0);
if (addr == (void*)-1) {
perror("shmat");
goto out;
}

/*
* The following munmap() call will split the VMA in 2, leading to
* unaligned to huge page size VMAs which will trigger a check when
* shmdt() is called.
*/
if (munmap(addr + HPSIZE + page_size, page_size)) {
perror("munmap");
goto out;
}

if (shmdt(addr)) {
perror("shmdt");
goto out;
}

printf("test done.\n");
ret = 0;

out:
shmctl(shmid, IPC_RMID, NULL);
return ret;
}
--- End of code

Signed-off-by: Laurent Dufour <[email protected]>
---
mm/mmap.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 188f195883b9..5dbf4b69a798 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -692,6 +692,17 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
long adjust_next = 0;
int remove_next = 0;

+ if (is_vm_hugetlb_page(vma)) {
+ /*
+ * We must check against the huge page boundarie to not
+ * create misaligned VMA.
+ */
+ struct hstate *h = hstate_vma(vma);
+
+ if (start & ~huge_page_mask(h) || end & ~huge_page_mask(h))
+ return -EINVAL;
+ }
+
if (next && !insert) {
struct vm_area_struct *exporter = NULL, *importer = NULL;

--
2.7.4



2018-03-20 21:28:27

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

On 03/20/2018 10:25 AM, Laurent Dufour wrote:
> When running the sampler detailed below, the kernel, if built with the VM
> debug option turned on (as many distro do), is panicing with the following
> message :
> kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
> 8<--8<--8<--8< snip 8<--8<--8<--8<
> CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E
> 4.15.0-10-generic #11-Ubuntu
> NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
> REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E
> (4.15.0-10-generic)
> MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER:
> 20040000
> CFAR: c00000000036ee44 SOFTE: 1
> GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40
> GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000
> GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000
> GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554
> GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40
> GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950
> NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760
> LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50
> Call Trace:
> [c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable)
> [c000003fbcdcfb50] [c00000000036ee48]
> __unmap_hugepage_range_final+0x28/0x50
> [c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190
> [c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140
> [c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0
> [c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0
> [c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80
> [c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100
> [c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30
> [c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c
> Instruction dump:
> 552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010
> 7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468
> ---[ end trace ee88f958a1c62605 ]---
>
> The panic is due to a VMA pointing to a hugetlb area while the
> vma->vm_start or vma->vm_end field are not aligned to the huge page
> boundaries. The sampler is just unmapping a part of the hugetlb area,
> leading to 2 VMAs which are not well aligned. The same could be achieved
> by calling madvise() situation, as it is when running:
> stress-ng --shm-sysv 1
>
> The hugetlb code is assuming that the VMA will be well aligned when it is
> unmapped, so we must prevent such a VMA to be split or shrink to a
> misaligned address.
>
> This patch is preventing this by checking the new VMA's boundaries when a
> VMA is modified by calling vma_adjust().
>
> If this patch is applied, stable should be Cced.

Thanks Laurent!

This bug was introduced by 31383c6865a5. Dan's changes for 31383c6865a5
seem pretty straight forward. It simply replaces an explicit check when
splitting a vma to a new vm_ops split callout. Unfortunately, mappings
created via shmget/shmat have their vm_ops replaced. Therefore, this
split callout is never made.

The shm vm_ops do indirectly call the original vm_ops routines as needed.
Therefore, I would suggest a patch something like the following instead.
If we move forward with the patch, we should include Laurent's BUG output
and perhaps test program in the commit message.

--
Mike Kravetz

From 7a19414319c7937fd2757c27f936258f16c1f61d Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Tue, 20 Mar 2018 13:56:57 -0700
Subject: [PATCH] shm: add split function to shm_vm_ops

The split function was added to vm_operations_struct to determine
if a mapping can be split. This was mostly for device-dax and
hugetlbfs mappings which have specific alignment constraints.

mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops. shm_vm_ops functions will call back
to the original vm_ops if needed. Add such a split function.

Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct)
Reported by: Laurent Dufour <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
ipc/shm.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/ipc/shm.c b/ipc/shm.c
index 7acda23430aa..50e88fc060b1 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -386,6 +386,17 @@ static int shm_fault(struct vm_fault *vmf)
return sfd->vm_ops->fault(vmf);
}

+static int shm_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct file *file = vma->vm_file;
+ struct shm_file_data *sfd = shm_file_data(file);
+
+ if (sfd->vm_ops && sfd->vm_ops->split)
+ return sfd->vm_ops->split(vma, addr);
+
+ return 0;
+}
+
#ifdef CONFIG_NUMA
static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
{
@@ -510,6 +521,7 @@ static const struct vm_operations_struct shm_vm_ops = {
.open = shm_open, /* callback for a new vm-area open */
.close = shm_close, /* callback for when the vm-area is released */
.fault = shm_fault,
+ .split = shm_split,
#if defined(CONFIG_NUMA)
.set_policy = shm_set_policy,
.get_policy = shm_get_policy,
--
2.13.6


2018-03-20 21:36:56

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

On 03/20/2018 02:26 PM, Mike Kravetz wrote:
> Thanks Laurent!
>
> This bug was introduced by 31383c6865a5. Dan's changes for 31383c6865a5
> seem pretty straight forward. It simply replaces an explicit check when
> splitting a vma to a new vm_ops split callout. Unfortunately, mappings
> created via shmget/shmat have their vm_ops replaced. Therefore, this
> split callout is never made.
>
> The shm vm_ops do indirectly call the original vm_ops routines as needed.
> Therefore, I would suggest a patch something like the following instead.
> If we move forward with the patch, we should include Laurent's BUG output
> and perhaps test program in the commit message.

Sorry, patch in previous mail was a mess

From 7a19414319c7937fd2757c27f936258f16c1f61d Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Tue, 20 Mar 2018 13:56:57 -0700
Subject: [PATCH] shm: add split function to shm_vm_ops

The split function was added to vm_operations_struct to determine
if a mapping can be split. This was mostly for device-dax and
hugetlbfs mappings which have specific alignment constraints.

mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops. shm_vm_ops functions will call back
to the original vm_ops if needed. Add such a split function.

Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct)
Reported by: Laurent Dufour <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
ipc/shm.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/ipc/shm.c b/ipc/shm.c
index 7acda23430aa..50e88fc060b1 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -386,6 +386,17 @@ static int shm_fault(struct vm_fault *vmf)
return sfd->vm_ops->fault(vmf);
}

+static int shm_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct file *file = vma->vm_file;
+ struct shm_file_data *sfd = shm_file_data(file);
+
+ if (sfd->vm_ops && sfd->vm_ops->split)
+ return sfd->vm_ops->split(vma, addr);
+
+ return 0;
+}
+
#ifdef CONFIG_NUMA
static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
{
@@ -510,6 +521,7 @@ static const struct vm_operations_struct shm_vm_ops = {
.open = shm_open, /* callback for a new vm-area open */
.close = shm_close, /* callback for when the vm-area is released */
.fault = shm_fault,
+ .split = shm_split,
#if defined(CONFIG_NUMA)
.set_policy = shm_set_policy,
.get_policy = shm_get_policy,
--
2.13.6


2018-03-21 08:19:45

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

On 20/03/2018 22:26, Mike Kravetz wrote:
> On 03/20/2018 10:25 AM, Laurent Dufour wrote:
>> When running the sampler detailed below, the kernel, if built with the VM
>> debug option turned on (as many distro do), is panicing with the following
>> message :
>> kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> LE SMP NR_CPUS=2048 NUMA PowerNV
>> Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
>> 8<--8<--8<--8< snip 8<--8<--8<--8<
>> CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E
>> 4.15.0-10-generic #11-Ubuntu
>> NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
>> REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E
>> (4.15.0-10-generic)
>> MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER:
>> 20040000
>> CFAR: c00000000036ee44 SOFTE: 1
>> GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40
>> GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000
>> GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000
>> GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000
>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554
>> GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40
>> GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950
>> NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760
>> LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50
>> Call Trace:
>> [c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable)
>> [c000003fbcdcfb50] [c00000000036ee48]
>> __unmap_hugepage_range_final+0x28/0x50
>> [c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190
>> [c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140
>> [c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0
>> [c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0
>> [c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80
>> [c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100
>> [c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30
>> [c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c
>> Instruction dump:
>> 552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010
>> 7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468
>> ---[ end trace ee88f958a1c62605 ]---
>>
>> The panic is due to a VMA pointing to a hugetlb area while the
>> vma->vm_start or vma->vm_end field are not aligned to the huge page
>> boundaries. The sampler is just unmapping a part of the hugetlb area,
>> leading to 2 VMAs which are not well aligned. The same could be achieved
>> by calling madvise() situation, as it is when running:
>> stress-ng --shm-sysv 1
>>
>> The hugetlb code is assuming that the VMA will be well aligned when it is
>> unmapped, so we must prevent such a VMA to be split or shrink to a
>> misaligned address.
>>
>> This patch is preventing this by checking the new VMA's boundaries when a
>> VMA is modified by calling vma_adjust().
>>
>> If this patch is applied, stable should be Cced.
>
> Thanks Laurent!
>
> This bug was introduced by 31383c6865a5. Dan's changes for 31383c6865a5
> seem pretty straight forward. It simply replaces an explicit check when
> splitting a vma to a new vm_ops split callout. Unfortunately, mappings
> created via shmget/shmat have their vm_ops replaced. Therefore, this
> split callout is never made.
>
> The shm vm_ops do indirectly call the original vm_ops routines as needed.
> Therefore, I would suggest a patch something like the following instead.
> If we move forward with the patch, we should include Laurent's BUG output
> and perhaps test program in the commit message.

Hi Mike,

That's definitively smarter ! I missed that split() new vm ops...

Cheers,
Laurent.


2018-03-21 08:23:59

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

On 20/03/2018 22:35, Mike Kravetz wrote:
> On 03/20/2018 02:26 PM, Mike Kravetz wrote:
>> Thanks Laurent!
>>
>> This bug was introduced by 31383c6865a5. Dan's changes for 31383c6865a5
>> seem pretty straight forward. It simply replaces an explicit check when
>> splitting a vma to a new vm_ops split callout. Unfortunately, mappings
>> created via shmget/shmat have their vm_ops replaced. Therefore, this
>> split callout is never made.
>>
>> The shm vm_ops do indirectly call the original vm_ops routines as needed.
>> Therefore, I would suggest a patch something like the following instead.
>> If we move forward with the patch, we should include Laurent's BUG output
>> and perhaps test program in the commit message.
>
> Sorry, patch in previous mail was a mess
>
> From 7a19414319c7937fd2757c27f936258f16c1f61d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Tue, 20 Mar 2018 13:56:57 -0700
> Subject: [PATCH] shm: add split function to shm_vm_ops
>
> The split function was added to vm_operations_struct to determine
> if a mapping can be split. This was mostly for device-dax and
> hugetlbfs mappings which have specific alignment constraints.
>
> mappings initiated via shmget/shmat have their original vm_ops
> overwritten with shm_vm_ops. shm_vm_ops functions will call back
> to the original vm_ops if needed. Add such a split function.

FWIW,
Reviewed-by: Laurent Dufour <[email protected]>
Tested-by: Laurent Dufour <[email protected]>

> Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct)
> Reported by: Laurent Dufour <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> ipc/shm.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 7acda23430aa..50e88fc060b1 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -386,6 +386,17 @@ static int shm_fault(struct vm_fault *vmf)
> return sfd->vm_ops->fault(vmf);
> }
>
> +static int shm_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct file *file = vma->vm_file;
> + struct shm_file_data *sfd = shm_file_data(file);
> +
> + if (sfd->vm_ops && sfd->vm_ops->split)
> + return sfd->vm_ops->split(vma, addr);
> +
> + return 0;
> +}
> +
> #ifdef CONFIG_NUMA
> static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
> {
> @@ -510,6 +521,7 @@ static const struct vm_operations_struct shm_vm_ops = {
> .open = shm_open, /* callback for a new vm-area open */
> .close = shm_close, /* callback for when the vm-area is released */
> .fault = shm_fault,
> + .split = shm_split,
> #if defined(CONFIG_NUMA)
> .set_policy = shm_set_policy,
> .get_policy = shm_get_policy,
>


2018-03-21 08:42:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

On Tue 20-03-18 14:35:28, Mike Kravetz wrote:
> On 03/20/2018 02:26 PM, Mike Kravetz wrote:
> > Thanks Laurent!
> >
> > This bug was introduced by 31383c6865a5. Dan's changes for 31383c6865a5
> > seem pretty straight forward. It simply replaces an explicit check when
> > splitting a vma to a new vm_ops split callout. Unfortunately, mappings
> > created via shmget/shmat have their vm_ops replaced. Therefore, this
> > split callout is never made.
> >
> > The shm vm_ops do indirectly call the original vm_ops routines as needed.
> > Therefore, I would suggest a patch something like the following instead.
> > If we move forward with the patch, we should include Laurent's BUG output
> > and perhaps test program in the commit message.
>
> Sorry, patch in previous mail was a mess
>
> >From 7a19414319c7937fd2757c27f936258f16c1f61d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Tue, 20 Mar 2018 13:56:57 -0700
> Subject: [PATCH] shm: add split function to shm_vm_ops
>
> The split function was added to vm_operations_struct to determine
> if a mapping can be split. This was mostly for device-dax and
> hugetlbfs mappings which have specific alignment constraints.
>
> mappings initiated via shmget/shmat have their original vm_ops
> overwritten with shm_vm_ops. shm_vm_ops functions will call back
> to the original vm_ops if needed. Add such a split function.
>
> Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct)
> Reported by: Laurent Dufour <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>

Yes this looks much better than the original hugetlb specific code in
the generic vma code.

Please add the original VM_BUG_ON report to the changelog

Cc: stable
Acked-by: Michal Hocko <[email protected]>

> ---
> ipc/shm.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 7acda23430aa..50e88fc060b1 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -386,6 +386,17 @@ static int shm_fault(struct vm_fault *vmf)
> return sfd->vm_ops->fault(vmf);
> }
>
> +static int shm_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct file *file = vma->vm_file;
> + struct shm_file_data *sfd = shm_file_data(file);
> +
> + if (sfd->vm_ops && sfd->vm_ops->split)
> + return sfd->vm_ops->split(vma, addr);
> +
> + return 0;
> +}
> +
> #ifdef CONFIG_NUMA
> static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
> {
> @@ -510,6 +521,7 @@ static const struct vm_operations_struct shm_vm_ops = {
> .open = shm_open, /* callback for a new vm-area open */
> .close = shm_close, /* callback for when the vm-area is released */
> .fault = shm_fault,
> + .split = shm_split,
> #if defined(CONFIG_NUMA)
> .set_policy = shm_set_policy,
> .get_policy = shm_get_policy,
> --
> 2.13.6

--
Michal Hocko
SUSE Labs

2018-03-21 16:16:05

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH v2] shm: add split function to shm_vm_ops

If System V shmget/shmat operations are used to create a hugetlbfs
backed mapping, it is possible to munmap part of the mapping and
split the underlying vma such that it is not huge page aligned.
This will untimately result in the following BUG:

kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
8<--8<--8<--8< snip 8<--8<--8<--8<
CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E
4.15.0-10-generic #11-Ubuntu
NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E
(4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER:
20040000
CFAR: c00000000036ee44 SOFTE: 1
GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40
GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000
GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000
GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554
GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40
GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950
NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760
LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50
Call Trace:
[c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable)
[c000003fbcdcfb50] [c00000000036ee48]
__unmap_hugepage_range_final+0x28/0x50
[c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190
[c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140
[c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0
[c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0
[c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80
[c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100
[c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30
[c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c
Instruction dump:
552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010
7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468
---[ end trace ee88f958a1c62605 ]---

This bug was introduced by commit 31383c6865a5 ("mm, hugetlbfs:
introduce ->split() to vm_operations_struct"). A split function
was added to vm_operations_struct to determine if a mapping can
be split. This was mostly for device-dax and hugetlbfs mappings
which have specific alignment constraints.

Mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops. shm_vm_ops functions will call back
to the original vm_ops if needed. Add such a split function to
shm_vm_ops.

Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
Signed-off-by: Mike Kravetz <[email protected]>
Reported by: Laurent Dufour <[email protected]>
Tested-by: Laurent Dufour <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: [email protected]
---
Changes in v2
* Updated commit message
* Cc stable

ipc/shm.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/ipc/shm.c b/ipc/shm.c
index 4643865e9171..93e0e3a4d009 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -386,6 +386,17 @@ static int shm_fault(struct vm_fault *vmf)
return sfd->vm_ops->fault(vmf);
}

+static int shm_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct file *file = vma->vm_file;
+ struct shm_file_data *sfd = shm_file_data(file);
+
+ if (sfd->vm_ops && sfd->vm_ops->split)
+ return sfd->vm_ops->split(vma, addr);
+
+ return 0;
+}
+
#ifdef CONFIG_NUMA
static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
{
@@ -510,6 +521,7 @@ static const struct vm_operations_struct shm_vm_ops = {
.open = shm_open, /* callback for a new vm-area open */
.close = shm_close, /* callback for when the vm-area is released */
.fault = shm_fault,
+ .split = shm_split,
#if defined(CONFIG_NUMA)
.set_policy = shm_set_policy,
.get_policy = shm_get_policy,
--
2.13.6


2018-03-21 18:43:56

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2] shm: add split function to shm_vm_ops

On Wed, Mar 21, 2018 at 9:13 AM, Mike Kravetz <[email protected]> wrote:
>
> If System V shmget/shmat operations are used to create a hugetlbfs
> backed mapping, it is possible to munmap part of the mapping and
> split the underlying vma such that it is not huge page aligned.
> This will untimately result in the following BUG:
>
> kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
> 8<--8<--8<--8< snip 8<--8<--8<--8<
> CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E
> 4.15.0-10-generic #11-Ubuntu
> NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
> REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E
> (4.15.0-10-generic)
> MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER:
> 20040000
> CFAR: c00000000036ee44 SOFTE: 1
> GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40
> GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000
> GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000
> GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554
> GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40
> GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950
> NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760
> LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50
> Call Trace:
> [c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable)
> [c000003fbcdcfb50] [c00000000036ee48]
> __unmap_hugepage_range_final+0x28/0x50
> [c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190
> [c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140
> [c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0
> [c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0
> [c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80
> [c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100
> [c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30
> [c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c
> Instruction dump:
> 552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010
> 7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468
> ---[ end trace ee88f958a1c62605 ]---
>
> This bug was introduced by commit 31383c6865a5 ("mm, hugetlbfs:
> introduce ->split() to vm_operations_struct"). A split function
> was added to vm_operations_struct to determine if a mapping can
> be split. This was mostly for device-dax and hugetlbfs mappings
> which have specific alignment constraints.
>
> Mappings initiated via shmget/shmat have their original vm_ops
> overwritten with shm_vm_ops. shm_vm_ops functions will call back
> to the original vm_ops if needed. Add such a split function to
> shm_vm_ops.
>
> Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
> Signed-off-by: Mike Kravetz <[email protected]>
> Reported by: Laurent Dufour <[email protected]>
> Tested-by: Laurent Dufour <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Cc: [email protected]

Reviewed-by: Dan Williams <[email protected]>

...don't worry about resending if this has already hit a maintainer tree.

2018-03-21 20:57:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] shm: add split function to shm_vm_ops

On Wed, 21 Mar 2018 09:13:14 -0700 Mike Kravetz <[email protected]> wrote:

> If System V shmget/shmat operations are used to create a hugetlbfs
> backed mapping, it is possible to munmap part of the mapping and
> split the underlying vma such that it is not huge page aligned.
> This will untimately result in the following BUG:
>
> kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
> 8<--8<--8<--8< snip 8<--8<--8<--8<
> CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E
> 4.15.0-10-generic #11-Ubuntu
> NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
> REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E
> (4.15.0-10-generic)
> MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER:
> 20040000
> CFAR: c00000000036ee44 SOFTE: 1
> GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40
> GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000
> GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000
> GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554
> GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40
> GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950
> NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760
> LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50
> Call Trace:
> [c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable)
> [c000003fbcdcfb50] [c00000000036ee48]
> __unmap_hugepage_range_final+0x28/0x50
> [c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190
> [c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140
> [c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0
> [c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0
> [c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80
> [c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100
> [c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30
> [c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c
> Instruction dump:
> 552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010
> 7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468
> ---[ end trace ee88f958a1c62605 ]---
>
> This bug was introduced by commit 31383c6865a5 ("mm, hugetlbfs:
> introduce ->split() to vm_operations_struct"). A split function
> was added to vm_operations_struct to determine if a mapping can
> be split. This was mostly for device-dax and hugetlbfs mappings
> which have specific alignment constraints.
>
> Mappings initiated via shmget/shmat have their original vm_ops
> overwritten with shm_vm_ops. shm_vm_ops functions will call back
> to the original vm_ops if needed. Add such a split function to
> shm_vm_ops.
>
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -386,6 +386,17 @@ static int shm_fault(struct vm_fault *vmf)
> return sfd->vm_ops->fault(vmf);
> }
>
> +static int shm_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct file *file = vma->vm_file;
> + struct shm_file_data *sfd = shm_file_data(file);
> +
> + if (sfd->vm_ops && sfd->vm_ops->split)
> + return sfd->vm_ops->split(vma, addr);

This will be the only site which tests for NULL shm_file_data.vm_ops.
It's a can't-happen, methinks.

I think I'll leave it as it is for now and will queue up a non-urgent
patch:



From: Andrew Morton <[email protected]>
Subject: ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops

This was added by the recent "ipc/shm.c: add split function to
shm_vm_ops", but it is not necessary.

Cc: Laurent Dufour <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

ipc/shm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN ipc/shm.c~ipc-shmc-shm_split-remove-unneeded-test-for-null-shm_file_datavm_ops ipc/shm.c
--- a/ipc/shm.c~ipc-shmc-shm_split-remove-unneeded-test-for-null-shm_file_datavm_ops
+++ a/ipc/shm.c
@@ -391,7 +391,7 @@ static int shm_split(struct vm_area_stru
struct file *file = vma->vm_file;
struct shm_file_data *sfd = shm_file_data(file);

- if (sfd->vm_ops && sfd->vm_ops->split)
+ if (sfd->vm_ops->split)
return sfd->vm_ops->split(vma, addr);

return 0;
_


2018-03-21 22:55:28

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v2] shm: add split function to shm_vm_ops

On 03/21/2018 01:56 PM, Andrew Morton wrote:
> On Wed, 21 Mar 2018 09:13:14 -0700 Mike Kravetz <[email protected]> wrote:
>>
>> +static int shm_split(struct vm_area_struct *vma, unsigned long addr)
>> +{
>> + struct file *file = vma->vm_file;
>> + struct shm_file_data *sfd = shm_file_data(file);
>> +
>> + if (sfd->vm_ops && sfd->vm_ops->split)
>> + return sfd->vm_ops->split(vma, addr);
>
> This will be the only site which tests for NULL shm_file_data.vm_ops.
> It's a can't-happen, methinks.

You are correct, thanks for catching this.

>
> I think I'll leave it as it is for now and will queue up a non-urgent
> patch:
>
>
>
> From: Andrew Morton <[email protected]>
> Subject: ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops
>
> This was added by the recent "ipc/shm.c: add split function to
> shm_vm_ops", but it is not necessary.
>
> Cc: Laurent Dufour <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Manfred Spraul <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Looks good, FWIW
Reviewed-by: Mike Kravetz <[email protected]>

--
Mike Kravetz

> ---
>
> ipc/shm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff -puN ipc/shm.c~ipc-shmc-shm_split-remove-unneeded-test-for-null-shm_file_datavm_ops ipc/shm.c
> --- a/ipc/shm.c~ipc-shmc-shm_split-remove-unneeded-test-for-null-shm_file_datavm_ops
> +++ a/ipc/shm.c
> @@ -391,7 +391,7 @@ static int shm_split(struct vm_area_stru
> struct file *file = vma->vm_file;
> struct shm_file_data *sfd = shm_file_data(file);
>
> - if (sfd->vm_ops && sfd->vm_ops->split)
> + if (sfd->vm_ops->split)
> return sfd->vm_ops->split(vma, addr);
>
> return 0;
> _
>