2014-06-19 21:56:59

by Naoya Horiguchi

[permalink] [raw]
Subject: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

Hi,

I triggered the following bug on v3.16-rc1 when I did mbind() testing
where multiple processes repeat calling mbind() for a shared mapped file
(causing pingpong of page migration.)

In my investigation, it seems that some vma accidentally has vma->vm_start
= 0, which makes new_vma_page() choose a wrong vma and results in breaking
the assumption that the address passed to alloc_pages_vma() should be
inside a given vma.
I'm suspecting that mbind_range() do something wrong around vma handling,
but I don't have enough luck yet. Anyone has an idea?

Thanks,
Naoya Horiguchi

[ 339.133960] ------------[ cut here ]------------
[ 339.134893] kernel BUG at /src/linux-dev/mm/mempolicy.c:1738!
[ 339.134893] invalid opcode: 0000 [#1] SMP
[ 339.134893] Modules linked in: stap_2acbad8c3ba47062dbdc6f227d00f8f4__1958(O) bnep bluetooth cfg80211 rfkill ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode i2c_piix4 pcspkr i2c_core virtio_balloon parport_pc parport serio_raw nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net floppy ata_generic pata_acpi
[ 339.134893] CPU: 2 PID: 2840 Comm: mbind_fuzz Tainted: G O 3.16.0-rc1-140619-1205-00003-g80aa6b64a44e #157
[ 339.134893] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 339.134893] task: ffff88007c133b60 ti: ffff88007dd28000 task.ti: ffff88007dd28000
[ 339.134893] RIP: 0010:[<ffffffff811ebfe0>] [<ffffffff811ebfe0>] policy_zonelist+0x50/0xb0
[ 339.134893] RSP: 0000:ffff88007dd2bcf8 EFLAGS: 00010293
[ 339.134893] RAX: 0000000000000000 RBX: ffff88007c133b60 RCX: 0000000000000000
[ 339.134893] RDX: 0000000000000002 RSI: ffff88011bd3fad0 RDI: 00000000000200da
[ 339.134893] RBP: ffff88007dd2bd00 R08: 0000000000000002 R09: 0000000000000002
[ 339.134893] R10: ffff88007d8f3958 R11: 0000000000000001 R12: 00000000000200da
[ 339.134893] R13: 0000000000000000 R14: ffff88011bd3fad0 R15: 0000000000000000
[ 339.134893] FS: 00007f457cf90740(0000) GS:ffff8800bec00000(0000) knlGS:0000000000000000
[ 339.134893] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 339.134893] CR2: 0000700000184000 CR3: 000000007959b000 CR4: 00000000000006e0
[ 339.134893] Stack:
[ 339.134893] ffff88007c133b60 ffff88007dd2bd68 ffffffff811eeac8 ffff88007c133b60
[ 339.134893] ffff88007c133b60 0000000000000000 000000020000000c 0000000000000000
[ 339.134893] ffff88007c387e60 ffff88007c387e60 ffffea0000e19340 ffff88007d8f3958
[ 339.134893] Call Trace:
[ 339.134893] [<ffffffff811eeac8>] alloc_pages_vma+0x88/0x1a0
[ 339.134893] [<ffffffff811eec7b>] new_vma_page+0x9b/0xb0
[ 339.134893] [<ffffffff811fee4d>] unmap_and_move+0x3d/0x200
[ 339.134893] [<ffffffff811ff235>] migrate_pages+0xe5/0x1e0
[ 339.134893] [<ffffffff811eebe0>] ? alloc_pages_vma+0x1a0/0x1a0
[ 339.134893] [<ffffffff811ef3c2>] do_mbind+0x1f2/0x3a0
[ 339.134893] [<ffffffff811ef60b>] SyS_mbind+0x9b/0xb0
[ 339.134893] [<ffffffff8174798b>] tracesys+0xdd/0xe2
[ 339.134893] Code: 63 d2 31 c0 85 db 48 8b 14 d5 00 2d d6 81 0f 95 c0 48 69 c0 20 22 01 00 5b 5d 48 8d 84 02 00 1d 00 00 c3 0f 1f 84 00 00 00 00 00 <0f> 0b 66 0f 1f 44 00 00 f6 46 06 02 75 12 89 fb 48 0f bf 56 08
[ 339.134893] RIP [<ffffffff811ebfe0>] policy_zonelist+0x50/0xb0
[ 339.134893] RSP <ffff88007dd2bcf8>
[ 339.178924] ---[ end trace 37c12438b6936769 ]---


2014-06-20 04:37:15

by Hugh Dickins

[permalink] [raw]
Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Thu, 19 Jun 2014, Naoya Horiguchi wrote:
> Hi,
>
> I triggered the following bug on v3.16-rc1 when I did mbind() testing
> where multiple processes repeat calling mbind() for a shared mapped file
> (causing pingpong of page migration.)

The shared mapped file on shmem/tmpfs? So involving shared policy stuff?

>
> In my investigation, it seems that some vma accidentally has vma->vm_start
> = 0, which makes new_vma_page() choose a wrong vma and results in breaking
> the assumption that the address passed to alloc_pages_vma() should be
> inside a given vma.

I've not heard of that before. What evidence led you there?

> I'm suspecting that mbind_range() do something wrong around vma handling,
> but I don't have enough luck yet. Anyone has an idea?

No idea at present.

Please send disassembly (objdump -d, or objdump -ld if you had DEBUG_INFO)
of policy_zonelist() - the Code line isn't enough to go on, since it just
shows where the BUG jumped to out-of-line, with no clue as to what might
be in the registers - thanks.

Hugh

>
> Thanks,
> Naoya Horiguchi
>
> [ 339.133960] ------------[ cut here ]------------
> [ 339.134893] kernel BUG at /src/linux-dev/mm/mempolicy.c:1738!
> [ 339.134893] invalid opcode: 0000 [#1] SMP
> [ 339.134893] Modules linked in: stap_2acbad8c3ba47062dbdc6f227d00f8f4__1958(O) bnep bluetooth cfg80211 rfkill ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode i2c_piix4 pcspkr i2c_core virtio_balloon parport_pc parport serio_raw nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net floppy ata_generic pata_acpi
> [ 339.134893] CPU: 2 PID: 2840 Comm: mbind_fuzz Tainted: G O 3.16.0-rc1-140619-1205-00003-g80aa6b64a44e #157
> [ 339.134893] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 339.134893] task: ffff88007c133b60 ti: ffff88007dd28000 task.ti: ffff88007dd28000
> [ 339.134893] RIP: 0010:[<ffffffff811ebfe0>] [<ffffffff811ebfe0>] policy_zonelist+0x50/0xb0
> [ 339.134893] RSP: 0000:ffff88007dd2bcf8 EFLAGS: 00010293
> [ 339.134893] RAX: 0000000000000000 RBX: ffff88007c133b60 RCX: 0000000000000000
> [ 339.134893] RDX: 0000000000000002 RSI: ffff88011bd3fad0 RDI: 00000000000200da
> [ 339.134893] RBP: ffff88007dd2bd00 R08: 0000000000000002 R09: 0000000000000002
> [ 339.134893] R10: ffff88007d8f3958 R11: 0000000000000001 R12: 00000000000200da
> [ 339.134893] R13: 0000000000000000 R14: ffff88011bd3fad0 R15: 0000000000000000
> [ 339.134893] FS: 00007f457cf90740(0000) GS:ffff8800bec00000(0000) knlGS:0000000000000000
> [ 339.134893] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 339.134893] CR2: 0000700000184000 CR3: 000000007959b000 CR4: 00000000000006e0
> [ 339.134893] Stack:
> [ 339.134893] ffff88007c133b60 ffff88007dd2bd68 ffffffff811eeac8 ffff88007c133b60
> [ 339.134893] ffff88007c133b60 0000000000000000 000000020000000c 0000000000000000
> [ 339.134893] ffff88007c387e60 ffff88007c387e60 ffffea0000e19340 ffff88007d8f3958
> [ 339.134893] Call Trace:
> [ 339.134893] [<ffffffff811eeac8>] alloc_pages_vma+0x88/0x1a0
> [ 339.134893] [<ffffffff811eec7b>] new_vma_page+0x9b/0xb0
> [ 339.134893] [<ffffffff811fee4d>] unmap_and_move+0x3d/0x200
> [ 339.134893] [<ffffffff811ff235>] migrate_pages+0xe5/0x1e0
> [ 339.134893] [<ffffffff811eebe0>] ? alloc_pages_vma+0x1a0/0x1a0
> [ 339.134893] [<ffffffff811ef3c2>] do_mbind+0x1f2/0x3a0
> [ 339.134893] [<ffffffff811ef60b>] SyS_mbind+0x9b/0xb0
> [ 339.134893] [<ffffffff8174798b>] tracesys+0xdd/0xe2
> [ 339.134893] Code: 63 d2 31 c0 85 db 48 8b 14 d5 00 2d d6 81 0f 95 c0 48 69 c0 20 22 01 00 5b 5d 48 8d 84 02 00 1d 00 00 c3 0f 1f 84 00 00 00 00 00 <0f> 0b 66 0f 1f 44 00 00 f6 46 06 02 75 12 89 fb 48 0f bf 56 08
> [ 339.134893] RIP [<ffffffff811ebfe0>] policy_zonelist+0x50/0xb0
> [ 339.134893] RSP <ffff88007dd2bcf8>
> [ 339.178924] ---[ end trace 37c12438b6936769 ]---

Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Thu, 19 Jun 2014, Naoya Horiguchi wrote:

> I'm suspecting that mbind_range() do something wrong around vma handling,
> but I don't have enough luck yet. Anyone has an idea?

Well memory policy data corrupted. This looks like you were trying to do
page migration via mbind()? Could we get some more details as to what is
going on here? Specifically the parameters passed to mbind would be
interesting.

2014-06-20 15:41:41

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Thu, Jun 19, 2014 at 09:35:48PM -0700, Hugh Dickins wrote:
> On Thu, 19 Jun 2014, Naoya Horiguchi wrote:
> > Hi,
> >
> > I triggered the following bug on v3.16-rc1 when I did mbind() testing
> > where multiple processes repeat calling mbind() for a shared mapped file
> > (causing pingpong of page migration.)
>
> The shared mapped file on shmem/tmpfs? So involving shared policy stuff?

Sorry if it was confusing, I used an ext4 file and mapped it to multiple
processes, and then let the processes call mbind() with arguments of
random address, random length, and random node.
And yes, I think it's memory policy thing too.

> >
> > In my investigation, it seems that some vma accidentally has vma->vm_start
> > = 0, which makes new_vma_page() choose a wrong vma and results in breaking
> > the assumption that the address passed to alloc_pages_vma() should be
> > inside a given vma.
>
> I've not heard of that before. What evidence led you there?

First of all, I checked the return value of get_vma_policy() when BUG() happens
(the BUG() was triggered when policy->mode is not MPOL_PREFERRED or MPOL_BIND,
so I thought that something happened on struct mempolicy.)
The result was weird, I got something like this:
policy->flags 0xffff, policy->mode 0x8801, policy->refcnt 0x1d2cd5a9

So I tried to see why this happened, and checked the argument vma of
page_address_in_vma(), then got vma->vm_start == 0 when the BUG() was triggered.
page_address_in_vma() returns the address of a given page if the address is
within the vma, so if vma's range is corrupted, the result is corrupted too.

Unfortunately, I'm not sure why this wrong vma happened, but when I check
the vma->vm_prev and vma->vm_next, I got something like this:

corrupted vma, vm_start:0, vm_end:700000083000
vma->vm_prev, vm_start:70000000d000, vm_end:7000003e5000
# mempolicy associated with this vma was like this:
# policy->flags 0, policy->mode 2, policy->refcnt 1
# so this vma seems fine.
vma->vm_next was NULL

so these vmas seems to be partially overlapped, that's why I suspected
vma might be split or merged in the wrong way.

> > I'm suspecting that mbind_range() do something wrong around vma handling,
> > but I don't have enough luck yet. Anyone has an idea?
>
> No idea at present.
>
> Please send disassembly (objdump -d, or objdump -ld if you had DEBUG_INFO)
> of policy_zonelist() - the Code line isn't enough to go on, since it just
> shows where the BUG jumped to out-of-line, with no clue as to what might
> be in the registers - thanks.

OK, it's like below:

ffffffff811ebf90 <policy_zonelist>:
policy_zonelist():
/src/linux-dev/mm/mempolicy.c:1720
ffffffff811ebf90: e8 9b d5 55 00 callq ffffffff81749530 <__fentry__>
ffffffff811ebf95: 55 push %rbp
ffffffff811ebf96: 48 89 e5 mov %rsp,%rbp
ffffffff811ebf99: 53 push %rbx
/src/linux-dev/mm/mempolicy.c:1721
ffffffff811ebf9a: 0f b7 46 04 movzwl 0x4(%rsi),%eax
ffffffff811ebf9e: 66 83 f8 01 cmp $0x1,%ax
ffffffff811ebfa2: 74 44 je ffffffff811ebfe8 <policy_zonelist+0x58>
ffffffff811ebfa4: 66 83 f8 02 cmp $0x2,%ax
ffffffff811ebfa8: 75 36 jne ffffffff811ebfe0 <policy_zonelist+0x50>
/src/linux-dev/mm/mempolicy.c:1733
ffffffff811ebfaa: 89 fb mov %edi,%ebx
ffffffff811ebfac: 81 e3 00 00 04 00 and $0x40000,%ebx
ffffffff811ebfb2: 75 56 jne ffffffff811ec00a <policy_zonelist+0x7a>
ffffffff811ebfb4: 48 63 d2 movslq %edx,%rdx
gfp_zonelist():
/src/linux-dev/include/linux/gfp.h:274
ffffffff811ebfb7: 31 c0 xor %eax,%eax
ffffffff811ebfb9: 85 db test %ebx,%ebx
policy_zonelist():
/src/linux-dev/mm/mempolicy.c:1740
ffffffff811ebfbb: 48 8b 14 d5 00 2d d6 mov -0x7e29d300(,%rdx,8),%rdx
ffffffff811ebfc2: 81
node_zonelist():
/src/linux-dev/include/linux/gfp.h:274
ffffffff811ebfc3: 0f 95 c0 setne %al
policy_zonelist():
/src/linux-dev/mm/mempolicy.c:1740
ffffffff811ebfc6: 48 69 c0 20 22 01 00 imul $0x12220,%rax,%rax
/src/linux-dev/mm/mempolicy.c:1741
ffffffff811ebfcd: 5b pop %rbx
ffffffff811ebfce: 5d pop %rbp
/src/linux-dev/mm/mempolicy.c:1740
ffffffff811ebfcf: 48 8d 84 02 00 1d 00 lea 0x1d00(%rdx,%rax,1),%rax
ffffffff811ebfd6: 00
/src/linux-dev/mm/mempolicy.c:1741
ffffffff811ebfd7: c3 retq
ffffffff811ebfd8: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
ffffffff811ebfdf: 00
/src/linux-dev/mm/mempolicy.c:1738
ffffffff811ebfe0: 0f 0b ud2
ffffffff811ebfe2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
/src/linux-dev/mm/mempolicy.c:1723
ffffffff811ebfe8: f6 46 06 02 testb $0x2,0x6(%rsi)
ffffffff811ebfec: 75 12 jne ffffffff811ec000 <policy_zonelist+0x70>
ffffffff811ebfee: 89 fb mov %edi,%ebx
ffffffff811ebff0: 48 0f bf 56 08 movswq 0x8(%rsi),%rdx
ffffffff811ebff5: 81 e3 00 00 04 00 and $0x40000,%ebx
ffffffff811ebffb: eb ba jmp ffffffff811ebfb7 <policy_zonelist+0x27>
ffffffff811ebffd: 0f 1f 00 nopl (%rax)
ffffffff811ec000: 81 e7 00 00 04 00 and $0x40000,%edi
ffffffff811ec006: 89 fb mov %edi,%ebx
ffffffff811ec008: eb aa jmp ffffffff811ebfb4 <policy_zonelist+0x24>
/src/linux-dev/mm/mempolicy.c:1734
ffffffff811ec00a: 48 8d 7e 08 lea 0x8(%rsi),%rdi
ffffffff811ec00e: 48 63 d2 movslq %edx,%rdx
variable_test_bit():
/src/linux-dev/arch/x86/include/asm/bitops.h:318
ffffffff811ec011: 48 0f a3 56 08 bt %rdx,0x8(%rsi)
ffffffff811ec016: 19 c0 sbb %eax,%eax
policy_zonelist():
/src/linux-dev/mm/mempolicy.c:1733
ffffffff811ec018: 85 c0 test %eax,%eax
ffffffff811ec01a: 75 9b jne ffffffff811ebfb7 <policy_zonelist+0x27>
__first_node():
/src/linux-dev/include/linux/nodemask.h:248
ffffffff811ec01c: be 00 04 00 00 mov $0x400,%esi
ffffffff811ec021: e8 9a ae 1b 00 callq ffffffff813a6ec0 <find_first_bit>
ffffffff811ec026: ba 00 04 00 00 mov $0x400,%edx
ffffffff811ec02b: 3d 00 04 00 00 cmp $0x400,%eax
ffffffff811ec030: 0f 4e d0 cmovle %eax,%edx
ffffffff811ec033: 48 63 d2 movslq %edx,%rdx
ffffffff811ec036: e9 7c ff ff ff jmpq ffffffff811ebfb7 <policy_zonelist+0x27>
policy_zonelist():
ffffffff811ec03b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)


Thanks,
Naoya Horiguchi

2014-06-20 19:46:58

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Fri, Jun 20, 2014 at 09:24:36AM -0500, Christoph Lameter wrote:
> On Thu, 19 Jun 2014, Naoya Horiguchi wrote:
>
> > I'm suspecting that mbind_range() do something wrong around vma handling,
> > but I don't have enough luck yet. Anyone has an idea?
>
> Well memory policy data corrupted. This looks like you were trying to do
> page migration via mbind()?

Right.

> Could we get some more details as to what is
> going on here? Specifically the parameters passed to mbind would be
> interesting.

My view about the kernel behavior was in another email a few hours ago.
And as for what userspace did, I attach the reproducer below. It's simply
doing mbind(mode=MPOL_BIND, flags=MPOL_MF_MOVE_ALL) on random address/length/node.

What I did to trigger the bug is like below:

while true ; do
dd if=/dev/urandom of=testfile bs=4096 count=1000
for i in $(seq 10) ; do
./mbind_bug_reproducer testfile > /dev/null &
done
sleep 3
pkill -SIGUSR1 -f mbind_bug_reproducer
done

mbind_bug_reproducer.c
---
#include <stdio.h>
#include <signal.h>
#include <sys/mman.h>
#include <numa.h>
#include <numaif.h>
#include <fcntl.h>

#define ADDR_INPUT 0x700000000000
#define PS 4096
#define err(x) perror(x),exit(EXIT_FAILURE)
#define errmsg(x, ...) fprintf(stderr, x, ##__VA_ARGS__),exit(EXIT_FAILURE)

int flag = 1;

void sig_handle_flag(int signo) { flag = 0; }

void set_new_nodes(struct bitmask *mask, unsigned long node) {
numa_bitmask_clearall(mask);
numa_bitmask_setbit(mask, node);
}

int main(int argc, char *argv[]) {
int nr = 1000;
int fd = -1;
char *pfile;
struct timeval tv;
struct bitmask *nodes;
unsigned long nr_nodes;
unsigned long memsize = nr * PS;

nr_nodes = numa_max_node() + 1; /* numa_num_possible_nodes(); */
nodes = numa_bitmask_alloc(nr_nodes);
if (nr_nodes < 2)
errmsg("A minimum of 2 nodes is required for this test.\n");

gettimeofday(&tv, NULL);
srandom(tv.tv_usec);

fd = open(argv[1], O_RDWR, S_IRWXU);
if (fd < 0)
err("open");
pfile = mmap((void *)ADDR_INPUT, memsize, PROT_READ|PROT_WRITE,
MAP_SHARED, fd, 0);
if (pfile == (void*)-1L)
err("mmap");

signal(SIGUSR1, sig_handle_flag);

while (flag) {
int node;
unsigned long offset;
unsigned long length;

memset(pfile, 'a', memsize);

node = random() % nr_nodes;
set_new_nodes(nodes, random() & nr_nodes);
offset = (random() % nr) * PS;
length = (random() % (nr - offset/PS)) * PS;
printf("[%d] node:%x, offset:%x, length:%x\n",
getpid(), node, offset, length);
mbind(pfile + offset, length, MPOL_BIND, nodes->maskp,
nodes->size + 1, MPOL_MF_MOVE_ALL);
}

munmap(pfile, memsize);
return 0;
}

2014-06-20 20:05:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Fri, 20 Jun 2014, Naoya Horiguchi wrote:
> On Fri, Jun 20, 2014 at 09:24:36AM -0500, Christoph Lameter wrote:
> > On Thu, 19 Jun 2014, Naoya Horiguchi wrote:
> >
> > > I'm suspecting that mbind_range() do something wrong around vma handling,
> > > but I don't have enough luck yet. Anyone has an idea?
> >
> > Well memory policy data corrupted. This looks like you were trying to do
> > page migration via mbind()?
>
> Right.
>
> > Could we get some more details as to what is
> > going on here? Specifically the parameters passed to mbind would be
> > interesting.
>
> My view about the kernel behavior was in another email a few hours ago.
> And as for what userspace did, I attach the reproducer below. It's simply
> doing mbind(mode=MPOL_BIND, flags=MPOL_MF_MOVE_ALL) on random address/length/node.

Thanks for the additional information earlier. ext4, so no shmem
shared mempolicy involved: that cuts down the bugspace considerably.

I agree from what you said that it looked like corrupt vm_area_struct
and hence corrupt policy.

Here's an obvious patch to try, entirely untested - thanks for the
reproducer, but I'd rather leave the testing to you. Sounds like
you have a useful fuzzer there: good catch.


[PATCH] mm: fix crashes from mbind() merging vmas

v2.6.34's 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced
vma merging to mbind(), but it should have also changed the convention
of passing start vma from queue_pages_range() (formerly check_range())
to new_vma_page(): vma merging may have already freed that structure,
resulting in BUG at mm/mempolicy.c:1738 and probably worse crashes.

Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
Reported-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # 2.6.34+
---

mm/mempolicy.c | 46 ++++++++++++++++++++--------------------------
1 file changed, 20 insertions(+), 26 deletions(-)

--- 3.16-rc1/mm/mempolicy.c 2014-06-16 00:28:55.116076530 -0700
+++ linux/mm/mempolicy.c 2014-06-20 12:40:00.000204558 -0700
@@ -656,19 +656,18 @@ static unsigned long change_prot_numa(st
* @nodes and @flags,) it's isolated and queued to the pagelist which is
* passed via @private.)
*/
-static struct vm_area_struct *
+static int
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
const nodemask_t *nodes, unsigned long flags, void *private)
{
- int err;
- struct vm_area_struct *first, *vma, *prev;
-
+ int err = 0;
+ struct vm_area_struct *vma, *prev;

- first = find_vma(mm, start);
- if (!first)
- return ERR_PTR(-EFAULT);
+ vma = find_vma(mm, start);
+ if (!vma)
+ return -EFAULT;
prev = NULL;
- for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+ for (; vma && vma->vm_start < end; vma = vma->vm_next) {
unsigned long endvma = vma->vm_end;

if (endvma > end)
@@ -678,9 +677,9 @@ queue_pages_range(struct mm_struct *mm,

if (!(flags & MPOL_MF_DISCONTIG_OK)) {
if (!vma->vm_next && vma->vm_end < end)
- return ERR_PTR(-EFAULT);
+ return -EFAULT;
if (prev && prev->vm_end < vma->vm_start)
- return ERR_PTR(-EFAULT);
+ return -EFAULT;
}

if (flags & MPOL_MF_LAZY) {
@@ -694,15 +693,13 @@ queue_pages_range(struct mm_struct *mm,

err = queue_pages_pgd_range(vma, start, endvma, nodes,
flags, private);
- if (err) {
- first = ERR_PTR(err);
+ if (err)
break;
- }
}
next:
prev = vma;
}
- return first;
+ return err;
}

/*
@@ -1156,16 +1153,17 @@ out:

/*
* Allocate a new page for page migration based on vma policy.
- * Start assuming that page is mapped by vma pointed to by @private.
+ * Start by assuming the page is mapped by the same vma as contains @start.
* Search forward from there, if not. N.B., this assumes that the
* list of pages handed to migrate_pages()--which is how we get here--
* is in virtual address order.
*/
-static struct page *new_vma_page(struct page *page, unsigned long private, int **x)
+static struct page *new_page(struct page *page, unsigned long start, int **x)
{
- struct vm_area_struct *vma = (struct vm_area_struct *)private;
+ struct vm_area_struct *vma;
unsigned long uninitialized_var(address);

+ vma = find_vma(current->mm, start);
while (vma) {
address = page_address_in_vma(page, vma);
if (address != -EFAULT)
@@ -1195,7 +1193,7 @@ int do_migrate_pages(struct mm_struct *m
return -ENOSYS;
}

-static struct page *new_vma_page(struct page *page, unsigned long private, int **x)
+static struct page *new_page(struct page *page, unsigned long start, int **x)
{
return NULL;
}
@@ -1205,7 +1203,6 @@ static long do_mbind(unsigned long start
unsigned short mode, unsigned short mode_flags,
nodemask_t *nmask, unsigned long flags)
{
- struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
struct mempolicy *new;
unsigned long end;
@@ -1271,11 +1268,9 @@ static long do_mbind(unsigned long start
if (err)
goto mpol_out;

- vma = queue_pages_range(mm, start, end, nmask,
+ err = queue_pages_range(mm, start, end, nmask,
flags | MPOL_MF_INVERT, &pagelist);
-
- err = PTR_ERR(vma); /* maybe ... */
- if (!IS_ERR(vma))
+ if (!err)
err = mbind_range(mm, start, end, new);

if (!err) {
@@ -1283,9 +1278,8 @@ static long do_mbind(unsigned long start

if (!list_empty(&pagelist)) {
WARN_ON_ONCE(flags & MPOL_MF_LAZY);
- nr_failed = migrate_pages(&pagelist, new_vma_page,
- NULL, (unsigned long)vma,
- MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
+ nr_failed = migrate_pages(&pagelist, new_page, NULL,
+ start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
if (nr_failed)
putback_movable_pages(&pagelist);
}

Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Fri, 20 Jun 2014, Hugh Dickins wrote:

> [PATCH] mm: fix crashes from mbind() merging vmas
>
> v2.6.34's 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced
> vma merging to mbind(), but it should have also changed the convention
> of passing start vma from queue_pages_range() (formerly check_range())
> to new_vma_page(): vma merging may have already freed that structure,
> resulting in BUG at mm/mempolicy.c:1738 and probably worse crashes.

Good catch. Cannot find fault with what I see.

Acked-by: Christoph Lameter <[email protected]>

2014-06-20 21:13:08

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Fri, Jun 20, 2014 at 01:03:58PM -0700, Hugh Dickins wrote:
> On Fri, 20 Jun 2014, Naoya Horiguchi wrote:
> > On Fri, Jun 20, 2014 at 09:24:36AM -0500, Christoph Lameter wrote:
> > > On Thu, 19 Jun 2014, Naoya Horiguchi wrote:
> > >
> > > > I'm suspecting that mbind_range() do something wrong around vma handling,
> > > > but I don't have enough luck yet. Anyone has an idea?
> > >
> > > Well memory policy data corrupted. This looks like you were trying to do
> > > page migration via mbind()?
> >
> > Right.
> >
> > > Could we get some more details as to what is
> > > going on here? Specifically the parameters passed to mbind would be
> > > interesting.
> >
> > My view about the kernel behavior was in another email a few hours ago.
> > And as for what userspace did, I attach the reproducer below. It's simply
> > doing mbind(mode=MPOL_BIND, flags=MPOL_MF_MOVE_ALL) on random address/length/node.
>
> Thanks for the additional information earlier. ext4, so no shmem
> shared mempolicy involved: that cuts down the bugspace considerably.
>
> I agree from what you said that it looked like corrupt vm_area_struct
> and hence corrupt policy.
>
> Here's an obvious patch to try, entirely untested - thanks for the
> reproducer, but I'd rather leave the testing to you. Sounds like
> you have a useful fuzzer there: good catch.
>
>
> [PATCH] mm: fix crashes from mbind() merging vmas
>
> v2.6.34's 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced
> vma merging to mbind(), but it should have also changed the convention
> of passing start vma from queue_pages_range() (formerly check_range())
> to new_vma_page(): vma merging may have already freed that structure,
> resulting in BUG at mm/mempolicy.c:1738 and probably worse crashes.
>
> Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
> Reported-by: Naoya Horiguchi <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected] # 2.6.34+

With your patch, the bug doesn't reproduce in one hour testing. I think
it's long enough because it took only a few minutes until the reproducer
triggered the bug without your patch.
So I think the problem was gone, thank you very much!

Tested-by: Naoya Horiguchi <[email protected]>

> ---
>
> mm/mempolicy.c | 46 ++++++++++++++++++++--------------------------
> 1 file changed, 20 insertions(+), 26 deletions(-)
>
> --- 3.16-rc1/mm/mempolicy.c 2014-06-16 00:28:55.116076530 -0700
> +++ linux/mm/mempolicy.c 2014-06-20 12:40:00.000204558 -0700
> @@ -656,19 +656,18 @@ static unsigned long change_prot_numa(st
> * @nodes and @flags,) it's isolated and queued to the pagelist which is
> * passed via @private.)
> */
> -static struct vm_area_struct *
> +static int
> queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
> const nodemask_t *nodes, unsigned long flags, void *private)
> {
> - int err;
> - struct vm_area_struct *first, *vma, *prev;
> -
> + int err = 0;
> + struct vm_area_struct *vma, *prev;
>
> - first = find_vma(mm, start);
> - if (!first)
> - return ERR_PTR(-EFAULT);
> + vma = find_vma(mm, start);
> + if (!vma)
> + return -EFAULT;
> prev = NULL;
> - for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
> + for (; vma && vma->vm_start < end; vma = vma->vm_next) {
> unsigned long endvma = vma->vm_end;
>
> if (endvma > end)
> @@ -678,9 +677,9 @@ queue_pages_range(struct mm_struct *mm,
>
> if (!(flags & MPOL_MF_DISCONTIG_OK)) {
> if (!vma->vm_next && vma->vm_end < end)
> - return ERR_PTR(-EFAULT);
> + return -EFAULT;
> if (prev && prev->vm_end < vma->vm_start)
> - return ERR_PTR(-EFAULT);
> + return -EFAULT;
> }
>
> if (flags & MPOL_MF_LAZY) {
> @@ -694,15 +693,13 @@ queue_pages_range(struct mm_struct *mm,
>
> err = queue_pages_pgd_range(vma, start, endvma, nodes,
> flags, private);
> - if (err) {
> - first = ERR_PTR(err);
> + if (err)
> break;
> - }
> }
> next:
> prev = vma;
> }
> - return first;
> + return err;
> }
>
> /*
> @@ -1156,16 +1153,17 @@ out:
>
> /*
> * Allocate a new page for page migration based on vma policy.
> - * Start assuming that page is mapped by vma pointed to by @private.
> + * Start by assuming the page is mapped by the same vma as contains @start.
> * Search forward from there, if not. N.B., this assumes that the
> * list of pages handed to migrate_pages()--which is how we get here--
> * is in virtual address order.
> */
> -static struct page *new_vma_page(struct page *page, unsigned long private, int **x)
> +static struct page *new_page(struct page *page, unsigned long start, int **x)
> {
> - struct vm_area_struct *vma = (struct vm_area_struct *)private;
> + struct vm_area_struct *vma;
> unsigned long uninitialized_var(address);
>
> + vma = find_vma(current->mm, start);
> while (vma) {
> address = page_address_in_vma(page, vma);
> if (address != -EFAULT)
> @@ -1195,7 +1193,7 @@ int do_migrate_pages(struct mm_struct *m
> return -ENOSYS;
> }
>
> -static struct page *new_vma_page(struct page *page, unsigned long private, int **x)
> +static struct page *new_page(struct page *page, unsigned long start, int **x)
> {
> return NULL;
> }
> @@ -1205,7 +1203,6 @@ static long do_mbind(unsigned long start
> unsigned short mode, unsigned short mode_flags,
> nodemask_t *nmask, unsigned long flags)
> {
> - struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
> struct mempolicy *new;
> unsigned long end;
> @@ -1271,11 +1268,9 @@ static long do_mbind(unsigned long start
> if (err)
> goto mpol_out;
>
> - vma = queue_pages_range(mm, start, end, nmask,
> + err = queue_pages_range(mm, start, end, nmask,
> flags | MPOL_MF_INVERT, &pagelist);
> -
> - err = PTR_ERR(vma); /* maybe ... */
> - if (!IS_ERR(vma))
> + if (!err)
> err = mbind_range(mm, start, end, new);
>
> if (!err) {
> @@ -1283,9 +1278,8 @@ static long do_mbind(unsigned long start
>
> if (!list_empty(&pagelist)) {
> WARN_ON_ONCE(flags & MPOL_MF_LAZY);
> - nr_failed = migrate_pages(&pagelist, new_vma_page,
> - NULL, (unsigned long)vma,
> - MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
> + nr_failed = migrate_pages(&pagelist, new_page, NULL,
> + start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
> if (nr_failed)
> putback_movable_pages(&pagelist);
> }
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2014-06-20 21:43:50

by Hugh Dickins

[permalink] [raw]
Subject: Re: kernel BUG at /src/linux-dev/mm/mempolicy.c:1738! on v3.16-rc1

On Fri, 20 Jun 2014, Naoya Horiguchi wrote:
> On Fri, Jun 20, 2014 at 01:03:58PM -0700, Hugh Dickins wrote:
> > On Fri, 20 Jun 2014, Naoya Horiguchi wrote:
> > > On Fri, Jun 20, 2014 at 09:24:36AM -0500, Christoph Lameter wrote:
> > > > On Thu, 19 Jun 2014, Naoya Horiguchi wrote:
> > > >
> > > > > I'm suspecting that mbind_range() do something wrong around vma handling,
> > > > > but I don't have enough luck yet. Anyone has an idea?
> > > >
> > > > Well memory policy data corrupted. This looks like you were trying to do
> > > > page migration via mbind()?
> > >
> > > Right.
> > >
> > > > Could we get some more details as to what is
> > > > going on here? Specifically the parameters passed to mbind would be
> > > > interesting.
> > >
> > > My view about the kernel behavior was in another email a few hours ago.
> > > And as for what userspace did, I attach the reproducer below. It's simply
> > > doing mbind(mode=MPOL_BIND, flags=MPOL_MF_MOVE_ALL) on random address/length/node.
> >
> > Thanks for the additional information earlier. ext4, so no shmem
> > shared mempolicy involved: that cuts down the bugspace considerably.
> >
> > I agree from what you said that it looked like corrupt vm_area_struct
> > and hence corrupt policy.
> >
> > Here's an obvious patch to try, entirely untested - thanks for the
> > reproducer, but I'd rather leave the testing to you. Sounds like
> > you have a useful fuzzer there: good catch.
> >
> >
> > [PATCH] mm: fix crashes from mbind() merging vmas
> >
> > v2.6.34's 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced
> > vma merging to mbind(), but it should have also changed the convention
> > of passing start vma from queue_pages_range() (formerly check_range())
> > to new_vma_page(): vma merging may have already freed that structure,
> > resulting in BUG at mm/mempolicy.c:1738 and probably worse crashes.
> >
> > Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
> > Reported-by: Naoya Horiguchi <[email protected]>
> > Signed-off-by: Hugh Dickins <[email protected]>
> > Cc: [email protected] # 2.6.34+
>
> With your patch, the bug doesn't reproduce in one hour testing. I think
> it's long enough because it took only a few minutes until the reproducer
> triggered the bug without your patch.
> So I think the problem was gone, thank you very much!
>
> Tested-by: Naoya Horiguchi <[email protected]>
& Acked-by: Christoph Lameter <[email protected]>

Great, thank you both. I was afraid you might hit something else
immediately. I'll send the patch to Andrew later - unless it
magically appears in his tree before.

Hugh