2014-04-08 14:37:27

by Sasha Levin

Subject: mm: kernel BUG at mm/huge_memory.c:1829!

Hi all,

While fuzzing with trinity inside a KVM tools guest running the latest -next
kernel, I've stumbled on the following:

[ 1275.253114] kernel BUG at mm/huge_memory.c:1829!
[ 1275.253642] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1275.254775] Dumping ftrace buffer:
[ 1275.255631] (ftrace buffer empty)
[ 1275.256440] Modules linked in:
[ 1275.257347] CPU: 20 PID: 22807 Comm: trinity-c299 Not tainted 3.14.0-next-20140407-sasha-00023-gd35b0d6 #382
[ 1275.258686] task: ffff8803e7873000 ti: ffff8803e7896000 task.ti: ffff8803e7896000
[ 1275.259416] RIP: __split_huge_page (mm/huge_memory.c:1829 (discriminator 1))
[ 1275.260527] RSP: 0018:ffff8803e7897bb8 EFLAGS: 00010297
[ 1275.261323] RAX: 000000000000012c RBX: ffff8803e789d600 RCX: 0000000000000006
[ 1275.261323] RDX: 0000000000005b80 RSI: ffff8803e7873d00 RDI: 0000000000000282
[ 1275.261323] RBP: ffff8803e7897c68 R08: 0000000000000000 R09: 0000000000000000
[ 1275.261323] R10: 0000000000000001 R11: 30303320746e756f R12: 0000000000000000
[ 1275.261323] R13: 0000000000a00000 R14: ffff8803ede73000 R15: ffffea0010030000
[ 1275.261323] FS: 00007f899d23f700(0000) GS:ffff880437000000(0000) knlGS:0000000000000000
[ 1275.261323] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1275.261323] CR2: 00000000024cf048 CR3: 00000003e787f000 CR4: 00000000000006a0
[ 1275.261323] DR0: 0000000000696000 DR1: 0000000000696000 DR2: 0000000000000000
[ 1275.261323] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1275.261323] Stack:
[ 1275.261323] ffff8803e7897bd8 ffff880024dab898 ffff8803e7897bd8 ffffffffac1bea0e
[ 1275.261323] ffff8803e7897c28 0000000000000282 00000014b06cc072 0000000000000000
[ 1275.261323] 0000012be7897c28 0000000000000a00 ffff880024dab8d0 ffff880024dab898
[ 1275.261323] Call Trace:
[ 1275.261323] ? put_lock_stats.isra.12 (arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[ 1275.261323] ? down_write (kernel/locking/rwsem.c:51 (discriminator 2))
[ 1275.261323] ? split_huge_page_to_list (mm/huge_memory.c:1874)
[ 1275.261323] split_huge_page_to_list (include/linux/vmstat.h:37 mm/huge_memory.c:1879)
[ 1275.261323] __split_huge_page_pmd (mm/huge_memory.c:2811)
[ 1275.261323] ? mutex_unlock (kernel/locking/mutex.c:220)
[ 1275.261323] ? __mutex_unlock_slowpath (arch/x86/include/asm/paravirt.h:809 kernel/locking/mutex.c:713 kernel/locking/mutex.c:722)
[ 1275.261323] ? get_parent_ip (kernel/sched/core.c:2471)
[ 1275.261323] ? preempt_count_sub (kernel/sched/core.c:2526)
[ 1275.261323] follow_page_mask (mm/memory.c:1518 (discriminator 1))
[ 1275.261323] SYSC_move_pages (mm/migrate.c:1227 mm/migrate.c:1353 mm/migrate.c:1508)
[ 1275.261323] ? SYSC_move_pages (include/linux/rcupdate.h:800 mm/migrate.c:1472)
[ 1275.261323] ? sched_clock_local (kernel/sched/clock.c:213)
[ 1275.261323] SyS_move_pages (mm/migrate.c:1456)
[ 1275.261323] tracesys (arch/x86/kernel/entry_64.S:749)
[ 1275.261323] Code: c0 01 39 45 94 74 18 41 8b 57 18 48 c7 c7 90 5e 6d b0 31 c0 8b 75 94 83 c2 01 e8 3d 6a 23 03 41 8b 47 18 83 c0 01 39 45 94 74 02 <0f> 0b 49 8b 07 48 89 c2 48 c1 e8 34 83 e0 03 48 c1 ea 36 4c 8d
[ 1275.261323] RIP __split_huge_page (mm/huge_memory.c:1829 (discriminator 1))
[ 1275.261323] RSP <ffff8803e7897bb8>

Looking at the code, there was supposed to be a printk printing both
mapcounts if they're different. However, there was no matching entry
in the log for that.


Thanks,
Sasha


2014-04-10 08:46:01

by Bob Liu

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

On Tue, Apr 8, 2014 at 10:37 PM, Sasha Levin <[email protected]> wrote:
> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel, I've stumbled on the following:
>

Wow! There have been so many huge-memory-related bugs recently.
AFAIR, several of them are still unfixed. Is there any place where we
can track these bugs, so they don't get lost in the mailing list?
This link seems to be out of date:
http://codemonkey.org.uk/projects/trinity/bugs-unfixed.php

Thanks,
-Bob

> [ 1275.253114] kernel BUG at mm/huge_memory.c:1829!
> [ ... full register dump and call trace snipped; quoted in full in the original report ... ]
>
> Looking at the code, there was supposed to be a printk printing both
> mapcounts if they're different. However, there was no matching entry
> in the log for that.
>
>
> Thanks,
> Sasha

2014-04-10 10:25:45

by Kirill A. Shutemov

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

On Tue, Apr 08, 2014 at 10:37:05AM -0400, Sasha Levin wrote:
> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel, I've stumbled on the following:
>
> [ 1275.253114] kernel BUG at mm/huge_memory.c:1829!
> [ ... full register dump and call trace snipped; quoted in full in the original report ... ]
>
> Looking at the code, there was supposed to be a printk printing both
> mapcounts if they're different. However, there was no matching entry
> in the log for that.

We had this bug trigger a year ago[1], and I tried to ask whether it
could be caused by chaining the same_anon_vma list with an interval
tree[2].

It's not obvious to me how the new interval-tree implementation of anon
rmap guarantees the anon_vma_interval_tree_foreach() vs.
anon_vma_interval_tree_insert() ordering behaviour that
__split_huge_page() expects.

Michel, could you clarify that?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=923817
[2] http://article.gmane.org/gmane.linux.kernel.mm/96518

--
Kirill A. Shutemov

2014-04-10 11:26:03

by Kirill A. Shutemov

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

On Tue, Apr 08, 2014 at 10:37:05AM -0400, Sasha Levin wrote:
> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel, I've stumbled on the following:
>
> [ 1275.253114] kernel BUG at mm/huge_memory.c:1829!
> [ ... full register dump and call trace snipped; quoted in full in the original report ... ]
>
> Looking at the code, there was supposed to be a printk printing both
> mapcounts if they're different. However, there was no matching entry
> in the log for that.

And I was able to trigger it the other way around:

[ 156.918770] mapcount 2 page_mapcount 3

But no BUG_ON().

--
Kirill A. Shutemov

2014-04-10 13:45:04

by Kirill A. Shutemov

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

On Thu, Apr 10, 2014 at 01:25:27PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 08, 2014 at 10:37:05AM -0400, Sasha Levin wrote:
> > Hi all,
> >
> > While fuzzing with trinity inside a KVM tools guest running the latest -next
> > kernel, I've stumbled on the following:
> >
> > [ 1275.253114] kernel BUG at mm/huge_memory.c:1829!
> > [ ... full register dump and call trace snipped; quoted in full in the original report ... ]
> >
> > Looking at the code, there was supposed to be a printk printing both
> > mapcounts if they're different. However, there was no matching entry
> > in the log for that.
>
> We had this bug trigger a year ago[1], and I tried to ask whether it
> could be caused by chaining the same_anon_vma list with an interval
> tree[2].
>
> It's not obvious to me how the new interval-tree implementation of
> anon rmap guarantees the anon_vma_interval_tree_foreach() vs.
> anon_vma_interval_tree_insert() ordering behaviour that
> __split_huge_page() expects.
>
> Michel, could you clarify that?
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=923817
> [2] http://article.gmane.org/gmane.linux.kernel.mm/96518

Okay, below is my attempt to fix the bug. I'm not entirely sure it's
correct. Andrea, could you take a look?

I used this program to reproduce the issue:

#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <stdio.h>

#define SIZE (2*1024*1024)

int main()
{
	char *x;

	for (;;) {
		x = mmap(NULL, SIZE + SIZE - 4096, PROT_READ|PROT_WRITE,
				MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		if (x == MAP_FAILED)
			perror("error"), exit(1);
		x[SIZE] = 0;
		if (!fork()) {
			_exit(0);
		}
		if (!fork()) {
			if (!fork()) {
				_exit(0);
			}
			mprotect(x + SIZE, 4096, PROT_NONE);
			_exit(0);
		}
		mprotect(x + SIZE, 4096, PROT_NONE);
		munmap(x, SIZE + SIZE - 4096);
		while (waitpid(-1, NULL, WNOHANG) > 0)
			;
	}
	return 0;
}

From 1b3051b8613de3e55f1062ec0b8914d838e7c266 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 10 Apr 2014 15:41:04 +0300
Subject: [PATCH] mm, thp: fix race between split huge page and insert into
anon_vma tree

__split_huge_page() assumes that iterating over the anon_vma will catch
all VMAs the page belongs to. The assumption relies on new VMAs being
added to the tail of the VMA list, so that list_for_each_entry() can
catch them.

Commit bf181b9f9d8d replaced the same_anon_vma linked list with an
interval tree and, I believe, broke that assumption.

Let's retry the walk over the huge page's anon VMA tree if the number
of VMAs we found doesn't match page_mapcount().

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 64635f5278ff..6d868a13ca3c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1807,6 +1807,7 @@ static void __split_huge_page(struct page *page,
 	BUG_ON(PageTail(page));
 
 	mapcount = 0;
+retry:
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long addr = vma_address(page, vma);
@@ -1814,19 +1815,14 @@ static void __split_huge_page(struct page *page,
 		mapcount += __split_huge_page_splitting(page, vma, addr);
 	}
 	/*
-	 * It is critical that new vmas are added to the tail of the
-	 * anon_vma list. This guarantes that if copy_huge_pmd() runs
-	 * and establishes a child pmd before
-	 * __split_huge_page_splitting() freezes the parent pmd (so if
-	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
-	 * the newly established pmd of the child later during the
-	 * walk, to be able to set it as pmd_trans_splitting too.
+	 * There's chance that iteration over interval tree will race with
+	 * insert to it. Let's try catch new entries by retrying.
 	 */
-	if (mapcount != page_mapcount(page))
-		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+	if (mapcount != page_mapcount(page)) {
+		printk(KERN_DEBUG "mapcount %d page_mapcount %d\n",
 			mapcount, page_mapcount(page));
-	BUG_ON(mapcount != page_mapcount(page));
+		goto retry;
+	}
 
 	__split_huge_page_refcount(page, list);
 
@@ -1837,10 +1833,16 @@ static void __split_huge_page(struct page *page,
 		BUG_ON(is_vma_temporary_stack(vma));
 		mapcount2 += __split_huge_page_map(page, vma, addr);
 	}
-	if (mapcount != mapcount2)
+	/*
+	 * By the time __split_huge_page_refcount() called all PMDs should be
+	 * marked pmd_trans_splitting() and new mappings of the page shouldn't
+	 * be created or removed. If number of mappings is changed it's a BUG().
+	 */
+	if (mapcount != mapcount2) {
 		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
 			mapcount, mapcount2, page_mapcount(page));
-	BUG_ON(mapcount != mapcount2);
+		BUG();
+	}
 }
 
 /*
--
Kirill A. Shutemov

2014-04-10 14:38:42

by Dave Jones

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

On Thu, Apr 10, 2014 at 04:45:58PM +0800, Bob Liu wrote:
> On Tue, Apr 8, 2014 at 10:37 PM, Sasha Levin <[email protected]> wrote:
> > Hi all,
> >
> > While fuzzing with trinity inside a KVM tools guest running the latest -next
> > kernel, I've stumbled on the following:
> >
>
> Wow! There have been so many huge-memory-related bugs recently.
> AFAIR, several of them are still unfixed. Is there any place where we
> can track these bugs, so they don't get lost in the mailing list?
> This link seems to be out of date:
> http://codemonkey.org.uk/projects/trinity/bugs-unfixed.php

It got to be too much for me to track, tbh.
Perhaps this is one of the cases where using bugzilla.kernel.org might
be a useful thing?

Dave

2014-04-10 14:41:40

by Sasha Levin

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

On 04/10/2014 10:37 AM, Dave Jones wrote:
> On Thu, Apr 10, 2014 at 04:45:58PM +0800, Bob Liu wrote:
> > On Tue, Apr 8, 2014 at 10:37 PM, Sasha Levin <[email protected]> wrote:
> > > Hi all,
> > >
> > > While fuzzing with trinity inside a KVM tools guest running the latest -next
> > > kernel, I've stumbled on the following:
> > >
> >
> > Wow! There have been so many huge-memory-related bugs recently.
> > AFAIR, several of them are still unfixed. Is there any place where
> > we can track these bugs, so they don't get lost in the mailing list?
> > This link seems to be out of date:
> > http://codemonkey.org.uk/projects/trinity/bugs-unfixed.php
>
> It got to be too much for me to track, tbh.
> Perhaps this is one of the cases where using bugzilla.kernel.org
> might be a useful thing?

FWIW, I'd be happy to use something else besides having to track mail
threads and keep a list of "to-do open issues".

If folks want to start using bugzilla.kernel.org for mm/ issues I'll go
ahead and enter all the open issues in there.


Thanks,
Sasha

2014-04-10 16:28:09

by Andrea Arcangeli

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

Hi,

On Thu, Apr 10, 2014 at 04:44:36PM +0300, Kirill A. Shutemov wrote:
> Okay, below is my attempt to fix the bug. I'm not entirely sure it's
> correct. Andrea, could you take a look?

The possibility that the interval tree implicitly broke the walk order
of the anon_vma list didn't cross my mind; that's a very good catch!
Breakage of the rmap walk order can definitely explain that BUG_ON in
split_huge_page, which signals that a pte was missed by the rmap walk.

Because this bug only fired in split_huge_page, I guess you assumed I
introduced this dependency on ordering with THP. That's not actually
the case: there was already a place in the VM that depended on a
perfect rmap walk. This is all about providing a perfect rmap walk,
one where missing a pte is fatal. The other place that existed before
THP is migrate.c.

The other thing that could break the perfect rmap walk, in addition to
a wrong rmap-walk order, is the exec path, where we do an mremap
without a vma covering the destination range (no vma means no perfect
rmap walk is possible, as we need the vma to reach the pte). But that
was handled by other means (see invalid_migration_vma in migrate.c and
the other checks for is_vma_temporary_stack in huge_memory.c; THP
didn't need to handle it, but migrate.c did). The places using
is_vma_temporary_stack tell you where a perfect rmap walk is required,
and I'm not aware of locations other than these two.

split_huge_page may just be more pedantic about making sure a pte
wasn't missed (I haven't checked in detail how migrate.c would behave
in such a case; split_huge_page simply BUG()s).

So I doubt making a local fix to huge_memory.c is enough, at least
migrate.c (i.e. rmap_walk_anon) should be handled too somehow.

While I'm positive that the breakage of the rmap-walk order explains
the BUG_ON with trinity (as your forking testcase also shows), I'm
quite uncomfortable depending on a comparison against the atomic
mapcount, which is atomic for a reason, to decide whether the order
went wrong and we should repeat the loop because fork created a pte
we missed.

The commit message doesn't explain how you guarantee that mapcount
cannot change under us in a racy way. If we go with this fix, I'd
suggest explaining that crucial point, the safety of the
page->mapcount comparison in that code path, in the commit message. It
may be safe by other means! I'm not saying it's definitely unsafe, but
it's at least not obviously safe: the atomic mapcount can change while
it is being read, and there is no obvious explanation of how the check
is safe despite the value not being stable.

The other downside is that an infinite loop in kernel mode, with no
debug message printed, would make this less debuggable if a real
functional bug hits (one not caused by rmap-walk ordering breakage).

I assume the interval tree ordering cannot be fixed, but I'd recommend
looking closer into that possibility too before ruling it out.

Thanks!
Andrea

2014-04-14 14:42:38

by Kirill A. Shutemov

Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

On Thu, Apr 10, 2014 at 06:27:50PM +0200, Andrea Arcangeli wrote:
> Hi,
>
> On Thu, Apr 10, 2014 at 04:44:36PM +0300, Kirill A. Shutemov wrote:
> > Okay, below is my attempt to fix the bug. I'm not entirely sure it's
> > correct. Andrea, could you take a look?
>
> The possibility that the interval tree implicitly broke the walk
> order of the anon_vma list didn't cross my mind; that's a very good
> catch! Breakage of the rmap walk order can definitely explain that
> BUG_ON in split_huge_page, which signals that a pte was missed by
> the rmap walk.

I've spent a few days trying to understand the rmap code, and now I
think my patch is wrong.

I actually don't see where the walk-order requirement comes from. It
seems all operations (insert, remove, foreach) on an anon_vma are
serialized by anon_vma->root->rwsem. Andrea, could you explain this
to me?

The actual bug was introduced by me with the split PMD page table lock
patchset. The patch below should fix it; please review.

It also means it can't be the root cause of the other bug report[1],
since the split PMD lock was introduced in v3.13 and that report is
about v3.8.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=923817

From 8e0095daf2f0fac113c2e98369c7fc9ff4cde484 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Mon, 14 Apr 2014 16:05:33 +0300
Subject: [PATCH] thp: close race between split and zap huge pages

Sasha Levin has reported two THP BUGs[1][2]. I believe both have the
same root cause. Let's look at them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
It's the BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
From my testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to race between zap_huge_pmd() and
page_check_address_pmd(). page_check_address_pmd() misses PMD
which is under zap:

CPU0                                    CPU1
                                        zap_huge_pmd()
                                          pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
          /*
           * We check if PMD present without taking ptl: no
           * serialization against zap_huge_pmd(). We miss this PMD,
           * it's not accounted to 'mapcount' in __split_huge_page().
           */
          pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

                                          page_remove_rmap(page)
                                            atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in a similar way:

CPU0                                    CPU1
                                        zap_huge_pmd()
                                          pmdp_get_and_clear()
                                          page_remove_rmap(page)
                                            atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
          pmd_present(pmd) == 0  /* The same comment as above */

  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
        VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding the problem is pmd_present() check in mm_find_pmd()
without taking page table lock.

The bug was introduced by me in commit 117b0791ac42. Sorry for
that. :(
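For reference, this is roughly what mm_find_pmd() looked like at the time
(reconstructed from my reading of the source, so details may differ); the
final pmd_present() check happens with no ptl held:

```c
pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd = NULL;

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		goto out;

	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		goto out;

	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))		/* racy: ptl is not held here */
		pmd = NULL;
out:
	return pmd;
}
```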

Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
check under the page table lock.

Note that __page_check_address() does the same for PTE entries
if sync != 0.

[1] https://lkml.kernel.org/g/<[email protected]>
[2] https://lkml.kernel.org/g/<[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Cc: <[email protected]> #3.13+
---
mm/huge_memory.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5025709bb3b5..d02a83852ee9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
 			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	if (address & ~HPAGE_PMD_MASK)
 		return NULL;
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd)
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
 		return NULL;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, address);
+
 	*ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
--
Kirill A. Shutemov

2014-04-17 00:31:42

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!

Hi Kirill,

On Mon, Apr 14, 2014 at 05:42:18PM +0300, Kirill A. Shutemov wrote:
> I've spent a few days trying to understand the rmap code. And now I think my
> patch is wrong.
>
> I actually don't see where the walk order requirement comes from. It seems all
> operations (insert, remove, foreach) on an anon_vma are serialized with
> anon_vma->root->rwsem. Andrea, could you explain this for me?

It's true the locking protects and freezes the view of all anon_vma
structures associated with the page, but that only guarantees you not
to miss the vma. Not missing the vma is not enough. You can still miss
a pte during the rmap_walk if the order is wrong, because the pte/pmds
still move freely under the vmas (absent the PT lock and the
mmap_sem).

The problem is all MM operations that copy or move a page mapping
from a source to a destination vma (fork and mremap). They take the
anon_vma lock, insert the destination vma with the proper anon_vma
chains, and then they _drop_ the anon_vma lock; only later do they
start moving ptes and pmds around (taking the proper PT locks).

anon_vma -> src_vma -> dst_vma

If the order is like above (guaranteed before the interval tree was
introduced), and the rmap_walk of split_huge_page and migrate
encounters the source pte/pmd and splits/unmaps it before it gets
copied, then the copy or move will retain the processed state (regular
pte instead of trans_huge_pmd for split_huge_page, or migration pte for
migrate). If instead the pmd/pte was already copied by the time the
src_vma is scanned, then it will encounter the copy to process in the
dst_vma too. The rmap_walk can't miss a pte/pmd if the anon_vma chain
is walked in insertion order (i.e. older vma first).

anon_vma -> dst_vma -> src_vma

If the anon_vma walk order is reversed vs the insertion order, things
fall apart because you will scan dst_vma in split_huge_page while it
is still empty, find nothing, and then the trans_huge_pmd is moved or
copied from src_vma to dst_vma by the MM code holding only the PT lock
and mmap_sem for writing (we cannot hold those across the whole
duration of split_huge_page and migrate). So if the rmap_walk order is
not right, the rmap_walk can miss the contents of the dst_vma that was
still empty at the time it was processed.

If the interval tree walk order cannot be fixed without hurting the
computational complexity of the structure, a more black and white
fix could be to add an anon_vma templist, to be scanned in O(N) after
the interval tree has been scanned, where you add newly inserted vmas.
The templist shall then be flushed back to the interval tree only
after the pte/pmd mangling of the MM operation is completed. That
requires identifying the closure of the critical section for those
problematic MM operations. The main drawback is actually having to
take the anon_vma lock twice, the second time for the flush to the
interval tree.

Looping like in your previous patch would be much simpler if it could
be made reliable, but it looked like it wouldn't close the bug
entirely, because any concurrent unmap operation could lead to a false
negative hiding the pmd/pte walk miss (by decreasing page->_mapcount
under us).

Comments?

Thanks,
Andrea