2013-03-29 13:05:18

by Sasha Levin

Subject: mm: BUG in do_huge_pmd_wp_page

Hi all,

While fuzzing with trinity inside a KVM tools guest running latest -next kernel,
I've stumbled on the following.

It seems that the code in do_huge_pmd_wp_page() was recently modified in
"thp: do_huge_pmd_wp_page(): handle huge zero page".

Here's the trace:

[ 246.244708] BUG: unable to handle kernel paging request at ffff88009c422000
[ 246.245743] IP: [<ffffffff81a0a795>] copy_page_rep+0x5/0x10
[ 246.250569] PGD 7232067 PUD 7235067 PMD bfefe067 PTE 800000009c422060
[ 246.251529] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 246.252325] Dumping ftrace buffer:
[ 246.252791] (ftrace buffer empty)
[ 246.252869] Modules linked in:
[ 246.252869] CPU 3
[ 246.252869] Pid: 11985, comm: trinity-child12 Tainted: G W 3.9.0-rc4-next-20130328-sasha-00014-g91a3267 #319
[ 246.252869] RIP: 0010:[<ffffffff81a0a795>] [<ffffffff81a0a795>] copy_page_rep+0x5/0x10
[ 246.252869] RSP: 0018:ffff88000015bc40 EFLAGS: 00010286
[ 246.252869] RAX: ffff88000015bfd8 RBX: 0000000002710880 RCX: 0000000000000200
[ 246.252869] RDX: 0000000000000000 RSI: ffff88009c422000 RDI: ffff88009a422000
[ 246.252869] RBP: ffff88000015bc98 R08: 0000000002718000 R09: 0000000000000001
[ 246.252869] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880000000000
[ 246.252869] R13: ffff88000015bfd8 R14: ffff88000015bfd8 R15: fffffffffff80000
[ 246.252869] FS: 00007f53db93f700(0000) GS:ffff8800bba00000(0000) knlGS:0000000000000000
[ 246.252869] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 246.252869] CR2: ffff88009c422000 CR3: 0000000000159000 CR4: 00000000000406e0
[ 246.252869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 246.252869] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 246.252869] Process trinity-child12 (pid: 11985, threadinfo ffff88000015a000, task ffff88009c60b000)
[ 246.252869] Stack:
[ 246.252869] ffffffff81234aae ffff88000015bc88 ffffffff81273639 0000000000a00000
[ 246.252869] 0000000002718000 ffff8800ab36d050 ffff880000153800 ffffea0002690000
[ 246.252869] 0000000000a00000 ffff8800ab36d000 ffffea0002710000 ffff88000015bd48
[ 246.252869] Call Trace:
[ 246.252869] [<ffffffff81234aae>] ? copy_user_huge_page+0x1de/0x240
[ 246.252869] [<ffffffff81273639>] ? mem_cgroup_charge_common+0xa9/0xc0
[ 246.252869] [<ffffffff8126b4d7>] do_huge_pmd_wp_page+0x9f7/0xc60
[ 246.252869] [<ffffffff81a0acd9>] ? __const_udelay+0x29/0x30
[ 246.252869] [<ffffffff8123364e>] handle_mm_fault+0x26e/0x650
[ 246.252869] [<ffffffff8117dc1a>] ? __lock_is_held+0x5a/0x80
[ 246.252869] [<ffffffff83db3814>] ? __do_page_fault+0x514/0x5e0
[ 246.252869] [<ffffffff83db3870>] __do_page_fault+0x570/0x5e0
[ 246.252869] [<ffffffff811c6500>] ? rcu_eqs_exit_common+0x60/0x260
[ 246.252869] [<ffffffff811c740e>] ? rcu_eqs_enter_common+0x33e/0x3b0
[ 246.252869] [<ffffffff811c679c>] ? rcu_eqs_exit+0x9c/0xb0
[ 246.252869] [<ffffffff83db3912>] do_page_fault+0x32/0x50
[ 246.252869] [<ffffffff83db2ef0>] do_async_page_fault+0x30/0xc0
[ 246.252869] [<ffffffff83db01e8>] async_page_fault+0x28/0x30
[ 246.252869] Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90
[ 246.252869] RIP [<ffffffff81a0a795>] copy_page_rep+0x5/0x10
[ 246.252869] RSP <ffff88000015bc40>
[ 246.252869] CR2: ffff88009c422000
[ 246.252869] ---[ end trace 09fbe37b108d5766 ]---

And this is the code:

        if (is_huge_zero_pmd(orig_pmd))
                clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
        else
                copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR); <--- this


Thanks,
Sasha


2013-04-04 14:04:34

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

Ping? I'm seeing a whole bunch of these with current -next.


Thanks,
Sasha

On 03/29/2013 09:04 AM, Sasha Levin wrote:
> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running latest -next kernel,
> I've stumbled on the following.
>
> It seems that the code in do_huge_pmd_wp_page() was recently modified in
> "thp: do_huge_pmd_wp_page(): handle huge zero page".
>
> [... oops trace and code snippet snipped; identical to the original report above ...]

2013-04-04 14:28:55

by Kirill A. Shutemov

Subject: Re: mm: BUG in do_huge_pmd_wp_page

Sasha Levin wrote:
> Ping? I'm seeing a whole bunch of these with current -next.

Do you have a way to reproduce?

--
Kirill A. Shutemov

2013-04-04 14:38:58

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/04/2013 10:30 AM, Kirill A. Shutemov wrote:
> Sasha Levin wrote:
>> Ping? I'm seeing a whole bunch of these with current -next.
>
> Do you have a way to reproduce?

Not really, trinity just manages to make it happen quite often.

I can add something in the code to spew more debug info when it
happens though, I just couldn't come up with anything that might
be useful.


Thanks,
Sasha

2013-04-04 16:27:12

by Kirill A. Shutemov

Subject: Re: mm: BUG in do_huge_pmd_wp_page

Sasha Levin wrote:
> On 04/04/2013 10:30 AM, Kirill A. Shutemov wrote:
> > Sasha Levin wrote:
> >> Ping? I'm seeing a whole bunch of these with current -next.
> >
> > Do you have a way to reproduce?
>
> Not really, trinity just manages to make it happen quite often.
>
> I can add something in the code to spew more debug info when it
> happens though, I just couldn't come up with anything that might
> be useful.

I will try to reproduce on my own.

I'm new to trinity. Any hints on run parameters?
Is 'trinity -g vm -C 8' in lkvm good enough?

--
Kirill A. Shutemov

2013-04-04 21:55:10

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/04/2013 12:28 PM, Kirill A. Shutemov wrote:
> Sasha Levin wrote:
>> On 04/04/2013 10:30 AM, Kirill A. Shutemov wrote:
>>> Sasha Levin wrote:
>>>> Ping? I'm seeing a whole bunch of these with current -next.
>>>
>>> Do you have a way to reproduce?
>>
>> Not really, trinity just manages to make it happen quite often.
>>
>> I can add something in the code to spew more debug info when it
>> happens though, I just couldn't come up with anything that might
>> be useful.
>
> I will try to reproduce on my own.
>
> I'm new to trinity. Any hints on run parameters?
> Is 'trinity -g vm -C 8' in lkvm good enough?

I've actually been unable to reproduce it with just '-g vm'.

My command line for trinity is:

trinity -m --quiet --dangerous -l off -C 60 2> /dev/null

Note that I run it as root and use a disposable image.


Thanks,
Sasha

2013-04-10 08:02:10

by Minchan Kim

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Fri, Mar 29, 2013 at 09:04:16AM -0400, Sasha Levin wrote:
> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running latest -next kernel,
> I've stumbled on the following.
>
> It seems that the code in do_huge_pmd_wp_page() was recently modified in
> "thp: do_huge_pmd_wp_page(): handle huge zero page".
>
> [... oops trace and code snippet snipped; identical to the original report above ...]

I don't know whether this issue was already resolved. If so, my reply becomes
just a question to Kirill, regardless of this BUG.

While looking at the code, I started to wonder about the reference handling
logic of GHZP (aka get_huge_zero_page). The logic depends on the page
allocator never allocating PFN 0.

Who makes sure of that? What happens if the allocator does allocate PFN 0?
I don't know that every architecture guarantees it.
Have you investigated all arches?

If not:

CPU 1                   CPU 2                           CPU 3

shrink_huge_zero_page
  huge_zero_refcount = 0;
                        GHZP
                          pfn_0_zero_page = alloc_pages
                                                        GHZP
                                                          pfn_some_zero_page = alloc_page
  huge_zero_pfn = 0
                        huge_zero_pfn = pfn_0
                        huge_zero_refcount = 2
                                                        huge_zero_pfn = pfn_some
                                                        huge_zero_refcount = 2

So if you want to stick with this logic, don't we at least need a BUG_ON to
check for PFN 0 allocation in get_huge_zero_page()?
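
For reference, the logic in question looks roughly like this (a condensed
sketch; the '-' lines of Kirill's diff later in this thread show the real
mm/huge_memory.c code):

        static atomic_t huge_zero_refcount;
        static unsigned long huge_zero_pfn __read_mostly;  /* 0 means "not allocated" */

        static unsigned long get_huge_zero_page(void)
        {
                struct page *zero_page;
        retry:
                if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
                        return ACCESS_ONCE(huge_zero_pfn);

                zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
                                HPAGE_PMD_ORDER);
                ...
                /* If alloc_pages() ever handed back PFN 0, this cmpxchg would
                 * "succeed" for every racing caller, since huge_zero_pfn would
                 * still look unallocated -- the interleaving shown above. */
                if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
                        preempt_enable();
                        __free_page(zero_page);
                        goto retry;
                }
                atomic_set(&huge_zero_refcount, 2);
                preempt_enable();
                return ACCESS_ONCE(huge_zero_pfn);
        }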

--
Kind regards,
Minchan Kim

2013-04-11 13:16:19

by Kirill A. Shutemov

Subject: Re: mm: BUG in do_huge_pmd_wp_page

Minchan Kim wrote:
> On Fri, Mar 29, 2013 at 09:04:16AM -0400, Sasha Levin wrote:
> > Hi all,
> >
> > While fuzzing with trinity inside a KVM tools guest running latest -next kernel,
> > I've stumbled on the following.
> >
> > It seems that the code in do_huge_pmd_wp_page() was recently modified in
> > "thp: do_huge_pmd_wp_page(): handle huge zero page".
> >
> > [... oops trace and code snippet snipped ...]
>
> I don't know whether this issue was already resolved. If so, my reply becomes
> just a question to Kirill, regardless of this BUG.
>
> While looking at the code, I started to wonder about the reference handling
> logic of GHZP (aka get_huge_zero_page). The logic depends on the page
> allocator never allocating PFN 0.
>
> Who makes sure of that? What happens if the allocator does allocate PFN 0?
> I don't know that every architecture guarantees it.
> Have you investigated all arches?
>
> If not:
>
> CPU 1                   CPU 2                           CPU 3
>
> shrink_huge_zero_page
>   huge_zero_refcount = 0;
>                         GHZP
>                           pfn_0_zero_page = alloc_pages
>                                                         GHZP
>                                                           pfn_some_zero_page = alloc_page
>   huge_zero_pfn = 0
>                         huge_zero_pfn = pfn_0
>                         huge_zero_refcount = 2
>                                                         huge_zero_pfn = pfn_some
>                                                         huge_zero_refcount = 2
>
> So if you want to stick with this logic, don't we at least need a BUG_ON to
> check for PFN 0 allocation in get_huge_zero_page()?

I don't think it's related to oops in the thread (I was not able to
reproduce it), but nice catch anyway.

What about the patch below?

=====

From 4579aefd606b2dd82797af163ce6d08912894b3a Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 11 Apr 2013 15:47:50 +0300
Subject: [PATCH] thp: fix huge zero page logic for page with pfn == 0

The current implementation of the huge zero page uses pfn value 0 to indicate
that the page hasn't been allocated yet. It assumes that the buddy page
allocator can't return a page with pfn == 0.

Let's rework the code to store the 'struct page *' of the huge zero page, not
its pfn. This way we can avoid the weak assumption.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Minchan Kim <[email protected]>
---
mm/huge_memory.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 45eaae0..bc2a548 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -163,25 +163,24 @@ static int start_khugepaged(void)
 }
 
 static atomic_t huge_zero_refcount;
-static unsigned long huge_zero_pfn __read_mostly;
+static struct page *huge_zero_page __read_mostly;
 
-static inline bool is_huge_zero_pfn(unsigned long pfn)
+static inline bool is_huge_zero_page(struct page *page)
 {
-        unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
-        return zero_pfn && pfn == zero_pfn;
+        return ACCESS_ONCE(huge_zero_page) == page;
 }
 
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
-        return is_huge_zero_pfn(pmd_pfn(pmd));
+        return is_huge_zero_page(pmd_page(pmd));
 }
 
-static unsigned long get_huge_zero_page(void)
+static struct page *get_huge_zero_page(void)
 {
         struct page *zero_page;
 retry:
         if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
-                return ACCESS_ONCE(huge_zero_pfn);
+                return ACCESS_ONCE(huge_zero_page);
 
         zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
                         HPAGE_PMD_ORDER);
@@ -191,7 +190,7 @@ retry:
         }
         count_vm_event(THP_ZERO_PAGE_ALLOC);
         preempt_disable();
-        if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+        if (cmpxchg(&huge_zero_page, NULL, zero_page)) {
                 preempt_enable();
                 __free_page(zero_page);
                 goto retry;
@@ -200,7 +199,7 @@ retry:
         /* We take additional reference here. It will be put back by shrinker */
         atomic_set(&huge_zero_refcount, 2);
         preempt_enable();
-        return ACCESS_ONCE(huge_zero_pfn);
+        return ACCESS_ONCE(huge_zero_page);
 }
 
 static void put_huge_zero_page(void)
@@ -220,9 +219,9 @@ static int shrink_huge_zero_page(struct shrinker *shrink,
                 return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
 
         if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
-                unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
-                BUG_ON(zero_pfn == 0);
-                __free_page(__pfn_to_page(zero_pfn));
+                struct page *zero_page = xchg(&huge_zero_page, NULL);
+                BUG_ON(zero_page == NULL);
+                __free_page(zero_page);
         }
 
         return 0;
@@ -764,12 +763,12 @@ static inline struct page *alloc_hugepage(int defrag)
 
 static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
                 struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
-                unsigned long zero_pfn)
+                struct page *zero_page)
 {
         pmd_t entry;
         if (!pmd_none(*pmd))
                 return false;
-        entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
+        entry = mk_pmd(zero_page, vma->vm_page_prot);
         entry = pmd_wrprotect(entry);
         entry = pmd_mkhuge(entry);
         set_pmd_at(mm, haddr, pmd, entry);
@@ -794,20 +793,20 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
         if (!(flags & FAULT_FLAG_WRITE) &&
                         transparent_hugepage_use_zero_page()) {
                 pgtable_t pgtable;
-                unsigned long zero_pfn;
+                struct page *zero_page;
                 bool set;
                 pgtable = pte_alloc_one(mm, haddr);
                 if (unlikely(!pgtable))
                         return VM_FAULT_OOM;
-                zero_pfn = get_huge_zero_page();
-                if (unlikely(!zero_pfn)) {
+                zero_page = get_huge_zero_page();
+                if (unlikely(!zero_page)) {
                         pte_free(mm, pgtable);
                         count_vm_event(THP_FAULT_FALLBACK);
                         goto out;
                 }
                 spin_lock(&mm->page_table_lock);
                 set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-                                zero_pfn);
+                                zero_page);
                 spin_unlock(&mm->page_table_lock);
                 if (!set) {
                         pte_free(mm, pgtable);
@@ -886,16 +885,16 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
          * a page table.
          */
         if (is_huge_zero_pmd(pmd)) {
-                unsigned long zero_pfn;
+                struct page *zero_page;
                 bool set;
                 /*
                  * get_huge_zero_page() will never allocate a new page here,
                  * since we already have a zero page to copy. It just takes a
                  * reference.
                  */
-                zero_pfn = get_huge_zero_page();
+                zero_page = get_huge_zero_page();
                 set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
-                                zero_pfn);
+                                zero_page);
                 BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
                 ret = 0;
                 goto out_unlock;
@@ -1803,7 +1802,7 @@ int split_huge_page(struct page *page)
         struct anon_vma *anon_vma;
         int ret = 1;
 
-        BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
+        BUG_ON(is_huge_zero_page(page));
         BUG_ON(!PageAnon(page));
 
         /*
--
Kirill A. Shutemov

2013-04-11 14:56:41

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/10/2013 04:02 AM, Minchan Kim wrote:
> I don't know this issue was already resolved. If so, my reply become a just
> question to Kirill regardless of this BUG.

The issue is still reproducible with today's -next.


Thanks,
Sasha

2013-04-11 15:11:43

by Kirill A. Shutemov

Subject: Re: mm: BUG in do_huge_pmd_wp_page

Sasha Levin wrote:
> On 04/10/2013 04:02 AM, Minchan Kim wrote:
> > I don't know this issue was already resolved. If so, my reply become a just
> > question to Kirill regardless of this BUG.
>
> The issue is still reproducible with today's -next.

Could you share your kernel config and configuration of your virtual machine?

--
Kirill A. Shutemov

2013-04-11 15:16:02

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/11/2013 11:13 AM, Kirill A. Shutemov wrote:
> Sasha Levin wrote:
>> On 04/10/2013 04:02 AM, Minchan Kim wrote:
>>> I don't know this issue was already resolved. If so, my reply become a just
>>> question to Kirill regardless of this BUG.
>>
>> The issue is still reproducible with today's -next.
>
> Could you share your kernel config and configuration of your virtual machine?

I've attached my .config.

I start the vm using:

./vm sandbox --rng --balloon -k /usr/src/linux/arch/x86/boot/bzImage -d run -d /dev/shm/swap --no-dhcp -m 3072 -c 6 -p
"init=/virt/init zcache ftrace_dump_on_oops debugpat kvm.mmu_audit=1 memblock=debug slub_debug=FZPU" -- /runtrin.sh

Where /runtrin.sh inside the vm simply mounts some stuff like sysfs and proc,
creates the swap space and runs trinity.


Thanks,
Sasha


Attachments:
config-sasha (141.08 kB)

2013-04-14 07:13:17

by Minchan Kim

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Thu, Apr 11, 2013 at 04:18:13PM +0300, Kirill A. Shutemov wrote:
> Minchan Kim wrote:
> > On Fri, Mar 29, 2013 at 09:04:16AM -0400, Sasha Levin wrote:
> > > Hi all,
> > >
> > > While fuzzing with trinity inside a KVM tools guest running latest -next kernel,
> > > I've stumbled on the following.
> > >
> > > It seems that the code in do_huge_pmd_wp_page() was recently modified in
> > > "thp: do_huge_pmd_wp_page(): handle huge zero page".
> > >
> > > [... oops trace and code snippet snipped ...]
> >
> > I don't know whether this issue was already resolved. If so, my reply becomes
> > just a question to Kirill, regardless of this BUG.
> >
> > While looking at the code, I started to wonder about the reference handling
> > logic of GHZP (aka get_huge_zero_page). The logic depends on the page
> > allocator never allocating PFN 0.
> >
> > Who makes sure of that? What happens if the allocator does allocate PFN 0?
> > I don't know that every architecture guarantees it.
> > Have you investigated all arches?
> >
> > If not:
> >
> > CPU 1                   CPU 2                           CPU 3
> >
> > shrink_huge_zero_page
> >   huge_zero_refcount = 0;
> >                         GHZP
> >                           pfn_0_zero_page = alloc_pages
> >                                                         GHZP
> >                                                           pfn_some_zero_page = alloc_page
> >   huge_zero_pfn = 0
> >                         huge_zero_pfn = pfn_0
> >                         huge_zero_refcount = 2
> >                                                         huge_zero_pfn = pfn_some
> >                                                         huge_zero_refcount = 2
> >
> > So if you want to stick with this logic, don't we at least need a BUG_ON to
> > check for PFN 0 allocation in get_huge_zero_page()?
>
> I don't think it's related to oops in the thread (I was not able to
> reproduce it), but nice catch anyway.
>
> What about the patch below?
>
> =====
>
> From 4579aefd606b2dd82797af163ce6d08912894b3a Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Thu, 11 Apr 2013 15:47:50 +0300
> Subject: [PATCH] thp: fix huge zero page logic for page with pfn == 0
>
> The current implementation of the huge zero page uses pfn value 0 to indicate
> that the page hasn't been allocated yet. It assumes that the buddy page
> allocator can't return a page with pfn == 0.
>
> Let's rework the code to store the 'struct page *' of the huge zero page, not
> its pfn. This way we can avoid the weak assumption.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reported-by: Minchan Kim <[email protected]>

Looks nice to me.

Acked-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2013-04-24 22:46:12

by Andrew Morton

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Thu, 11 Apr 2013 11:14:29 -0400 Sasha Levin <[email protected]> wrote:

> On 04/11/2013 11:13 AM, Kirill A. Shutemov wrote:
> > Sasha Levin wrote:
> >> On 04/10/2013 04:02 AM, Minchan Kim wrote:
> >>> I don't know whether this issue was already resolved. If so, my reply becomes
> >>> just a question to Kirill, regardless of this BUG.
> >>
> >> The issue is still reproducible with today's -next.
> >
> > Could you share your kernel config and configuration of your virtual machine?
>
> I've attached my .config.
>
> I start the vm using:
>
> ./vm sandbox --rng --balloon -k /usr/src/linux/arch/x86/boot/bzImage -d run -d /dev/shm/swap --no-dhcp -m 3072 -c 6 -p
> "init=/virt/init zcache ftrace_dump_on_oops debugpat kvm.mmu_audit=1 memblock=debug slub_debug=FZPU" -- /runtrin.sh
>
> Where /runtrin.sh inside the vm simply mounts some stuff like sysfs and proc,
> creates the swap space and runs trinity.

Guys, did this get fixed?

2013-04-24 22:51:53

by H. Peter Anvin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/10/2013 01:02 AM, Minchan Kim wrote:
>
> While looking at the code, I started to wonder about the reference handling
> logic of GHZP (aka get_huge_zero_page). The logic depends on the page
> allocator never allocating PFN 0.
>
> Who makes sure of that? What happens if the allocator does allocate PFN 0?
> I don't know that every architecture guarantees it.
> Have you investigated all arches?
>

This isn't manifest, right? At least on x86 we should never, ever
allocate PFN 0.

-hpa

2013-04-26 00:52:39

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/24/2013 06:46 PM, Andrew Morton wrote:
> On Thu, 11 Apr 2013 11:14:29 -0400 Sasha Levin <[email protected]> wrote:
>
>> On 04/11/2013 11:13 AM, Kirill A. Shutemov wrote:
>>> Sasha Levin wrote:
>>>> On 04/10/2013 04:02 AM, Minchan Kim wrote:
>>>>> I don't know whether this issue was already resolved. If so, my reply becomes
>>>>> just a question to Kirill, regardless of this BUG.
>>>>
>>>> The issue is still reproducible with today's -next.
>>>
>>> Could you share your kernel config and configuration of your virtual machine?
>>
>> I've attached my .config.
>>
>> I start the vm using:
>>
>> ./vm sandbox --rng --balloon -k /usr/src/linux/arch/x86/boot/bzImage -d run -d /dev/shm/swap --no-dhcp -m 3072 -c 6 -p
>> "init=/virt/init zcache ftrace_dump_on_oops debugpat kvm.mmu_audit=1 memblock=debug slub_debug=FZPU" -- /runtrin.sh
>>
>> Where /runtrin.sh inside the vm simply mounts some stuff like sysfs and proc,
>> creates the swap space and runs trinity.
>
> Guys, did this get fixed?

I've stopped seeing that during fuzzing, so I guess that it got fixed somehow...


Thanks,
Sasha

2013-04-26 01:30:33

by Minchan Kim

Subject: Re: mm: BUG in do_huge_pmd_wp_page

Hello hpa,

On Wed, Apr 24, 2013 at 03:51:12PM -0700, H. Peter Anvin wrote:
> On 04/10/2013 01:02 AM, Minchan Kim wrote:
> >
> > While looking at the code, I started to wonder about the reference handling
> > logic of GHZP (aka get_huge_zero_page). The logic depends on the page
> > allocator never allocating PFN 0.
> >
> > Who makes sure of that? What happens if the allocator does allocate PFN 0?
> > I don't know that every architecture guarantees it.
> > Have you investigated all arches?
> >
>
> This isn't manifest, right? At least on x86 we should never, ever
> allocate PFN 0.

Thanks for the confirm.

--
Kind regards,
Minchan Kim

2013-04-26 02:01:37

by Dave Jones

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Thu, Apr 25, 2013 at 08:51:27PM -0400, Sasha Levin wrote:
> On 04/24/2013 06:46 PM, Andrew Morton wrote:
> > Guys, did this get fixed?
>
> I've stopped seeing that during fuzzing, so I guess that it got fixed somehow...

We've had reports of users hitting this in 3.8

eg:
https://bugzilla.redhat.com/show_bug.cgi?id=947985
https://bugzilla.redhat.com/show_bug.cgi?id=956730

I'm sure there are other reports of it too.

Would be good if we can figure out what fixed it (if it is actually fixed)
for backporting to stable

Dave

2013-04-26 02:29:15

by H. Peter Anvin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/24/2013 04:40 PM, Simon Jeons wrote:
>
> I see that memblock_trim_memory() does: start = round_up(orig_start, align);
> here align is PAGE_SIZE, so the dump of zone ranges on my machine is
> [ 0.000000] DMA [mem 0x00001000-0x00ffffff]. Why is PFN 0 not used?
> Just because of the alignment?
>

PFN 0 contains the real-mode interrupt vector table and BIOS data area,
so we just reserve it. Avoids issues with zero being special, too.
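
For reference, a sketch of that early reservation (in kernels of this era it
lives around arch/x86/kernel/setup.c; reserve_low defaults to 64K via
CONFIG_X86_RESERVE_LOW -- details vary by version, so treat this as an
illustration):

        /* Reserve the first pages of RAM, PFN 0 included, in memblock
         * before the buddy allocator is populated with free pages. */
        static void __init trim_low_memory_range(void)
        {
                memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
        }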

-hpa

2013-04-26 03:13:56

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/25/2013 10:01 PM, Dave Jones wrote:
> On Thu, Apr 25, 2013 at 08:51:27PM -0400, Sasha Levin wrote:
> > On 04/24/2013 06:46 PM, Andrew Morton wrote:
> > > Guys, did this get fixed?
> >
> > I've stopped seeing that during fuzzing, so I guess that it got fixed somehow...
>
> We've had reports of users hitting this in 3.8
>
> eg:
> https://bugzilla.redhat.com/show_bug.cgi?id=947985
> https://bugzilla.redhat.com/show_bug.cgi?id=956730
>
> I'm sure there are other reports of it too.
>
> Would be good if we can figure out what fixed it (if it is actually fixed)
> for backporting to stable

If it's interesting to know I'll bisect it over the weekend...

Think it's enough to look at mm/ commits?
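
(For reference, git can restrict a bisection to commits touching given paths;
a sketch, with placeholder refs:

        git bisect start -- mm/
        git bisect bad <ref-that-reproduces>      # placeholder
        git bisect good <ref-that-does-not>       # placeholder

though the culprit could of course live outside mm/.)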


Thanks,
Sasha

2014-02-04 03:02:21

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/25/2013 10:01 PM, Dave Jones wrote:
> On Thu, Apr 25, 2013 at 08:51:27PM -0400, Sasha Levin wrote:
> > On 04/24/2013 06:46 PM, Andrew Morton wrote:
> > > Guys, did this get fixed?
> >
> > I've stopped seeing that during fuzzing, so I guess that it got fixed somehow...
>
> We've had reports of users hitting this in 3.8
>
> eg:
> https://bugzilla.redhat.com/show_bug.cgi?id=947985
> https://bugzilla.redhat.com/show_bug.cgi?id=956730
>
> I'm sure there are other reports of it too.
>
> Would be good if we can figure out what fixed it (if it is actually fixed)
> for backporting to stable

It's been a while (7 months?), but this one is back...

Just hit it again with today's -next:

[ 762.701278] BUG: unable to handle kernel paging request at ffff88009eae6000
[ 762.702462] IP: [<ffffffff81ae8455>] copy_page_rep+0x5/0x10
[ 762.703369] PGD 84bb067 PUD 22fa81067 PMD 22f98b067 PTE 800000009eae6060
[ 762.704411] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 762.705873] Dumping ftrace buffer:
[ 762.707606] (ftrace buffer empty)
[ 762.708311] Modules linked in:
[ 762.708762] CPU: 16 PID: 17920 Comm: trinity-c16 Tainted: G W 3.13.0-next-20140203-sasha-00007-gf4985e2 #23
[ 762.710135] task: ffff8801ac358000 ti: ffff880199234000 task.ti: ffff880199234000
[ 762.710135] RIP: 0010:[<ffffffff81ae8455>] [<ffffffff81ae8455>] copy_page_rep+0x5/0x10
[ 762.710135] RSP: 0018:ffff880199235c90 EFLAGS: 00010286
[ 762.710135] RAX: 0000000080000002 RBX: 00000000056db980 RCX: 0000000000000200
[ 762.710135] RDX: ffff8801ac358000 RSI: ffff88009eae6000 RDI: ffff88015b6e6000
[ 762.710135] RBP: ffff880199235cd8 R08: 0000000000000000 R09: 0000000000000000
[ 762.710135] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000027ab980
[ 762.710135] R13: 0000000000000200 R14: 00000000000000e6 R15: ffff880000000000
[ 762.710135] FS: 00007fb0804e1700(0000) GS:ffff88003da00000(0000) knlGS:0000000000000000
[ 762.710135] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 762.710135] CR2: ffff88009eae6000 CR3: 0000000199225000 CR4: 00000000000006e0
[ 762.710135] Stack:
[ 762.710135] ffffffff81298995 ffff8801a841ae00 ffff88003d084520 ffff880199227090
[ 762.710135] 800000009ea008e5 ffff8801a841ae00 ffffea00027a8000 ffff880199227090
[ 762.710135] ffffea00056d8000 ffff880199235d58 ffffffff812d7260 ffff880199235cf8
[ 762.710135] Call Trace:
[ 762.710135] [<ffffffff81298995>] ? copy_user_huge_page+0x1a5/0x210
[ 762.710135] [<ffffffff812d7260>] do_huge_pmd_wp_page+0x3d0/0x650
[ 762.710135] [<ffffffff811a308e>] ? put_lock_stats+0xe/0x30
[ 762.710135] [<ffffffff8129b511>] __handle_mm_fault+0x2b1/0x3d0
[ 762.710135] [<ffffffff8129b763>] handle_mm_fault+0x133/0x1c0
[ 762.710135] [<ffffffff8129bcf8>] __get_user_pages+0x438/0x630
[ 762.710135] [<ffffffff811a308e>] ? put_lock_stats+0xe/0x30
[ 762.710135] [<ffffffff8129cfc4>] __mlock_vma_pages_range+0xd4/0xe0
[ 762.710135] [<ffffffff8129d0e0>] __mm_populate+0x110/0x190
[ 762.710135] [<ffffffff8129dcd0>] SyS_mlockall+0x160/0x1b0
[ 762.710135] [<ffffffff84450650>] tracesys+0xdd/0xe2
[ 762.710135] Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90
[ 762.710135] RIP [<ffffffff81ae8455>] copy_page_rep+0x5/0x10
[ 762.710135] RSP <ffff880199235c90>
[ 762.710135] CR2: ffff88009eae6000


Thanks,
Sasha

2014-02-04 04:00:15

by Hugh Dickins

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Mon, 3 Feb 2014, Sasha Levin wrote:
> On 04/25/2013 10:01 PM, Dave Jones wrote:
> > On Thu, Apr 25, 2013 at 08:51:27PM -0400, Sasha Levin wrote:
> > > On 04/24/2013 06:46 PM, Andrew Morton wrote:
> > > > Guys, did this get fixed?
> > >
> > > I've stopped seeing that during fuzzing, so I guess that it got fixed
> > > somehow...
> >
> > We've had reports of users hitting this in 3.8
> >
> > eg:
> > https://bugzilla.redhat.com/show_bug.cgi?id=947985
> > https://bugzilla.redhat.com/show_bug.cgi?id=956730
> >
> > I'm sure there are other reports of it too.
> >
> > Would be good if we can figure out what fixed it (if it is actually fixed)
> > for backporting to stable
>
> It's been a while (7 months?), but this one is back...
>
> Just hit it again with today's -next:
>
> [ 762.701278] BUG: unable to handle kernel paging request at ffff88009eae6000
> [ 762.702462] IP: [<ffffffff81ae8455>] copy_page_rep+0x5/0x10
> [... rest of oops trace snipped ...]

Here's what I suggested about that one in eecc1e426d68
"thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only":
Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
to the source page being split, and a tail page freed, while copy
is in progress; and not a problem without DEBUG_PAGEALLOC, since
the pmd_same check will prevent a miscopy from being made visible.

It could be fixed by additional locking, or by taking an additional
reference on every tail page, in the DEBUG_PAGEALLOC case (we wouldn't
want to add to the overhead in the normal case). I didn't feel very
motivated to uglify the code in that way just for DEBUG_PAGEALLOC and
trinity: if it only comes up once in seven months, I'm inclined to
live with it myself, but you may have a different perspective.
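
(A hypothetical sketch of that second option, for illustration only -- not a
proposed patch -- with the extra references taken only when DEBUG_PAGEALLOC
is configured:

        #ifdef CONFIG_DEBUG_PAGEALLOC
                for (i = 0; i < HPAGE_PMD_NR; i++)
                        get_page(page + i);     /* pin each subpage across the copy */
        #endif
                copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
        #ifdef CONFIG_DEBUG_PAGEALLOC
                for (i = 0; i < HPAGE_PMD_NR; i++)
                        put_page(page + i);
        #endif

exactly the sort of uglification described above.)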

Hugh

2014-02-04 16:59:00

by Kirill A. Shutemov

Subject: Re: mm: BUG in do_huge_pmd_wp_page

Hugh Dickins wrote:
> On Mon, 3 Feb 2014, Sasha Levin wrote:
> > On 04/25/2013 10:01 PM, Dave Jones wrote:
> > > On Thu, Apr 25, 2013 at 08:51:27PM -0400, Sasha Levin wrote:
> > > > On 04/24/2013 06:46 PM, Andrew Morton wrote:
> > > > > Guys, did this get fixed?
> > > >
> > > > I've stopped seeing that during fuzzing, so I guess that it got fixed
> > > > somehow...
> > >
> > > We've had reports of users hitting this in 3.8
> > >
> > > eg:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=947985
> > > https://bugzilla.redhat.com/show_bug.cgi?id=956730
> > >
> > > I'm sure there are other reports of it too.
> > >
> > > Would be good if we can figure out what fixed it (if it is actually fixed)
> > > for backporting to stable
> >
> > It's been a while (7 months?), but this one is back...
> >
> > Just hit it again with today's -next:
> >
> > [... oops trace snipped ...]
>
> Here's what I suggested about that one in eecc1e426d68
> "thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only":
> Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
> in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
> in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
> to the source page being split, and a tail page freed, while copy
> is in progress; and not a problem without DEBUG_PAGEALLOC, since
> the pmd_same check will prevent a miscopy from being made visible.
>
> It could be fixed by additional locking, or by taking an additional
> reference on every tail page, in the DEBUG_PAGEALLOC case (we wouldn't
> want to add to the overhead in the normal case).

One more nasty idea: invent a "safe" variant of copy_user_huge_page(), like
safe_copy_page() in kernel/power/snapshot.c. Although I'm not sure whether
safe_copy_page() itself is safe against races...
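
For reference, safe_copy_page() there reads (abridged from
kernel/power/snapshot.c): it remaps a DEBUG_PAGEALLOC-unmapped source page
around the copy, but nothing in it stops the page being freed while the copy
runs -- hence the doubt above.

        static void safe_copy_page(void *dst, struct page *s_page)
        {
                if (kernel_page_present(s_page)) {
                        do_copy_page(dst, page_address(s_page));
                } else {
                        kernel_map_pages(s_page, 1, 1); /* map it back */
                        do_copy_page(dst, page_address(s_page));
                        kernel_map_pages(s_page, 1, 0); /* and unmap again */
                }
        }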

--
Kirill A. Shutemov

2014-02-05 18:13:30

by Sasha Levin

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 02/03/2014 10:59 PM, Hugh Dickins wrote:
> On Mon, 3 Feb 2014, Sasha Levin wrote:
>> On 04/25/2013 10:01 PM, Dave Jones wrote:
>>> On Thu, Apr 25, 2013 at 08:51:27PM -0400, Sasha Levin wrote:
>>> > On 04/24/2013 06:46 PM, Andrew Morton wrote:
>>> > > Guys, did this get fixed?
>>> >
>>> > I've stopped seeing that during fuzzing, so I guess that it got fixed
>>> > somehow...
>>>
>>> We've had reports of users hitting this in 3.8
>>>
>>> eg:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=947985
>>> https://bugzilla.redhat.com/show_bug.cgi?id=956730
>>>
>>> I'm sure there are other reports of it too.
>>>
>>> Would be good if we can figure out what fixed it (if it is actually fixed)
>>> for backporting to stable
>>
>> It's been a while (7 months?), but this one is back...
>>
>> Just hit it again with today's -next:
>>
>> [... oops trace snipped ...]
>
> Here's what I suggested about that one in eecc1e426d68
> "thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only":
> Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
> in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
> in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
> to the source page being split, and a tail page freed, while copy
> is in progress; and not a problem without DEBUG_PAGEALLOC, since
> the pmd_same check will prevent a miscopy from being made visible.
>
> It could be fixed by additional locking, or by taking an additional
> reference on every tail page, in the DEBUG_PAGEALLOC case (we wouldn't
> want to add to the overhead in the normal case). I didn't feel very
> motivated to uglify the code in that way just for DEBUG_PAGEALLOC and
> trinity: if it only comes up once in seven months, I'm inclined to
> live with it myself, but you may have a different perspective.

Either something changed in the kernel or in trinity, but I'm now hitting it 3-4 times a day.

I've been trying to look at the code to understand the issue you've described,
but I can't pinpoint the exact location where that happens.

Could you please point me to the relevant code sections?


Thanks,
Sasha

2014-02-05 22:50:58

by Hugh Dickins

Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Wed, 5 Feb 2014, Sasha Levin wrote:
> On 02/03/2014 10:59 PM, Hugh Dickins wrote:
> > On Mon, 3 Feb 2014, Sasha Levin wrote:
> > >
> > > [ 762.701278] BUG: unable to handle kernel paging request at ffff88009eae6000
> > > [ 762.702462] IP: [<ffffffff81ae8455>] copy_page_rep+0x5/0x10
> > > [ 762.710135] Call Trace:
> > > [ 762.710135] [<ffffffff81298995>] ? copy_user_huge_page+0x1a5/0x210
> > > [ 762.710135] [<ffffffff812d7260>] do_huge_pmd_wp_page+0x3d0/0x650
> > > [ 762.710135] [<ffffffff811a308e>] ? put_lock_stats+0xe/0x30
> > > [ 762.710135] [<ffffffff8129b511>] __handle_mm_fault+0x2b1/0x3d0
> > > [ 762.710135] [<ffffffff8129b763>] handle_mm_fault+0x133/0x1c0
> > > [ 762.710135] [<ffffffff8129bcf8>] __get_user_pages+0x438/0x630
> > > [ 762.710135] [<ffffffff811a308e>] ? put_lock_stats+0xe/0x30
> > > [ 762.710135] [<ffffffff8129cfc4>] __mlock_vma_pages_range+0xd4/0xe0
> > > [ 762.710135] [<ffffffff8129d0e0>] __mm_populate+0x110/0x190
> > > [ 762.710135] [<ffffffff8129dcd0>] SyS_mlockall+0x160/0x1b0
> > > [ 762.710135] [<ffffffff84450650>] tracesys+0xdd/0xe2
> >
> > Here's what I suggested about that one in eecc1e426d68
> > "thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only":
> > Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
> > in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
> > in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
> > to the source page being split, and a tail page freed, while copy
> > is in progress; and not a problem without DEBUG_PAGEALLOC, since
> > the pmd_same check will prevent a miscopy from being made visible.
> >
> > It could be fixed by additional locking, or by taking an additional
> > reference on every tail page, in the DEBUG_PAGEALLOC case (we wouldn't
> > want to add to the overhead in the normal case). I didn't feel very
> > motivated to uglify the code in that way just for DEBUG_PAGEALLOC and
> > trinity: if it only comes up once in seven months, I'm inclined to
> > live with it myself, but you may have a different perspective.
>
> Either something changed in the kernel or in trinity, but I'm now hitting it
> 3-4 times a day.
>
> I've been trying to look at the code to understand the issue you've
> described, but I can't pinpoint the exact location where that happens.
>
> Could you please point me to the relevant code sections?

I'm not sure which part of it is unclear.

copy_page_rep (arch/x86/lib/copy_page_64.S) is what copy_user_huge_page
(mm/memory.c) ends up calling, when it's invoked from do_huge_pmd_wp_page
(mm/huge_memory.c). At this point we hold down_read of this mm's mmap_sem,
and a get_page on the head of the THP; but we don't have down_write or
page_table_lock or compound_lock or anon_vma lock, some of which might
prevent concurrent THP splitting (I say "some" and "might" because I've
not gone back to check precisely which are actually relevant here: THP
locking rules are not the simplest...).

Do you accept that the THP might be split while we're copying? And if
that happens, then, for example, there might be a WP fault from another
thread to one of the 4k pages it gets split into, which results in that
particular 4k page being freed after it's been copied (I'm thinking its
refcount demands that it be copied at the time of the fault, but then
the additional ref gets freed - a fork proceeds to exec and frees it,
for example).

When the page is freed, free_pages_prepare (mm/page_alloc.c) calls
kernel_map_pages (arch/x86/mm/pageattr.c if CONFIG_DEBUG_PAGEALLOC) to
unmap the freed page from kernel virtual address space (__set_pages_np).
Hence "unable to handle kernel paging request" when copy_page_rep
reaches that part of what used to be the THP.

Hugh

2014-04-04 19:37:34

by Sasha Levin

[permalink] [raw]
Subject: Re: mm: BUG in do_huge_pmd_wp_page

And another ping exactly a year later :)

Yes, this is still happening in -next:

[ 370.615914] BUG: unable to handle kernel paging request at ffff880768c72000
[ 370.620246] IP: copy_page (arch/x86/lib/copy_page_64.S:34)
[ 370.620246] PGD 1091c067 PUD 102c5e6067 PMD 102c49f067 PTE 8000000768c72060
[ 370.620246] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 370.620246] Dumping ftrace buffer:
[ 370.620246] (ftrace buffer empty)
[ 370.620246] Modules linked in:
[ 370.620246] CPU: 18 PID: 9480 Comm: trinity-c149 Not tainted 3.14.0-next-20140403-sasha-00019-g7474aa9 #375
[ 370.620246] task: ffff8805e92c0000 ti: ffff8805e92c8000 task.ti: ffff8805e92c8000
[ 370.620246] RIP: copy_page (arch/x86/lib/copy_page_64.S:34)
[ 370.620246] RSP: 0000:ffff8805e92c9c20 EFLAGS: 00010206
[ 370.620246] RAX: 0000000000000002 RBX: fffffffff6040000 RCX: 000000000000003a
[ 370.620246] RDX: ffff8805e92c0000 RSI: ffff880768c72000 RDI: ffff8804e9c72000
[ 370.620246] RBP: ffff8805e92c9c78 R08: 000000000000004a R09: 0a0074656e0a0018
[ 370.620246] R10: 0000000000000001 R11: 0000000000000000 R12: 000000001da31c80
[ 370.620246] R13: ffff880000000000 R14: 000000001da38000 R15: ffffea001da30000
[ 370.620246] FS: 00007f65d8d9e700(0000) GS:ffff8804ecc00000(0000) knlGS:0000000000000000
[ 370.620246] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 370.620246] CR2: ffff880768c72000 CR3: 00000005e92a1000 CR4: 00000000000006a0
[ 370.620246] Stack:
[ 370.620246] fffffffff6040000 000000001da31c80 ffffffff892ae6de 0000000000000200
[ 370.620246] ffff8805e92a3028 0000000000a00000 ffffea001da30000 ffffea0013a70000
[ 370.620246] ffff8805e92a3028 0000000000a00000 ffff8805e929a000 ffff8805e92c9d28
[ 370.620246] Call Trace:
[ 370.620246] ? copy_user_huge_page (include/linux/uaccess.h:36 (discriminator 2) include/linux/highmem.h:75 (discriminator 2) include/linux/highmem.h:232 (discriminator 2) mm/memory.c:4398 (discriminator 2))
[ 370.620246] do_huge_pmd_wp_page (arch/x86/include/asm/bitops.h:95 include/linux/page-flags.h:301 mm/huge_memory.c:1122)
[ 370.620246] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 370.620246] ? sched_clock_local (kernel/sched/clock.c:214)
[ 370.620246] ? get_parent_ip (kernel/sched/core.c:2472)
[ 370.620246] __handle_mm_fault (mm/memory.c:3877)
[ 370.620246] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 370.620246] handle_mm_fault (include/linux/memcontrol.h:148 mm/memory.c:3935)
[ 370.620246] __do_page_fault (arch/x86/mm/fault.c:1220)
[ 370.620246] ? vtime_account_user (kernel/sched/cputime.c:687)
[ 370.620246] ? get_parent_ip (kernel/sched/core.c:2472)
[ 370.620246] ? context_tracking_user_exit (include/linux/vtime.h:89 include/linux/jump_label.h:105 include/trace/events/context_tracking.h:47 kernel/context_tracking.c:178)
[ 370.620246] ? preempt_count_sub (kernel/sched/core.c:2527)
[ 370.620246] ? context_tracking_user_exit (kernel/context_tracking.c:182)
[ 370.620246] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 370.620246] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[ 370.620246] do_page_fault (arch/x86/mm/fault.c:1272 include/linux/jump_label.h:105 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1273)
[ 370.620246] do_async_page_fault (arch/x86/kernel/kvm.c:263)
[ 370.620246] async_page_fault (arch/x86/kernel/entry_64.S:1496)
[ 370.620246] Code: c3 0f 1f 80 00 00 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 b9 3b 00 00 00 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 ff c9 <48> 8b 06 48 8b 5e 08 48 8b 56 10 4c 8b 46 18 4c 8b 4e 20 4c 8b
[ 370.620246] RIP copy_page (arch/x86/lib/copy_page_64.S:34)
[ 370.620246] RSP <ffff8805e92c9c20>
[ 370.620246] CR2: ffff880768c72000


Thanks,
Sasha

On 04/04/2013 10:03 AM, Sasha Levin wrote:
> Ping? I'm seeing a whole bunch of these with current -next.
>
>
> Thanks,
> Sasha

2014-04-07 14:49:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Fri, Apr 04, 2014 at 03:37:20PM -0400, Sasha Levin wrote:
> And another ping exactly a year later :)

I think we could "fix" this false positive with the patch below
(untested), but it's ugly and doesn't add much value.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6ac89e9f82ef..65ac113037e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1053,6 +1053,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long haddr;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
+ pmd_t entry;

ptl = pmd_lockptr(mm, pmd);
VM_BUG_ON(!vma->anon_vma);
@@ -1115,42 +1116,45 @@ alloc:

count_vm_event(THP_FAULT_ALLOC);

+ mmun_start = haddr;
+ mmun_end = haddr + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
+ if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
+ spin_lock(ptl);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_race;
+ }
+
if (!page)
clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
else
copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
- spin_lock(ptl);
+ if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
+ spin_lock(ptl);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_race;
+ }
if (page)
put_page(page);
- if (unlikely(!pmd_same(*pmd, orig_pmd))) {
- spin_unlock(ptl);
- mem_cgroup_uncharge_page(new_page);
- put_page(new_page);
- goto out_mn;
+
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ pmdp_clear_flush(vma, haddr, pmd);
+ page_add_new_anon_rmap(new_page, vma, haddr);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, pmd);
+ if (!page) {
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ put_huge_zero_page();
} else {
- pmd_t entry;
- entry = mk_huge_pmd(new_page, vma->vm_page_prot);
- entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- pmdp_clear_flush(vma, haddr, pmd);
- page_add_new_anon_rmap(new_page, vma, haddr);
- set_pmd_at(mm, haddr, pmd, entry);
- update_mmu_cache_pmd(vma, address, pmd);
- if (!page) {
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- put_huge_zero_page();
- } else {
- VM_BUG_ON_PAGE(!PageHead(page), page);
- page_remove_rmap(page);
- put_page(page);
- }
- ret |= VM_FAULT_WRITE;
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ page_remove_rmap(page);
+ put_page(page);
}
+ ret |= VM_FAULT_WRITE;
spin_unlock(ptl);
out_mn:
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1159,6 +1163,13 @@ out:
out_unlock:
spin_unlock(ptl);
return ret;
+out_race:
+ spin_unlock(ptl);
+ if (page)
+ put_page(page);
+ mem_cgroup_uncharge_page(new_page);
+ put_page(new_page);
+ goto out_mn;
}

struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
--
Kirill A. Shutemov

2014-04-07 14:57:09

by Sasha Levin

[permalink] [raw]
Subject: Re: mm: BUG in do_huge_pmd_wp_page

On 04/07/2014 10:48 AM, Kirill A. Shutemov wrote:
> On Fri, Apr 04, 2014 at 03:37:20PM -0400, Sasha Levin wrote:
>> > And another ping exactly a year later :)
> I think we could "fix" this false positive with the patch below
> (untested), but it's ugly and doesn't add much value.

I could carry that patch myself and not complain about it
any more if there's no intent to produce a "real" fix, but
I doubt that that's really the path we want to take in the
long run.

We'll end up with a bunch of broken paths whenever debug
options are enabled, making any sort of debugging slow and
useless, which isn't a desirable result, to say the least.


Thanks,
Sasha

2014-04-07 19:41:07

by Sasha Levin

[permalink] [raw]
Subject: Re: mm: BUG in do_huge_pmd_wp_page

It also breaks fairly quickly under testing because:

On 04/07/2014 10:48 AM, Kirill A. Shutemov wrote:
> + if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> + spin_lock(ptl);

^ We go into atomic context here

> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + goto out_race;
> + }
> +
> if (!page)
> clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
> else
> copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);

copy_user_huge_page() doesn't like running in atomic context,
and asserts might_sleep().


Thanks,
Sasha

2014-04-07 20:11:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: mm: BUG in do_huge_pmd_wp_page

On Mon, Apr 07, 2014 at 03:40:46PM -0400, Sasha Levin wrote:
> It also breaks fairly quickly under testing because:
>
> On 04/07/2014 10:48 AM, Kirill A. Shutemov wrote:
> > + if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> > + spin_lock(ptl);
>
> ^ We go into atomic
>
> > + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> > + goto out_race;
> > + }
> > +
> > if (!page)
> > clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
> > else
> > copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
>
> copy_user_huge_page() doesn't like running in atomic state,
> and asserts might_sleep().

Okay, I'll try something else.

--
Kirill A. Shutemov