2017-06-08 13:48:36

by Xishi Qiu

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On 2017/5/23 17:33, Vlastimil Babka wrote:

> On 05/23/2017 11:21 AM, zhong jiang wrote:
>> On 2017/5/23 0:51, Vlastimil Babka wrote:
>>> On 05/20/2017 05:01 AM, zhong jiang wrote:
>>>> On 2017/5/20 10:40, Hugh Dickins wrote:
>>>>> On Sat, 20 May 2017, Xishi Qiu wrote:
>>>>>> Here is a bug report form redhat: https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>>>>>> And I meet the bug too. However it is hard to reproduce, and
>>>>>> 624483f3ea82598("mm: rmap: fix use-after-free in __put_anon_vma") is not help.
>>>>>>
>>>>>> From the vmcore, it seems that the page is still mapped(_mapcount=0 and _count=2),
>>>>>> and the value of mapping is a valid address(mapping = 0xffff8801b3e2a101),
>>>>>> but anon_vma has been corrupted.
>>>>>>
>>>>>> Any ideas?
>>>>> Sorry, no. I assume that _mapcount has been misaccounted, for example
>>>>> a pte mapped in on top of another pte; but cannot begin tell you where
>>>>> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>>>>>
>>>>> Hugh
>>>>>
>>>>> .
>>>>>
>>>> Hi, Hugh
>>>>
>>>> I find the following message from the dmesg.
>>>>
>>>> [26068.316592] BUG: Bad rss-counter state mm:ffff8800a7de2d80 idx:1 val:1
>>>>
>>>> I can prove that the __mapcount is misaccount. when task is exited. the rmap
>>>> still exist.
>>> Check if the kernel in question contains this commit: ad33bb04b2a6 ("mm:
>>> thp: fix SMP race condition between THP page fault and MADV_DONTNEED")
>> HI, Vlastimil
>>
>> I miss the patch.
>
> Try applying it then, there's good chance the error and crash will go
> away. Even if your workload doesn't actually run any madvise(MADV_DONTNEED).
>

Hi Vlastimil,

I found this error was reported by Kirill as follows, right?
https://patchwork.kernel.org/patch/7550401/

The call trace looks much the same as ours.

Thanks,
Xishi Qiu

>> When I read the patch, I found the following issue, but I am not sure it is right.
>>
>> if (unlikely(pmd_trans_unstable(pmd)))
>> return 0;
>> /*
>> * A regular pmd is established and it can't morph into a huge pmd
>> * from under us anymore at this point because we hold the mmap_sem
>> * read mode and khugepaged takes it in write mode. So now it's
>> * safe to run pte_offset_map().
>> */
>> pte = pte_offset_map(pmd, address);
>>
>> After the pmd_trans_unstable() call, without any further protection, the comment
>> says pte_offset_map() is safe. But before the pte_offset_map() call, the pmd may
>> still become unstable. Is that possible?
>
> IIRC it's "unstable" wrt possible none->huge->none transition. But once
> we've seen it's a regular pmd via pmd_trans_unstable(), we're safe as a
> transition from regular pmd can't happen.
>
>> Thanks
>> zhongjiang
>>>> Thanks
>>>> zhongjiang
>>>>
>>>> --
>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>> the body to [email protected]. For more info on Linux MM,
>>>> see: http://www.linux-mm.org/ .
>>>> Don't email: [email protected]
>>>>
>>>
>>> .
>>>
>>
>>
>
>
> .
>




2017-06-08 13:59:26

by Vlastimil Babka

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On 06/08/2017 03:44 PM, Xishi Qiu wrote:
> On 2017/5/23 17:33, Vlastimil Babka wrote:
>
>> On 05/23/2017 11:21 AM, zhong jiang wrote:
>>> On 2017/5/23 0:51, Vlastimil Babka wrote:
>>>> On 05/20/2017 05:01 AM, zhong jiang wrote:
>>>>> On 2017/5/20 10:40, Hugh Dickins wrote:
>>>>>> On Sat, 20 May 2017, Xishi Qiu wrote:
>>>>>>> Here is a bug report form redhat: https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>>>>>>> And I meet the bug too. However it is hard to reproduce, and
>>>>>>> 624483f3ea82598("mm: rmap: fix use-after-free in __put_anon_vma") is not help.
>>>>>>>
>>>>>>> From the vmcore, it seems that the page is still mapped(_mapcount=0 and _count=2),
>>>>>>> and the value of mapping is a valid address(mapping = 0xffff8801b3e2a101),
>>>>>>> but anon_vma has been corrupted.
>>>>>>>
>>>>>>> Any ideas?
>>>>>> Sorry, no. I assume that _mapcount has been misaccounted, for example
>>>>>> a pte mapped in on top of another pte; but cannot begin tell you where
>>>>>> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>>>>>>
>>>>>> Hugh
>>>>>>
>>>>>> .
>>>>>>
>>>>> Hi, Hugh
>>>>>
>>>>> I find the following message from the dmesg.
>>>>>
>>>>> [26068.316592] BUG: Bad rss-counter state mm:ffff8800a7de2d80 idx:1 val:1
>>>>>
>>>>> I can prove that the __mapcount is misaccount. when task is exited. the rmap
>>>>> still exist.
>>>> Check if the kernel in question contains this commit: ad33bb04b2a6 ("mm:
>>>> thp: fix SMP race condition between THP page fault and MADV_DONTNEED")
>>> HI, Vlastimil
>>>
>>> I miss the patch.
>>
>> Try applying it then, there's good chance the error and crash will go
>> away. Even if your workload doesn't actually run any madvise(MADV_DONTNEED).
>>
>
> Hi Vlastimil,
>
> I find this error was reported by Kirill as following, right?
> https://patchwork.kernel.org/patch/7550401/

That was reported by Minchan.

> The call trace is quite like the same as ours.

In that thread, the error seems to have just disappeared in the end.

So, did you apply the patch I suggested? Did it help?

> Thanks,
> Xishi Qiu
>
>>> when I read the patch. I find the following issue. but I am sure it is right.
>>>
>>> if (unlikely(pmd_trans_unstable(pmd)))
>>> return 0;
>>> /*
>>> * A regular pmd is established and it can't morph into a huge pmd
>>> * from under us anymore at this point because we hold the mmap_sem
>>> * read mode and khugepaged takes it in write mode. So now it's
>>> * safe to run pte_offset_map().
>>> */
>>> pte = pte_offset_map(pmd, address);
>>>
>>> after pmd_trans_unstable call, without any protect method. by the comments,
>>> it think the pte_offset_map is safe. before pte_offset_map call, it still may be
>>> unstable. it is possible?
>>
>> IIRC it's "unstable" wrt possible none->huge->none transition. But once
>> we've seen it's a regular pmd via pmd_trans_unstable(), we're safe as a
>> transition from regular pmd can't happen.
>>
>>> Thanks
>>> zhongjiang
>>>>> Thanks
>>>>> zhongjiang
>>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>
>>
>> .
>>
>
>
>

2017-06-08 14:12:58

by zhong jiang

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On 2017/6/8 21:59, Vlastimil Babka wrote:
> On 06/08/2017 03:44 PM, Xishi Qiu wrote:
>> On 2017/5/23 17:33, Vlastimil Babka wrote:
>>
>>> On 05/23/2017 11:21 AM, zhong jiang wrote:
>>>> On 2017/5/23 0:51, Vlastimil Babka wrote:
>>>>> On 05/20/2017 05:01 AM, zhong jiang wrote:
>>>>>> On 2017/5/20 10:40, Hugh Dickins wrote:
>>>>>>> On Sat, 20 May 2017, Xishi Qiu wrote:
>>>>>>>> Here is a bug report form redhat: https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>>>>>>>> And I meet the bug too. However it is hard to reproduce, and
>>>>>>>> 624483f3ea82598("mm: rmap: fix use-after-free in __put_anon_vma") is not help.
>>>>>>>>
>>>>>>>> From the vmcore, it seems that the page is still mapped(_mapcount=0 and _count=2),
>>>>>>>> and the value of mapping is a valid address(mapping = 0xffff8801b3e2a101),
>>>>>>>> but anon_vma has been corrupted.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>> Sorry, no. I assume that _mapcount has been misaccounted, for example
>>>>>>> a pte mapped in on top of another pte; but cannot begin tell you where
>>>>>>> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>>>>>>>
>>>>>>> Hugh
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>> Hi, Hugh
>>>>>>
>>>>>> I find the following message from the dmesg.
>>>>>>
>>>>>> [26068.316592] BUG: Bad rss-counter state mm:ffff8800a7de2d80 idx:1 val:1
>>>>>>
>>>>>> I can prove that the __mapcount is misaccount. when task is exited. the rmap
>>>>>> still exist.
>>>>> Check if the kernel in question contains this commit: ad33bb04b2a6 ("mm:
>>>>> thp: fix SMP race condition between THP page fault and MADV_DONTNEED")
>>>> HI, Vlastimil
>>>>
>>>> I miss the patch.
>>> Try applying it then, there's good chance the error and crash will go
>>> away. Even if your workload doesn't actually run any madvise(MADV_DONTNEED).
>>>
>> Hi Vlastimil,
>>
>> I find this error was reported by Kirill as following, right?
>> https://patchwork.kernel.org/patch/7550401/
> That was reported by Minchan.
>
>> The call trace is quite like the same as ours.
> In that thread, the error seems just disappeared in the end.
Without any patch, I wonder how it could just disappear.
> So, did you apply the patch I suggested? Did it help?
Yes, I applied the patch and have been testing for two weeks with no panic.
But the last panic only occurred after one month, so we are still not sure
that it really resolves the issue.

Thanks
zhongjiang
>> Thanks,
>> Xishi Qiu
>>
>>>> when I read the patch. I find the following issue. but I am sure it is right.
>>>>
>>>> if (unlikely(pmd_trans_unstable(pmd)))
>>>> return 0;
>>>> /*
>>>> * A regular pmd is established and it can't morph into a huge pmd
>>>> * from under us anymore at this point because we hold the mmap_sem
>>>> * read mode and khugepaged takes it in write mode. So now it's
>>>> * safe to run pte_offset_map().
>>>> */
>>>> pte = pte_offset_map(pmd, address);
>>>>
>>>> after pmd_trans_unstable call, without any protect method. by the comments,
>>>> it think the pte_offset_map is safe. before pte_offset_map call, it still may be
>>>> unstable. it is possible?
>>> IIRC it's "unstable" wrt possible none->huge->none transition. But once
>>> we've seen it's a regular pmd via pmd_trans_unstable(), we're safe as a
>>> transition from regular pmd can't happen.
>>>
>>>> Thanks
>>>> zhongjiang
>>>>>> Thanks
>>>>>> zhongjiang
>>>>>>
>>>>> .
>>>>>
>>>>
>>>
>>> .
>>>
>>
>>
>
> .
>


2017-07-18 11:00:14

by Xishi Qiu

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On 2017/6/8 21:59, Vlastimil Babka wrote:

> On 06/08/2017 03:44 PM, Xishi Qiu wrote:
>> On 2017/5/23 17:33, Vlastimil Babka wrote:
>>
>>> On 05/23/2017 11:21 AM, zhong jiang wrote:
>>>> On 2017/5/23 0:51, Vlastimil Babka wrote:
>>>>> On 05/20/2017 05:01 AM, zhong jiang wrote:
>>>>>> On 2017/5/20 10:40, Hugh Dickins wrote:
>>>>>>> On Sat, 20 May 2017, Xishi Qiu wrote:
>>>>>>>> Here is a bug report form redhat: https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>>>>>>>> And I meet the bug too. However it is hard to reproduce, and
>>>>>>>> 624483f3ea82598("mm: rmap: fix use-after-free in __put_anon_vma") is not help.
>>>>>>>>
>>>>>>>> From the vmcore, it seems that the page is still mapped(_mapcount=0 and _count=2),
>>>>>>>> and the value of mapping is a valid address(mapping = 0xffff8801b3e2a101),
>>>>>>>> but anon_vma has been corrupted.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>> Sorry, no. I assume that _mapcount has been misaccounted, for example
>>>>>>> a pte mapped in on top of another pte; but cannot begin tell you where
>>>>>>> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>>>>>>>
>>>>>>> Hugh
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>> Hi, Hugh
>>>>>>
>>>>>> I find the following message from the dmesg.
>>>>>>
>>>>>> [26068.316592] BUG: Bad rss-counter state mm:ffff8800a7de2d80 idx:1 val:1
>>>>>>
>>>>>> I can prove that the __mapcount is misaccount. when task is exited. the rmap
>>>>>> still exist.
>>>>> Check if the kernel in question contains this commit: ad33bb04b2a6 ("mm:
>>>>> thp: fix SMP race condition between THP page fault and MADV_DONTNEED")
>>>> HI, Vlastimil
>>>>
>>>> I miss the patch.
>>>
>>> Try applying it then, there's good chance the error and crash will go
>>> away. Even if your workload doesn't actually run any madvise(MADV_DONTNEED).
>>>
>>
>> Hi Vlastimil,
>>
>> I find this error was reported by Kirill as following, right?
>> https://patchwork.kernel.org/patch/7550401/
>
> That was reported by Minchan.
>
>> The call trace is quite like the same as ours.
>
> In that thread, the error seems just disappeared in the end.
>
> So, did you apply the patch I suggested? Did it help?
>

Hi,

Unfortunately, this patch (mm: thp: fix SMP race condition between
THP page fault and MADV_DONTNEED) didn't help; I got the panic again.

And I found this error before the panic: "[468229.996610] BUG: Bad rss-counter state mm:ffff8806aebc2580 idx:1 val:1"

[468451.702807] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[468451.702861] IP: [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.702900] PGD 12445e067 PUD 11acaa067 PMD 0
[468451.702931] Oops: 0000 [#1] SMP
[468451.702953] kbox catch die event.
[468451.703003] collected_len = 1047419, LOG_BUF_LEN_LOCAL = 1048576
[468451.703003] kbox: notify die begin
[468451.703003] kbox: no notify die func register. no need to notify
[468451.703003] do nothing after die!
[468451.703003] Modules linked in: ipt_REJECT macvlan ip_set_hash_ipport vport_vxlan(OVE) xt_statistic xt_physdev xt_nat xt_recent xt_mark xt_comment veth ct_limit(OVE) bum_extract(OVE) policy(OVE) bum(OVE) ip_set nfnetlink openvswitch(OVE) nf_defrag_ipv6 gre ext3 jbd ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack bridge stp llc kboxdriver(O) kbox(O) dm_thin_pool dm_persistent_data crc32_pclmul dm_bio_prison dm_bufio ghash_clmulni_intel libcrc32c aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc cirrus virtio_console parport syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm i2c_piix4 i2c_core pcspkr ip_tables ext4 jbd2 mbcache sr_mod cdrom ata_generic pata_acpi
[468451.703003] virtio_net virtio_blk crct10dif_pclmul crct10dif_common ata_piix virtio_pci libata serio_raw virtio_ring crc32c_intel virtio dm_mirror dm_region_hash dm_log dm_mod
[468451.703003] CPU: 6 PID: 21965 Comm: docker-containe Tainted: G OE ----V------- 3.10.0-327.53.58.73.x86_64 #1
[468451.703003] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.8.1-0-g4adadbd-20170107_142945-9_64_246_229 04/01/2014
[468451.703003] task: ffff880692402e00 ti: ffff88018209c000 task.ti: ffff88018209c000
[468451.703003] RIP: 0010:[<ffffffff810ac089>] [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.703003] RSP: 0018:ffff88018209f8f8 EFLAGS: 00010202
[468451.703003] RAX: 0000000000000000 RBX: ffff880720cd7740 RCX: ffff880720cd7740
[468451.703003] RDX: 0000000000000001 RSI: 0000000000000301 RDI: 0000000000000008
[468451.703003] RBP: ffff88018209f8f8 R08: 00000000c0e0f310 R09: ffff880720cd7740
[468451.703003] R10: ffff88083efd8000 R11: 0000000000000000 R12: ffff880720cd7741
[468451.703003] R13: ffffea000824d100 R14: 0000000000000008 R15: 0000000000000000
[468451.703003] FS: 00007fc0e2a85700(0000) GS:ffff88083ed80000(0000) knlGS:0000000000000000
[468451.703003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[468451.703003] CR2: 0000000000000008 CR3: 0000000661906000 CR4: 00000000001407e0
[468451.703003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[468451.703003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[468451.703003] Stack:
[468451.703003] ffff88018209f928 ffffffff811a7eb5 ffffea000824d100 ffff88018209fa90
[468451.703003] ffffea00082f9680 0000000000000301 ffff88018209f978 ffffffff811a82e1
[468451.703003] ffffea000824d100 ffff88018209fa00 0000000000000001 ffffea000824d100
[468451.703003] Call Trace:
[468451.703003] [<ffffffff811a7eb5>] page_lock_anon_vma_read+0x55/0x110
[468451.703003] [<ffffffff811a82e1>] try_to_unmap_anon+0x21/0x120
[468451.703003] [<ffffffff811a842d>] try_to_unmap+0x4d/0x60
[468451.712006] [<ffffffff811cc749>] migrate_pages+0x439/0x790
[468451.712006] [<ffffffff81193280>] ? __reset_isolation_suitable+0xe0/0xe0
[468451.712006] [<ffffffff811941f9>] compact_zone+0x299/0x400
[468451.712006] [<ffffffff81059aff>] ? kvm_clock_get_cycles+0x1f/0x30
[468451.712006] [<ffffffff811943fc>] compact_zone_order+0x9c/0xf0
[468451.712006] [<ffffffff811947b1>] try_to_compact_pages+0x121/0x1a0
[468451.712006] [<ffffffff8163ace6>] __alloc_pages_direct_compact+0xac/0x196
[468451.712006] [<ffffffff811783e2>] __alloc_pages_nodemask+0xbc2/0xca0
[468451.712006] [<ffffffff811bcb7a>] alloc_pages_vma+0x9a/0x150
[468451.712006] [<ffffffff811d1573>] do_huge_pmd_anonymous_page+0x123/0x510
[468451.712006] [<ffffffff8119bc58>] handle_mm_fault+0x1a8/0xf50
[468451.712006] [<ffffffff8164b4d6>] __do_page_fault+0x166/0x470
[468451.712006] [<ffffffff8164b8a3>] trace_do_page_fault+0x43/0x110
[468451.712006] [<ffffffff8164af79>] do_async_page_fault+0x29/0xe0
[468451.712006] [<ffffffff81647a38>] async_page_fault+0x28/0x30
[468451.712006] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 42 ff fc ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
[468451.712006] RIP [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.738667] RSP <ffff88018209f8f8>
[468451.738667] CR2: 0000000000000008



2017-07-19 08:41:01

by Vlastimil Babka

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On 07/18/2017 12:59 PM, Xishi Qiu wrote:
> Hi,
>
> Unfortunately, this patch(mm: thp: fix SMP race condition between
> THP page fault and MADV_DONTNEED) didn't help, I got the panic again.

Too bad then. I don't know of any other patch from my own experience
being directly related; try to look for similar THP-related race fixes.
Did you already check whether disabling THP (set it to "never" under
/sys/...) prevents the issue? I forgot.

> And I find this error before panic, "[468229.996610] BUG: Bad rss-counter state mm:ffff8806aebc2580 idx:1 val:1"

This likely means that a pte was overwritten to zero, and an anon page
had no other reference than this pte, so it became orphaned. Its
anon_vma object was freed as the process exited, and eventually
overwritten by a new user, so compaction or reclaim looking at it sooner
or later makes a bad memory access.

The pte overwriting may be a result of races with multiple threads
trying to either read or write within the same page, involving THP zero
page. It doesn't have to be MADV_DONTNEED related.

> [468451.702807] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [468451.702861] IP: [<ffffffff810ac089>] down_read_trylock+0x9/0x30
> [468451.702900] PGD 12445e067 PUD 11acaa067 PMD 0
> [468451.702931] Oops: 0000 [#1] SMP
> [468451.702953] kbox catch die event.
> [468451.703003] collected_len = 1047419, LOG_BUF_LEN_LOCAL = 1048576
> [468451.703003] kbox: notify die begin
> [468451.703003] kbox: no notify die func register. no need to notify
> [468451.703003] do nothing after die!
> [468451.703003] Modules linked in: ipt_REJECT macvlan ip_set_hash_ipport vport_vxlan(OVE) xt_statistic xt_physdev xt_nat xt_recent xt_mark xt_comment veth ct_limit(OVE) bum_extract(OVE) policy(OVE) bum(OVE) ip_set nfnetlink openvswitch(OVE) nf_defrag_ipv6 gre ext3 jbd ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack bridge stp llc kboxdriver(O) kbox(O) dm_thin_pool dm_persistent_data crc32_pclmul dm_bio_prison dm_bufio ghash_clmulni_intel libcrc32c aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc cirrus virtio_console parport syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm i2c_piix4 i2c_core pcspkr ip_tables ext4 jbd2 mbcache sr_mod cdrom ata_generic pata_acpi
> [468451.703003] virtio_net virtio_blk crct10dif_pclmul crct10dif_common ata_piix virtio_pci libata serio_raw virtio_ring crc32c_intel virtio dm_mirror dm_region_hash dm_log dm_mod
> [468451.703003] CPU: 6 PID: 21965 Comm: docker-containe Tainted: G OE ----V------- 3.10.0-327.53.58.73.x86_64 #1
> [468451.703003] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.8.1-0-g4adadbd-20170107_142945-9_64_246_229 04/01/2014
> [468451.703003] task: ffff880692402e00 ti: ffff88018209c000 task.ti: ffff88018209c000
> [468451.703003] RIP: 0010:[<ffffffff810ac089>] [<ffffffff810ac089>] down_read_trylock+0x9/0x30
> [468451.703003] RSP: 0018:ffff88018209f8f8 EFLAGS: 00010202
> [468451.703003] RAX: 0000000000000000 RBX: ffff880720cd7740 RCX: ffff880720cd7740
> [468451.703003] RDX: 0000000000000001 RSI: 0000000000000301 RDI: 0000000000000008
> [468451.703003] RBP: ffff88018209f8f8 R08: 00000000c0e0f310 R09: ffff880720cd7740
> [468451.703003] R10: ffff88083efd8000 R11: 0000000000000000 R12: ffff880720cd7741
> [468451.703003] R13: ffffea000824d100 R14: 0000000000000008 R15: 0000000000000000
> [468451.703003] FS: 00007fc0e2a85700(0000) GS:ffff88083ed80000(0000) knlGS:0000000000000000
> [468451.703003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [468451.703003] CR2: 0000000000000008 CR3: 0000000661906000 CR4: 00000000001407e0
> [468451.703003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [468451.703003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [468451.703003] Stack:
> [468451.703003] ffff88018209f928 ffffffff811a7eb5 ffffea000824d100 ffff88018209fa90
> [468451.703003] ffffea00082f9680 0000000000000301 ffff88018209f978 ffffffff811a82e1
> [468451.703003] ffffea000824d100 ffff88018209fa00 0000000000000001 ffffea000824d100
> [468451.703003] Call Trace:
> [468451.703003] [<ffffffff811a7eb5>] page_lock_anon_vma_read+0x55/0x110
> [468451.703003] [<ffffffff811a82e1>] try_to_unmap_anon+0x21/0x120
> [468451.703003] [<ffffffff811a842d>] try_to_unmap+0x4d/0x60
> [468451.712006] [<ffffffff811cc749>] migrate_pages+0x439/0x790
> [468451.712006] [<ffffffff81193280>] ? __reset_isolation_suitable+0xe0/0xe0
> [468451.712006] [<ffffffff811941f9>] compact_zone+0x299/0x400
> [468451.712006] [<ffffffff81059aff>] ? kvm_clock_get_cycles+0x1f/0x30
> [468451.712006] [<ffffffff811943fc>] compact_zone_order+0x9c/0xf0
> [468451.712006] [<ffffffff811947b1>] try_to_compact_pages+0x121/0x1a0
> [468451.712006] [<ffffffff8163ace6>] __alloc_pages_direct_compact+0xac/0x196
> [468451.712006] [<ffffffff811783e2>] __alloc_pages_nodemask+0xbc2/0xca0
> [468451.712006] [<ffffffff811bcb7a>] alloc_pages_vma+0x9a/0x150
> [468451.712006] [<ffffffff811d1573>] do_huge_pmd_anonymous_page+0x123/0x510
> [468451.712006] [<ffffffff8119bc58>] handle_mm_fault+0x1a8/0xf50
> [468451.712006] [<ffffffff8164b4d6>] __do_page_fault+0x166/0x470
> [468451.712006] [<ffffffff8164b8a3>] trace_do_page_fault+0x43/0x110
> [468451.712006] [<ffffffff8164af79>] do_async_page_fault+0x29/0xe0
> [468451.712006] [<ffffffff81647a38>] async_page_fault+0x28/0x30
> [468451.712006] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 42 ff fc ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
> [468451.712006] RIP [<ffffffff810ac089>] down_read_trylock+0x9/0x30
> [468451.738667] RSP <ffff88018209f8f8>
> [468451.738667] CR2: 0000000000000008
>
>
>

2017-07-19 10:00:50

by Xishi Qiu

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On 2017/7/19 16:40, Vlastimil Babka wrote:

> On 07/18/2017 12:59 PM, Xishi Qiu wrote:
>> Hi,
>>
>> Unfortunately, this patch(mm: thp: fix SMP race condition between
>> THP page fault and MADV_DONTNEED) didn't help, I got the panic again.
>
> Too bad then. I don't know of any other patch from my own experience
> being directly related, try to look for similar THP-related race fixes.
> Did you already check whether disabling THP (set it to "never" under
> /sys/...) prevents the issue? I forgot.
>

Hi Vlastimil,

Thanks for your reply.

This bug is hard to reproduce, and my production line doesn't allow
disabling THP because of the performance regression. Also, I have no way
to reproduce this bug myself (I don't have the user apps or stress
workload from the production line).

>> And I find this error before panic, "[468229.996610] BUG: Bad rss-counter state mm:ffff8806aebc2580 idx:1 val:1"
>
> This likely means that a pte was overwritten to zero, and an anon page
> had no other reference than this pte, so it became orphaned. Its
> anon_vma object was freed as the process exited, and eventually
> overwritten by a new user, so compaction or reclaim looking at it sooner
> or later makes a bad memory access.
>
> The pte overwriting may be a result of races with multiple threads
> trying to either read or write within the same page, involving THP zero
> page. It doesn't have to be MADV_DONTNEED related.
>

I found two patches from upstream:
887843961c4b4681ee993c36d4997bf4b4aa8253
a9c8e4beeeb64c22b84c803747487857fe424b68

I can't find any relation to the panic in the first one, and the second
one seems to be triggered by Xen, but we use KVM.

Thanks,
Xishi Qiu

>> [468451.702807] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
>> [468451.702861] IP: [<ffffffff810ac089>] down_read_trylock+0x9/0x30
>> [468451.702900] PGD 12445e067 PUD 11acaa067 PMD 0
>> [468451.702931] Oops: 0000 [#1] SMP
>> [468451.702953] kbox catch die event.
>> [468451.703003] collected_len = 1047419, LOG_BUF_LEN_LOCAL = 1048576
>> [468451.703003] kbox: notify die begin
>> [468451.703003] kbox: no notify die func register. no need to notify
>> [468451.703003] do nothing after die!
>> [468451.703003] Modules linked in: ipt_REJECT macvlan ip_set_hash_ipport vport_vxlan(OVE) xt_statistic xt_physdev xt_nat xt_recent xt_mark xt_comment veth ct_limit(OVE) bum_extract(OVE) policy(OVE) bum(OVE) ip_set nfnetlink openvswitch(OVE) nf_defrag_ipv6 gre ext3 jbd ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack bridge stp llc kboxdriver(O) kbox(O) dm_thin_pool dm_persistent_data crc32_pclmul dm_bio_prison dm_bufio ghash_clmulni_intel libcrc32c aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc cirrus virtio_console parport syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm i2c_piix4 i2c_core pcspkr ip_tables ext4 jbd2 mbcache sr_mod cdrom ata_generic pata_acpi
>> [468451.703003] virtio_net virtio_blk crct10dif_pclmul crct10dif_common ata_piix virtio_pci libata serio_raw virtio_ring crc32c_intel virtio dm_mirror dm_region_hash dm_log dm_mod
>> [468451.703003] CPU: 6 PID: 21965 Comm: docker-containe Tainted: G OE ----V------- 3.10.0-327.53.58.73.x86_64 #1
>> [468451.703003] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.8.1-0-g4adadbd-20170107_142945-9_64_246_229 04/01/2014
>> [468451.703003] task: ffff880692402e00 ti: ffff88018209c000 task.ti: ffff88018209c000
>> [468451.703003] RIP: 0010:[<ffffffff810ac089>] [<ffffffff810ac089>] down_read_trylock+0x9/0x30
>> [468451.703003] RSP: 0018:ffff88018209f8f8 EFLAGS: 00010202
>> [468451.703003] RAX: 0000000000000000 RBX: ffff880720cd7740 RCX: ffff880720cd7740
>> [468451.703003] RDX: 0000000000000001 RSI: 0000000000000301 RDI: 0000000000000008
>> [468451.703003] RBP: ffff88018209f8f8 R08: 00000000c0e0f310 R09: ffff880720cd7740
>> [468451.703003] R10: ffff88083efd8000 R11: 0000000000000000 R12: ffff880720cd7741
>> [468451.703003] R13: ffffea000824d100 R14: 0000000000000008 R15: 0000000000000000
>> [468451.703003] FS: 00007fc0e2a85700(0000) GS:ffff88083ed80000(0000) knlGS:0000000000000000
>> [468451.703003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [468451.703003] CR2: 0000000000000008 CR3: 0000000661906000 CR4: 00000000001407e0
>> [468451.703003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [468451.703003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [468451.703003] Stack:
>> [468451.703003] ffff88018209f928 ffffffff811a7eb5 ffffea000824d100 ffff88018209fa90
>> [468451.703003] ffffea00082f9680 0000000000000301 ffff88018209f978 ffffffff811a82e1
>> [468451.703003] ffffea000824d100 ffff88018209fa00 0000000000000001 ffffea000824d100
>> [468451.703003] Call Trace:
>> [468451.703003] [<ffffffff811a7eb5>] page_lock_anon_vma_read+0x55/0x110
>> [468451.703003] [<ffffffff811a82e1>] try_to_unmap_anon+0x21/0x120
>> [468451.703003] [<ffffffff811a842d>] try_to_unmap+0x4d/0x60
>> [468451.712006] [<ffffffff811cc749>] migrate_pages+0x439/0x790
>> [468451.712006] [<ffffffff81193280>] ? __reset_isolation_suitable+0xe0/0xe0
>> [468451.712006] [<ffffffff811941f9>] compact_zone+0x299/0x400
>> [468451.712006] [<ffffffff81059aff>] ? kvm_clock_get_cycles+0x1f/0x30
>> [468451.712006] [<ffffffff811943fc>] compact_zone_order+0x9c/0xf0
>> [468451.712006] [<ffffffff811947b1>] try_to_compact_pages+0x121/0x1a0
>> [468451.712006] [<ffffffff8163ace6>] __alloc_pages_direct_compact+0xac/0x196
>> [468451.712006] [<ffffffff811783e2>] __alloc_pages_nodemask+0xbc2/0xca0
>> [468451.712006] [<ffffffff811bcb7a>] alloc_pages_vma+0x9a/0x150
>> [468451.712006] [<ffffffff811d1573>] do_huge_pmd_anonymous_page+0x123/0x510
>> [468451.712006] [<ffffffff8119bc58>] handle_mm_fault+0x1a8/0xf50
>> [468451.712006] [<ffffffff8164b4d6>] __do_page_fault+0x166/0x470
>> [468451.712006] [<ffffffff8164b8a3>] trace_do_page_fault+0x43/0x110
>> [468451.712006] [<ffffffff8164af79>] do_async_page_fault+0x29/0xe0
>> [468451.712006] [<ffffffff81647a38>] async_page_fault+0x28/0x30
>> [468451.712006] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 42 ff fc ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
>> [468451.712006] RIP [<ffffffff810ac089>] down_read_trylock+0x9/0x30
>> [468451.738667] RSP <ffff88018209f8f8>
>> [468451.738667] CR2: 0000000000000008
>>
>>
>>
>
>
> .
>



2017-07-20 12:58:41

by Andrea Arcangeli

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On Wed, Jul 19, 2017 at 05:59:01PM +0800, Xishi Qiu wrote:
> I find two patches from upstream.
> 887843961c4b4681ee993c36d4997bf4b4aa8253

Do you use the remap_file_pages syscall? That syscall has been dropped
upstream, so very few apps should still be trying to use it on 64-bit
archs.

It would also require a get_user_pages(write=1, force=1) on a nonlinear
VM_SHARED mapping created without PROT_WRITE, and that would have to
happen before remap_file_pages is called to overwrite the page that got
poked by gdb.

Which sounds like an extremely unusual setup for a production
environment. That said, you're clearly running Docker containers, so who
knows what is running inside them (and the point where you notice the
stale anon_vma and the container that crashes isn't necessarily the
same container that runs the fremap readonly gdb poking workload).

I'll look into integrating the above fix regardless.

I'll also send you privately the fix backported to the specific
enterprise kernel you're using, adding a WARN_ON as well that will
tell us if such a fix ever makes a difference. The alternative is that
you place a perf probe or systemtap hook in remap_file_pages to know
if it ever runs, but the WARN_ON I'll add is even better proof. If you
get the WARN_ON in the logs, we'll be 100% sure the patch fixed
your issue and we don't have to keep looking for other issues of the
same kind.

> a9c8e4beeeb64c22b84c803747487857fe424b68
>
> I can't find any relations to the panic from the first one, and the second

Actually I do. Vlastimil's theory that a pte got marked none is sound,
but if zap_pte in a fremap fails to drop an anon page that was under
memory migration/compaction, the exact same thing will happen. Either
way an anon page isn't freed as it should have been: the vma will be
dropped, the anon_vma too, but the page will be left hanging around as
anonymous in the LRUs with page->mapping pointing to a stale anon_vma,
and the RSS counters will go off by one too.

> one seems triggered from xen, but we use kvm.

Correct, the second one isn't needed with KVM.

Thanks,
Andrea

2017-07-20 16:15:52

by Andrea Arcangeli

Subject: Re: mm, something wrong in page_lock_anon_vma_read()?

On Thu, Jul 20, 2017 at 02:58:35PM +0200, Andrea Arcangeli wrote:
> but if zap_pte in a fremap fails to drop the anon page that was under
> memory migration/compaction the exact same thing will happen. Either

... except it is OK to clear a migration entry; it is migration
that will free the new page after migration completes, so zap_pte
doesn't need to wait. So this fix is good, but I was too optimistic
about its ability to explain the whole problem. It can only explain
cosmetic RSS errors, not an anon page left hanging around after its
anon_vma has been freed.

About the theory that this could be THP-related: the RSS stats being off
by one as a symptom of the bug don't seem to point in that direction;
complex THP operations either don't touch the RSS counters or tend to
act in blocks of 512. Furthermore, the BZ already confirmed it can be
reproduced with THP disabled. That said, it also was supposedly already
fixed by the various patches you manually backported to your build.

I believe for fairness (mailing list traffic etc.) it would be preferable
to continue the debugging in the open BZ and not here, because you haven't
reproduced it on an upstream kernel yet, so we cannot be 100% sure whether
upstream (or at least -stable) could reproduce it.

Thanks,
Andrea