Date: Mon, 16 Nov 2015 10:45:21 +0900
From: Minchan Kim
To: "Kirill A. Shutemov"
CC: Hugh Dickins, Sasha Levin, Andrew Morton, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Rik van Riel, Mel Gorman, Michal Hocko,
 Johannes Weiner, Vlastimil Babka
Subject: Re: kernel oops on mmotm-2015-10-15-15-20
Message-ID: <20151116014521.GA7973@bbox>
References: <20151030070350.GB16099@bbox> <20151102125749.GB7473@node.shutemov.name>
 <20151103030258.GJ17906@bbox> <20151103071650.GA21553@node.shutemov.name>
 <20151103073329.GL17906@bbox> <20151103152019.GM17906@bbox>
 <20151104142135.GA13303@node.shutemov.name> <20151105001922.GD7357@bbox>
 <20151108225522.GA29600@node.shutemov.name> <20151112003614.GA5235@bbox>
In-Reply-To: <20151112003614.GA5235@bbox>

On Thu, Nov 12, 2015 at 09:36:14AM +0900, Minchan Kim wrote:
> > > mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head
> > > for 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no
> > > MADV_FREE code in there
> > > + pte_mkdirty patch
> > > + freeze/unfreeze patch
> > > + do_page_add_anon_rmap patch
> > > + above split_huge_pmd
> > >
> > >
> > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > BUG: Bad rss-counter state mm:ffff88007fa3bb80 idx:1 val:512
> >
> > With the patch below my test setup run for 2+ days without triggering the
> > bug. split_huge_pmd patch should be dropped.
> >
> > Please test.
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 14cbbad54a3e..7aa0a3fef2aa 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2841,9 +2841,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	write = pmd_write(*pmd);
> >  	young = pmd_young(*pmd);
> >  
> > -	/* leave pmd empty until pte is filled */
> > -	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > -
> >  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> >  	pmd_populate(mm, &_pmd, pgtable);
> >  
> > @@ -2893,6 +2890,28 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	}
> >  
> >  	smp_wmb(); /* make pte visible before pmd */
> > +	/*
> > +	 * Up to this point the pmd is present and huge and userland has the
> > +	 * whole access to the hugepage during the split (which happens in
> > +	 * place). If we overwrite the pmd with the not-huge version pointing
> > +	 * to the pte here (which of course we could if all CPUs were bug
> > +	 * free), userland could trigger a small page size TLB miss on the
> > +	 * small sized TLB while the hugepage TLB entry is still established in
> > +	 * the huge TLB. Some CPU doesn't like that.
> > +	 * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
> > +	 * 383 on page 93.
> > +	 * Intel should be safe but is also warns that it's
> > +	 * only safe if the permission and cache attributes of the two entries
> > +	 * loaded in the two TLB is identical (which should be the case here).
> > +	 * But it is generally safer to never allow small and huge TLB entries
> > +	 * for the same virtual address to be loaded simultaneously. So instead
> > +	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
> > +	 * current pmd notpresent (atomically because here the pmd_trans_huge
> > +	 * and pmd_trans_splitting must remain set at all times on the pmd
> > +	 * until the split is complete for this pmd), then we flush the SMP TLB
> > +	 * and finally we write the non-huge version of the pmd entry with
> > +	 * pmd_populate.
> > +	 */
> > +	pmdp_invalidate(vma, haddr, pmd);
> >  	pmd_populate(mm, pmd, pgtable);
> >  
> >  	if (freeze) {
>
> I have been tested this patch with MADV_DONTNEED for a few days and
> I couldn't see the problem any more. And I will continue to test it
> with MADV_FREE.

During the test with MADV_FREE on a kernel with your patches applied, I
couldn't see any problem. However, in this round I ran another test. It is
the same one I attached earlier, with one small difference: it doesn't do
the memcg things/kill/swapoff, so that the test program can run for a long
time. With that, I encountered this problem.

page:ffffea0000f60080 count:1 mapcount:0 mapping:ffff88007f584691 index:0x600002a02
flags: 0x400000000006a028(uptodate|lru|writeback|swapcache|reclaim|swapbacked)
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
page->mem_cgroup:ffff880077cf0c00
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:3340!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 7 PID: 1657 Comm: memhog Not tainted 4.3.0-rc5-mm1-madv-free+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b0f1a40 ti: ffff88004ced4000 task.ti: ffff88004ced4000
RIP: 0010:[] [] split_huge_page_to_list+0x907/0x920
RSP: 0018:ffff88004ced7a38 EFLAGS: 00010296
RAX: 0000000000000021 RBX: ffffea0000f60080 RCX: ffffffff81830db8
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8
RBP: ffff88004ced7ab8 R08: 0000000000000000 R09: ffff8800000bc560
R10: ffffffff8163d880 R11: 0000000000014f25 R12: ffffea0000f60080
R13: ffffea0000f60088 R14: ffffea0000f60080 R15: 0000000000000000
FS: 00007f43d3ced740(0000) GS:ffff8800782e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff1f6fcdb98 CR3: 000000004cf56000 CR4: 00000000000006a0
Stack:
 cccccccccccccccd ffffea0000f60080 ffff88004ced7ad0 ffffea0000f60088
 ffff88004ced7ad0 0000000000000000 ffff88004ced7ab8 ffffffff810ef9d0
 ffffea0000f60000 0000000000000000 0000000000000000 ffffea0000f60080
Call Trace:
 [] ? __lock_page+0xa0/0xb0
 [] deferred_split_scan+0x11c/0x260
 [] ? list_lru_count_one+0x1c/0x30
 [] shrink_slab.part.42+0x1e3/0x350
 [] shrink_zone+0x26a/0x280
 [] do_try_to_free_pages+0x12d/0x3b0
 [] try_to_free_pages+0xb4/0x140
 [] __alloc_pages_nodemask+0x459/0x920
 [] handle_mm_fault+0xc77/0x1000
 [] ? retint_kernel+0x10/0x10
 [] __do_page_fault+0x189/0x400
 [] do_page_fault+0xc/0x10
 [] page_fault+0x22/0x30
Code: ff ff 48 c7 c6 f0 b2 77 81 4c 89 f7 e8 13 c3 fc ff 0f 0b 48 83 e8 01 e9 88 f7 ff ff 48 c7 c6 70 a1 77 81 4c 89 f7 e8 f9 c2 fc ff <0f> 0b 48 c7 c6 38 af 77 81 4c 89 e7 e8 e8 c2 fc ff 0f 0b 66 0f
RIP [] split_huge_page_to_list+0x907/0x920
 RSP
---[ end trace c9a60522e3a296e4 ]---

So I reverted all the MADV_FREE patches and switched the test to
MADV_DONTNEED. This time I saw the oops below.
If I'm missing something, please let me know.

------------[ cut here ]------------
kernel BUG at include/linux/swapops.h:129!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 5 PID: 1563 Comm: madvise_test Not tainted 4.3.0-rc5-mm1-no-madv-free+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88007e8d3480 ti: ffff88007f524000 task.ti: ffff88007f524000
RIP: 0010:[] [] migration_entry_to_page.part.61+0x4/0x6
RSP: 0018:ffff88007f527cd0 EFLAGS: 00010246
RAX: ffffea0000896b00 RBX: 00006000013ac000 RCX: ffffea0000000000
RDX: 0000000000000000 RSI: ffffea0001f93e80 RDI: 3e000000000225ac
RBP: ffff88007f527cd0 R08: 0000000000000101 R09: ffff88007e4fa000
R10: ffffea0001fda740 R11: 0000000000000000 R12: 00000000044b583e
R13: 00006000013ad000 R14: ffff88007f527e00 R15: ffff88007e4fad60
FS: 00007fe2f099a740(0000) GS:ffff8800782a0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000166c0d0 CR3: 000000007e57b000 CR4: 00000000000006a0
Stack:
 ffff88007f527db8 ffffffff81118030 00006000017fffff ffff88007f527e00
 00006000017fffff ffff88007ed71000 ffff88007e57b600 0000600001800000
 0000600001800000 00006000017fffff 0000600001800000 ffff88007efb6b78
Call Trace:
 [] unmap_single_vma+0x840/0x880
 [] unmap_vmas+0x41/0x60
 [] unmap_region+0x9d/0x100
 [] do_munmap+0x217/0x380
 [] vm_munmap+0x41/0x60
 [] SyS_munmap+0x22/0x30
 [] entry_SYSCALL_64_fastpath+0x12/0x6a
Code: df 48 c1 ff 06 49 01 fc 4c 89 e7 e8 9c ff ff ff 85 c0 74 0c 4c 89 e0 48 c1 e0 06 48 29 d8 eb 02 31 c0 5b 41 5c 5d c3 55 48 89 e5 <0f> 0b 55 48 c7 c6 30 80 77 81 48 89 e5 e8 f0 45 fc ff 0f 0b 55
RIP [] migration_entry_to_page.part.61+0x4/0x6
 RSP
---[ end trace 01097fb7f9cf1b6c ]---

Another hit:

page:ffffea0000520080 count:2 mapcount:0 mapping:ffff880072b38a51 index:0x600002602
flags: 0x4000000000048028(uptodate|lru|swapcache|swapbacked)
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
page->mem_cgroup:ffff880077cf0c00
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:3306!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 6 PID: 1419 Comm: madvise_test Not tainted 4.3.0-rc5-mm1-no-madv-free+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006f108000 ti: ffff88006f054000 task.ti: ffff88006f054000
RIP: 0010:[] [] split_huge_page_to_list+0x81f/0x890
RSP: 0000:ffff88006f057a40 EFLAGS: 00010282
RAX: 0000000000000021 RBX: ffffea0000520080 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821dd418
RBP: ffff88006f057ab8 R08: 0000000000000000 R09: ffff8800000bfb20
R10: ffffffff8163d1c0 R11: 0000000000005c5f R12: ffff88006f057ad0
R13: ffffea0000520080 R14: ffffea0000520080 R15: 0000000000000000
FS: 00007f09963a2740(0000) GS:ffff8800782c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000600003d92000 CR3: 000000007372e000 CR4: 00000000000006a0
Stack:
 ffffea0000520080 ffff88006f057ad0 ffffea0000520088 ffff88006f057ad0
 0000000000000000 ffff88006f057ab8 ffffffff810ec700 ffffea0000520000
 0000000000000000 0000000000000000 ffffea0000520080 ffff88006f057ad0
Call Trace:
 [] ? __lock_page+0xa0/0xb0
 [] deferred_split_scan+0x115/0x240
 [] ? list_lru_count_one+0x1c/0x30
 [] shrink_slab.part.43+0x1e3/0x350
 [] shrink_zone+0x238/0x250
 [] do_try_to_free_pages+0x12d/0x3b0
 [] try_to_free_pages+0xb4/0x140
 [] __alloc_pages_nodemask+0x459/0x920
 [] handle_mm_fault+0xbca/0xf90
 [] ? enqueue_task+0x3c/0x60
 [] ?
__set_cpus_allowed_ptr+0x9b/0x1a0
 [] __do_page_fault+0x189/0x400
 [] do_page_fault+0xc/0x10
 [] page_fault+0x22/0x30
Code: ff ff 48 c7 c6 d0 91 77 81 4c 89 f7 e8 1b d7 fc ff 0f 0b 48 83 e8 01 e9 70 f8 ff ff 48 c7 c6 50 80 77 81 4c 89 f7 e8 01 d7 fc ff <0f> 0b 48 c7 c6 d8 be 77 81 4c 89 ef e8 f0 d6 fc ff 0f 0b 48 83
RIP [] split_huge_page_to_list+0x81f/0x890
 RSP
---[ end trace 0ce8751b8410cd8e ]---
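
For anyone trying to reproduce this class of failure, here is a minimal sketch of the kind of workload discussed in this thread: fault in THP-backed anonymous memory, then discard one small page inside each huge-page-sized chunk with MADV_DONTNEED so that the kernel has to split the huge pmd. This is not the test program attached earlier in the thread. The function name stress_round, the assumed 2 MiB huge page size, and the choice of which small page to discard are all assumptions made for illustration, and a single round like this is unlikely to hit the race by itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)	/* assumption: 2 MiB THP, as on x86-64 */

/*
 * Hypothetical helper (not from the thread): map nr_hpages chunks of
 * HPAGE_SIZE, dirty them, then drop one base page inside each chunk.
 * Returns 0 on success, -1 on any failure.
 */
int stress_round(size_t nr_hpages)
{
	size_t len = nr_hpages * HPAGE_SIZE;
	size_t map_len = len + HPAGE_SIZE;	/* slack for alignment */
	long page = sysconf(_SC_PAGESIZE);
	char *map, *buf;
	size_t i;
	int ret = 0;

	if (page <= 0 || (size_t)page >= HPAGE_SIZE)
		return -1;

	map = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Round up to a huge page boundary so THP can back the range. */
	buf = (char *)(((uintptr_t)map + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
	assert(buf + len <= map + map_len);

#ifdef MADV_HUGEPAGE
	/* Only a hint; ignore failure on kernels built without THP. */
	(void)madvise(buf, len, MADV_HUGEPAGE);
#endif

	memset(buf, 0xaa, len);		/* fault everything in */

	for (i = 0; i < nr_hpages; i++) {
		/* Second base page of the chunk: forces a partial unmap. */
		char *victim = buf + i * HPAGE_SIZE + (size_t)page;

		if (madvise(victim, (size_t)page, MADV_DONTNEED) != 0) {
			ret = -1;
			break;
		}
		/*
		 * Private anonymous memory reads back as zero after
		 * MADV_DONTNEED; the neighbouring byte must be intact.
		 */
		if (victim[0] != 0 || victim[-1] != (char)0xaa) {
			ret = -1;
			break;
		}
	}

	if (munmap(map, map_len) != 0)
		ret = -1;
	return ret;
}
```

To approximate the long-running test, one would call stress_round() in a loop from several processes while the machine is under memory pressure with swap enabled, since both VM_BUG_ON_PAGE hits above were reached from the deferred_split_scan shrinker during reclaim.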