Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754889Ab3GWHY3 (ORCPT ); Tue, 23 Jul 2013 03:24:29 -0400
Received: from mail-oa0-f66.google.com ([209.85.219.66]:55364 "EHLO mail-oa0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754222Ab3GWHY1 (ORCPT ); Tue, 23 Jul 2013 03:24:27 -0400
Message-ID: <51EE2FA1.9080504@gmail.com>
Date: Tue, 23 Jul 2013 01:24:17 -0600
From: Hush Bensen
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130329 Thunderbird/17.0.5
MIME-Version: 1.0
To: Minchan Kim
CC: "Aneesh Kumar K.V" , Dave Jones , Linux Kernel , linux-mm@kvack.org, Rik van Riel , Michal Hocko , KAMEZAWA Hiroyuki , Hillf Danton , Andrew Morton
Subject: Re: hugepage related lockdep trace.
References: <20130717153223.GD27731@redhat.com> <20130718000901.GA31972@blaptop> <87hafrdatb.fsf@linux.vnet.ibm.com> <20130719001303.GB23354@blaptop>
In-Reply-To: <20130719001303.GB23354@blaptop>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 6820
Lines: 142

On 07/18/2013 06:13 PM, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
>> Minchan Kim writes:
>>
>>> Ccing people get_maintainer says.
>>>
>>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>>>> [128095.470960] =================================
>>>> [128095.471315] [ INFO: inconsistent lock state ]
>>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>>>> [128095.472156] ---------------------------------
>>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>>> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [] page_referenced+0x87/0x5e3
>>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>>>> [128095.475866]   [] mark_held_locks+0x81/0xe7
>>>> [128095.476597]   [] lockdep_trace_alloc+0x5e/0xbc
>>>> [128095.477322]   [] __alloc_pages_nodemask+0x8b/0x9b6
>>>> [128095.478049]   [] __get_free_pages+0x20/0x31
>>>> [128095.478769]   [] get_zeroed_page+0x12/0x14
>>>> [128095.479477]   [] __pmd_alloc+0x1c/0x6b
>>>> [128095.480138]   [] huge_pmd_share+0x265/0x283
>>>> [128095.480138]   [] huge_pte_alloc+0x5d/0x71
>>>> [128095.480138]   [] hugetlb_fault+0x7c/0x64a
>>>> [128095.480138]   [] handle_mm_fault+0x255/0x299
>>>> [128095.480138]   [] __do_page_fault+0x142/0x55c
>>>> [128095.480138]   [] do_page_fault+0xd/0x16
>>>> [128095.480138]   [] error_code+0x6c/0x74
>>>> [128095.480138] irq event stamp: 3136917
>>>> [128095.480138] hardirqs last enabled at (3136917): [] _raw_spin_unlock_irq+0x27/0x50
>>>> [128095.480138] hardirqs last disabled at (3136916): [] _raw_spin_lock_irq+0x15/0x78
>>>> [128095.480138] softirqs last enabled at (3136180): [] __do_softirq+0x137/0x30f
>>>> [128095.480138] softirqs last disabled at (3136175): [] irq_exit+0xa8/0xaa
>>>> [128095.480138]
>>>> other info that might help us debug this:
>>>> [128095.480138]  Possible unsafe locking scenario:
>>>>
>>>> [128095.480138]        CPU0
>>>> [128095.480138]        ----
>>>> [128095.480138]   lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]
>>>> [128095.480138]     lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]
>>>>  *** DEADLOCK ***
>>>>
>>>> [128095.480138] no locks held by kswapd0/49.
>>>> [128095.480138]
>>>> stack backtrace:
>>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>>>> [128095.480138] Hardware name: Dell Inc.
Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
>>>> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>>>> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>>>> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>>>> [128095.480138] Call Trace:
>>>> [128095.480138]  [] dump_stack+0x4b/0x79
>>>> [128095.480138]  [] print_usage_bug+0x1d9/0x1e3
>>>> [128095.480138]  [] mark_lock+0x1e0/0x261
>>>> [128095.480138]  [] ? check_usage_backwards+0x109/0x109
>>>> [128095.480138]  [] __lock_acquire+0x623/0x17f2
>>>> [128095.480138]  [] ? sched_clock_cpu+0xcd/0x130
>>>> [128095.480138]  [] ? sched_clock_local+0x42/0x12e
>>>> [128095.480138]  [] lock_acquire+0x7d/0x195
>>>> [128095.480138]  [] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [] mutex_lock_nested+0x6c/0x3a7
>>>> [128095.480138]  [] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>>>> [128095.480138]  [] page_referenced+0x87/0x5e3
>>>> [128095.480138]  [] ? raid0_congested+0x26/0x8a [raid0]
>>>> [128095.480138]  [] shrink_page_list+0x3d9/0x947
>>>> [128095.480138]  [] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [] shrink_inactive_list+0x155/0x4cb
>>>> [128095.480138]  [] shrink_lruvec+0x300/0x5ce
>>>> [128095.480138]  [] shrink_zone+0x53/0x14e
>>>> [128095.480138]  [] kswapd+0x517/0xa75
>>>> [128095.480138]  [] ? mem_cgroup_shrink_node_zone+0x280/0x280
>>>> [128095.480138]  [] kthread+0xa8/0xaa
>>>> [128095.480138]  [] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [] ret_from_kernel_thread+0x1b/0x28
>>>> [128095.480138]  [] ? insert_kthread_work+0x63/0x63
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>>         pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>         mutex_unlock(&mapping->i_mmap_mutex);
>>         return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> True. Sorry for that I didn't review the code carefully and I was very paranoid
> in reclaim-over-fs due to internal works. :(

Could you explain more about reclaim-over-fs stuff?

>
>> Something like ?
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 83aff0a..2cb1be3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>                  put_page(virt_to_page(spte));
>>          spin_unlock(&mm->page_table_lock);
>>  out:
>> -        pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>          mutex_unlock(&mapping->i_mmap_mutex);
>> +        pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>          return pte;
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?
>
>
>>  }
>>
>> -aneesh
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/