Date: Thu, 16 Feb 2012 01:53:04 -0800 (PST)
From: Hugh Dickins
To: Andrea Arcangeli
cc: Dave Jones, Andrew Morton, David Rientjes,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    kernel-team@fedoraproject.org
Subject: Re: exit_mmap() BUG_ON triggering since 3.1
In-Reply-To: <20120216070753.GA23585@redhat.com>
References: <20120215183317.GA26977@redhat.com> <20120216070753.GA23585@redhat.com>

On Thu, 16 Feb 2012, Andrea Arcangeli wrote:
> On Wed, Feb 15, 2012 at 06:14:12PM -0800, Hugh Dickins wrote:
> > Now most of those paths in THP also hold mmap_sem for read or write (with
> > appropriate checks on mm_users), but two do not: when split_huge_page()
> > is called by hwpoison_user_mappings(), and when called by add_to_swap().
>
> So the race is split_huge_page_map() called by add_to_swap() running
> concurrently with free_pgtables. Great catch!!
>
> > Or should we try harder to avoid the extra locking: test mm_users?
> > #ifdef on THP?  Or consider the accuracy of this count not worth
> > extra locking, and just scrap the BUG_ON now?
>
> It's probably also happening with a large munmap, while add_to_swap
> runs on another vma. Process didn't exit yet, but the actual BUG_ON
> check runs at exit. So I doubt aborting split_huge_page on zero
> mm_users could solve it.

Indeed, what I meant was, I was wondering whether to make the spin_lock
and spin_unlock in my patch conditional on mm_users, not to make
split_huge_page conditional on it.

>
> Good part is, this being a false positive makes these oopses a
> nuisance, so it means they can't corrupt any memory or disk etc...

Yes (and I think less troublesome than most BUGs, coming at exit while
not holding locks; though we could well make it a WARN_ON, I don't
think that existed back in the day).

>
> The simplest is probably to change nr_ptes to count THPs too. I tried
> that and no oopses so far but it's not very well tested yet.

Oh, I like that, that's a much nicer fix than mine.

If you're happy to change the THP end (which I could hardly blame for
getting those odd rules slightly wrong), and it passes your testing,
then certainly add my
Acked-by: Hugh Dickins

In looking into the bug, it had actually bothered me a little that you
were setting aside those pages, yet not counting them into nr_ptes;
though the only thing that cares is oom_kill.c, and the count of pages
in each hugepage can only dwarf the count in nr_ptes (whereas, without
hugepages, it's possible to populate very sparsely and nr_ptes become
significant).
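For anyone following along from the Fedora reports, the check that fires
is the pagetable-leak sanity test at the tail of exit_mmap(), after
free_pgtables() has torn everything down - roughly this, paraphrasing
3.1-era mm/mmap.c from memory rather than quoting it exactly:

	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
	tlb_finish_mmu(&tlb, 0, end);

	/* close and free the vmas, no MM locks held by now */
	while (vma)
		vma = remove_vma(vma);

	/*
	 * Every pagetable page ought to have been freed and uncounted
	 * by this point: the right-hand side works out to 0 on x86,
	 * so any leftover nr_ptes trips the BUG.
	 */
	BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);

which is how a split_huge_page() arriving via add_to_swap(), adjusting
nr_ptes under page_table_lock alone, can leave the count looking wrong
at just the moment this runs, even though nothing has really leaked.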
>
> ====
> From: Andrea Arcangeli
> Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes
>
> Quoting Hugh's discovery and explanation of the SMP race condition:
>
> ===
> mm->nr_ptes had unusual locking: down_read mmap_sem plus
> page_table_lock when incrementing, down_write mmap_sem (or mm_users 0)
> when decrementing; whereas THP is careful to increment and decrement
> it under page_table_lock.
>
> Now most of those paths in THP also hold mmap_sem for read or write
> (with appropriate checks on mm_users), but two do not: when
> split_huge_page() is called by hwpoison_user_mappings(), and when
> called by add_to_swap().
>
> It's conceivable that the latter case is responsible for the
> exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora.
> ===
>
> The simplest way to fix it without having to alter the locking is to
> make split_huge_page() a noop in nr_ptes terms, by counting the
> preallocated pagetable that exists for every mapped hugepage. It was
> an arbitrary choice not to count them, and neither way is wrong or
> right, because they are not used but they're still allocated.
>
> Reported-by: Dave Jones
> Reported-by: Hugh Dickins
> Signed-off-by: Andrea Arcangeli
> ---
>  mm/huge_memory.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 91d3efb..8f7fc39 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -671,6 +671,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  		set_pmd_at(mm, haddr, pmd, entry);
>  		prepare_pmd_huge_pte(pgtable, mm);
>  		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +		mm->nr_ptes++;
>  		spin_unlock(&mm->page_table_lock);
>  	}
>
> @@ -789,6 +790,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pmd = pmd_mkold(pmd_wrprotect(pmd));
>  	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>  	prepare_pmd_huge_pte(pgtable, dst_mm);
> +	dst_mm->nr_ptes++;
>
>  	ret = 0;
>  out_unlock:
> @@ -887,7 +889,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	}
>  	kfree(pages);
>
> -	mm->nr_ptes++;
>  	smp_wmb(); /* make pte visible before pmd */
>  	pmd_populate(mm, pmd, pgtable);
>  	page_remove_rmap(page);
> @@ -1047,6 +1048,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			VM_BUG_ON(page_mapcount(page) < 0);
>  			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>  			VM_BUG_ON(!PageHead(page));
> +			tlb->mm->nr_ptes--;
>  			spin_unlock(&tlb->mm->page_table_lock);
>  			tlb_remove_page(tlb, page);
>  			pte_free(tlb->mm, pgtable);
> @@ -1375,7 +1377,6 @@ static int __split_huge_page_map(struct page *page,
>  		pte_unmap(pte);
>  	}
>
> -	mm->nr_ptes++;
>  	smp_wmb(); /* make pte visible before pmd */
>  	/*
>  	 * Up to this point the pmd is present and huge and
> @@ -1988,7 +1989,6 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	set_pmd_at(mm, address, pmd, _pmd);
>  	update_mmu_cache(vma, address, _pmd);
>  	prepare_pmd_huge_pte(pgtable, mm);
> -	mm->nr_ptes--;
>  	spin_unlock(&mm->page_table_lock);
>
>  #ifndef CONFIG_NUMA