Date: Thu, 7 Jun 2012 13:52:49 -0400
From: Konrad Rzeszutek Wilk
To: Greg KH
Cc: linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	alan@lxorguk.ukuu.org.uk, Ulrich Obergfell, Andrea Arcangeli,
	Mel Gorman, Hugh Dickins, Larry Woodman, Petr Matousek,
	Rik van Riel
Subject: Re: [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs
	pmd_populate SMP race condition
Message-ID: <20120607175249.GV9472@phenom.dumpdata.com>
In-Reply-To: <20120607040337.622672845@linuxfoundation.org>

On Thu, Jun 07, 2012 at 01:03:44PM +0900, Greg KH wrote:
> 3.4-stable review patch.  If anyone has any objections, please let me know.

This breaks 32-bit Linux running under Amazon EC2. Please don't apply it
to any 3.x kernels until we figure out a fix for this.

>
> ------------------
>
> From: Andrea Arcangeli
>
> commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream.
>
> When holding the mmap_sem for reading, pmd_offset_map_lock should only
> run on a pmd_t that has been read atomically from the pmdp pointer,
> otherwise we may read only half of it, leading to this crash:
>
> PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
>  #0 [f06a9dd8] crash_kexec at c049b5ec
>  #1 [f06a9e2c] oops_end at c083d1c2
>  #2 [f06a9e40] no_context at c0433ded
>  #3 [f06a9e64] bad_area_nosemaphore at c043401a
>  #4 [f06a9e6c] __do_page_fault at c0434493
>  #5 [f06a9eec] do_page_fault at c083eb45
>  #6 [f06a9f04] error_code (via page_fault) at c083c5d5
>     EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000
>     DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
>     CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
>  #7 [f06a9f38] _spin_lock at c083bc14
>  #8 [f06a9f44] sys_mincore at c0507b7d
>  #9 [f06a9fb0] system_call at c083becd
>                          start   len
>     EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
>     DS:  007b     ESI: 00000000 ES:  007b     EDI: 003d0f00
>     SS:  007b     ESP: 62285354 EBP: 62285388 GS:  0033
>     CS:  0073     EIP: 00291416 ERR: 000000da EFLAGS: 00000286
>
> This should be a longstanding bug affecting x86 32bit PAE without THP.
> Only archs with a 64bit large pmd_t and a 32bit unsigned long should be
> affected.
>
> With THP enabled, the barrier() in pmd_none_or_trans_huge_or_clear_bad()
> would partly hide the bug when the pmd transitions from none to stable,
> by forcing a re-read of the *pmd in pmd_offset_map_lock; but with THP
> enabled a new set of problems arises from the fact that the pmd could
> then transition freely between the none, pmd_trans_huge and
> pmd_trans_stable states. So making the barrier in
> pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a good idea,
> and it would be a flaky solution anyway.
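To make the torn read concrete, here is a minimal userspace sketch of the
race (my illustration, not code from the patch; the 0x00000003fda38067
value is borrowed from the oops above). One thread publishes a 64-bit
"pmd" with a single atomic store, mimicking pmd_populate()'s set_64bit,
while the main thread reads it as two 32-bit halves, high half first, the
way gcc compiled the *pmd dereference:

/* torn_read.c - hypothetical demo; build with: gcc -O2 -pthread torn_read.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static volatile uint64_t pmd;   /* starts out as 0, i.e. "none" */

static void *populate(void *arg)
{
        /* analogue of pmd_populate()'s set_64bit(): one atomic 64-bit store */
        __atomic_store_n(&pmd, 0x00000003fda38067ULL, __ATOMIC_SEQ_CST);
        return NULL;
}

int main(void)
{
        /* little-endian x86: half[0] is the low 32 bits, half[1] the high */
        volatile uint32_t *half = (volatile uint32_t *)&pmd;
        pthread_t t;

        pthread_create(&t, NULL, populate, NULL);

        /* racy reader: high half first, then low, as in the disassembly */
        uint64_t hi = half[1];  /* may still observe 0x00000000 */
        uint64_t lo = half[0];  /* may already observe 0xfda38067 */

        /* can print 0, the full value, or the torn 0x00000000fda38067 */
        printf("observed %#llx\n", (unsigned long long)((hi << 32) | lo));

        pthread_join(t, NULL);
        return 0;
}

Run in a loop, the torn value would show up exactly when the store lands
between the two half reads, which is the window mincore hit in the crash
above.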
>
> This should be fully fixed by introducing a pmd_read_atomic that reads
> the pmd in order with THP disabled, or by reading the pmd atomically
> with cmpxchg8b with THP enabled.
>
> Luckily this new race condition only triggers in the places that must
> already be covered by pmd_none_or_trans_huge_or_clear_bad(), so the fix
> is localized there, but this bug is not related to THP.
>
> NOTE: this can trigger on x86 32bit systems with PAE enabled and with
> more than 4G of RAM; otherwise the high part of the pmd would be zero
> at all times and could never be truncated, which in turn hides the SMP
> race.
>
> This bug was discovered and fully debugged by Ulrich, quote:
>
> ----
> [..]
> pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
> eax.
>
> 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> 497 {
> 498         /* depend on compiler for an atomic pmd read */
> 499         pmd_t pmdval = *pmd;
>
> // edi = pmd pointer
> 0xc0507a74:       mov    0x8(%esp),%edi
> ...
> // edx = PTE page table high address
> 0xc0507a84:       mov    0x4(%edi),%edx
> ...
> // eax = PTE page table low address
> 0xc0507a8e:       mov    (%edi),%eax
>
> [..]
>
> Please note that the PMD is not read atomically. These are two "mov"
> instructions where the high order bits of the PMD entry are fetched
> first. Hence, the above machine code is prone to the following race:
>
> -  The PMD entry {high|low} is 0x0000000000000000.
>    The "mov" at 0xc0507a84 loads 0x00000000 into edx.
>
> -  A page fault (on another CPU) sneaks in between the two "mov"
>    instructions and instantiates the PMD.
>
> -  The PMD entry {high|low} is now 0x00000003fda38067.
>    The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
> ----
>
> Reported-by: Ulrich Obergfell
> Signed-off-by: Andrea Arcangeli
> Cc: Mel Gorman
> Cc: Hugh Dickins
> Cc: Larry Woodman
> Cc: Petr Matousek
> Cc: Rik van Riel
> Signed-off-by: Andrew Morton
> Signed-off-by: Linus Torvalds
> Signed-off-by: Greg Kroah-Hartman
>
> ---
>  arch/x86/include/asm/pgtable-3level.h |   50 ++++++++++++++++++++++++++++++++++
>  include/asm-generic/pgtable.h         |   22 +++++++++++++-
>  2 files changed, 70 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t
>  	ptep->pte_low = pte.pte_low;
>  }
>
> +#define pmd_read_atomic pmd_read_atomic
> +/*
> + * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
> + * a "*pmdp" dereference done by gcc. Problem is, in certain places
> + * where pte_offset_map_lock is called, concurrent page faults are
> + * allowed, if the mmap_sem is held for reading. An example is mincore
> + * vs page faults vs MADV_DONTNEED. On the page fault side
> + * pmd_populate rightfully does a set_64bit, but if we're reading the
> + * pmd_t with a "*pmdp" on the mincore side, an SMP race can happen
> + * because gcc will not read the 64 bits of the pmd atomically. To fix
> + * this, all places running pmd_offset_map_lock() while holding the
> + * mmap_sem in read mode shall read the pmdp pointer using this
> + * function to know if the pmd is null or not, and in turn to know if
> + * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
> + * operations.
> + *
> + * Without THP, if the mmap_sem is held for reading, the pmd can only
> + * transition from null to not null while pmd_read_atomic runs. So
> + * there's no need to literally read it atomically.
> + *
> + * With THP, if the mmap_sem is held for reading, the pmd can become
> + * THP or null or point to a pte (and in turn become "stable") at any
> + * time under pmd_read_atomic, so it's mandatory to read it atomically
> + * with cmpxchg8b.
> + */
> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	pmdval_t ret;
> +	u32 *tmp = (u32 *)pmdp;
> +
> +	ret = (pmdval_t) (*tmp);
> +	if (ret) {
> +		/*
> +		 * If the low part is null, we must not read the high part
> +		 * or we can end up with a partial pmd.
> +		 */
> +		smp_rmb();
> +		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> +	}
> +
> +	return (pmd_t) { ret };
> +}
> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
>  	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -446,6 +446,18 @@ static inline int pmd_write(pmd_t pmd)
>  #endif /* __HAVE_ARCH_PMD_WRITE */
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +#ifndef pmd_read_atomic
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	/*
> +	 * Depend on the compiler for an atomic pmd read. NOTE: this is
> +	 * only going to work if the pmdval_t isn't larger than an
> +	 * unsigned long.
> +	 */
> +	return *pmdp;
> +}
> +#endif
> +
>  /*
>   * This function is meant to be used by sites walking pagetables with
>   * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> @@ -459,11 +471,17 @@ static inline int pmd_write(pmd_t pmd)
>   * undefined so behaving like if the pmd was none is safe (because it
>   * can return none anyway). The compiler level barrier() is critically
>   * important to compute the two checks atomically on the same pmdval.
> + *
> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
> + * care of reading the pmd atomically to avoid SMP race conditions
> + * against pmd_populate() when the mmap_sem is held for reading by the
> + * caller (a special atomic read not done by "gcc" as in the generic
> + * version above is also needed when THP is disabled, because the page
> + * fault can populate the pmd from under us).
>   */
>  static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  {
> -	/* depend on compiler for an atomic pmd read */
> -	pmd_t pmdval = *pmd;
> +	pmd_t pmdval = pmd_read_atomic(pmd);
>  	/*
>  	 * The barrier will stabilize the pmdval in a register or on
>  	 * the stack so that it will stop changing under the code.
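As a footnote on why the !THP variant of pmd_read_atomic above gets away
without cmpxchg8b: with the mmap_sem held for reading and THP off, the
pmd makes at most one transition, none to stable, so reading the low half
first is enough. A self-contained userspace sketch of that ordering
argument (my paraphrase, not the kernel function itself, under the stated
single-transition assumption; the acquire fence stands in for the
kernel's smp_rmb):

#include <stdint.h>

/*
 * Sketch of the lo-then-hi read trick. Assumes the 64-bit value is
 * written once, atomically, transitioning from 0 to a stable non-zero
 * value, and never changes again while readers run.
 */
static inline uint64_t read_lo_then_hi(const volatile uint64_t *p)
{
        /* little-endian: half[0] is the low 32 bits, half[1] the high */
        const volatile uint32_t *half = (const volatile uint32_t *)p;
        uint64_t ret;

        ret = half[0];          /* low half first */
        if (ret) {
                /*
                 * The low half is non-zero, so the single atomic
                 * publish already happened and the high half can no
                 * longer change: reading it now cannot yield a torn
                 * value. The fence keeps the high-half load from being
                 * reordered before the low-half load, mirroring the
                 * kernel's smp_rmb().
                 */
                __atomic_thread_fence(__ATOMIC_ACQUIRE);
                ret |= (uint64_t)half[1] << 32;
        }
        /* a zero low half means the entry was still none; report 0 */
        return ret;
}

If the low half reads as zero we may miss a populate that is landing
right now, but returning "none" is safe there, which is exactly the
contract pmd_none_or_trans_huge_or_clear_bad() relies on ("it can return
none anyway").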