Received: by 2002:a25:824b:0:0:0:0:0 with SMTP id d11csp1797252ybn; Thu, 26 Sep 2019 02:19:47 -0700 (PDT) X-Google-Smtp-Source: APXvYqzgPa5G/Ia+tCX/q9Yt0OAYBP1p5aGnz+992jpSdYAhbiif3S8O10/sbbJ7+W7304X9QOki X-Received: by 2002:a50:dac2:: with SMTP id s2mr2457474edj.26.1569489587027; Thu, 26 Sep 2019 02:19:47 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1569489587; cv=none; d=google.com; s=arc-20160816; b=FT0oy8e63rJZWU54gPiX7vLRSwxftJ/u2kOdLjLHcxwIHGN084wHK94FUWEoW8tcGU RS36nyBE+ujoBHbYGldllv1Dm73am5rY7lSluouP3nH9OTrWfRBgDeZjydt9hdCdUhem ALymVmouhXYVtpahFk/YT+7oI3WLBFGvUI2gwqtv9N6UGKmbnsgAyB7VF7WVOjqKqOrw koI+aLw7BXoCRfjxdy6Y2C2TH/J/sR8K1+NWpoxTnZKrWBYGZ8dWKpv1XDMvQwkwooO9 D3h1POt2CLo658D3PXBRQ2lZ/3GVPlUSxoyGb1BSKgXSrf3YwxiDiFlQAGIFVAv+i2Wk Uerg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:reply-to:message-id :subject:cc:to:from:date; bh=5mWFZYaQDgq3HwFxDEwmjqJqC3AsgyNydzE0/FXgs50=; b=toH3MH6910aEEGR7dtF75G62ttz3YOxzJ2PbrmLE/ywtbllfvnilwDBaSruQcAr44g faJWDcxrlmRH66MojT9UIyPyYrawtUn4lL2NegUh8HKjEssblQQ8mOE9Iyt0xING7Je9 7cGgCjiRdtAvXrJj63TL5x2P8fH+gg5Sm5haQ0hv+Ii6L1YcLgtbb+5tMFPsvcNHpebm qp6eqyDI8WbK6fEbd5jsbFLt4R4AMbz5wxk/iG3W1wDmTh45Bd43J0JCYelyB0zhSjPP PnFOXwjU5BC+3WIa45O/9unzNrzCmwPc1lGFJcoQ8udHjF42fyld/4rwsWnaUUoinxov 7Saw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id jo19si728139ejb.271.2019.09.26.02.19.23; Thu, 26 Sep 2019 02:19:47 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2442601AbfIYH3P (ORCPT + 99 others); Wed, 25 Sep 2019 03:29:15 -0400 Received: from mga11.intel.com ([192.55.52.93]:28444 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2442594AbfIYH3P (ORCPT ); Wed, 25 Sep 2019 03:29:15 -0400 X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga102.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 25 Sep 2019 00:29:13 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.64,547,1559545200"; d="scan'208";a="191258350" Received: from richard.sh.intel.com (HELO localhost) ([10.239.159.54]) by orsmga003.jf.intel.com with ESMTP; 25 Sep 2019 00:29:11 -0700 Date: Wed, 25 Sep 2019 15:28:51 +0800 From: Wei Yang To: Ingo Molnar Cc: Wei Yang , tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org, linux-kernel@vger.kernel.org, Peter Zijlstra , Andy Lutomirski , Linus Torvalds , Rik van Riel , Dave Hansen Subject: Re: [PATCH] x86/mm: Clean up the pmd_read_atomic() comments Message-ID: <20190925072851.GA32558@richard> Reply-To: Wei Yang References: <20190925014453.20236-1-richardw.yang@linux.intel.com> <20190925015956.GA20295@richard> <20190925065514.GA21740@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190925065514.GA21740@gmail.com> User-Agent: Mutt/1.9.4 (2018-02-28) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Sep 25, 2019 at 08:55:14AM +0200, Ingo Molnar wrote: > >* Wei Yang wrote: > >> To be honest, I have a question on how this works. >> >> As the comment says, we need to call pmd_read_atomic before using >> pte_offset_map_lock to avoid data corruption. > >There's only a risk of data corruption if mmap_sem is held for reading. >If it's held for writing then the pagetable contents should be stable. > >> For example, in function swapin_walk_pmd_entry: >> >> pmd_none_or_trans_huge_or_clear_bad(pmd) >> pmd_read_atomic(pmd) --- 1 >> pte_offset_map_lock(mm, pmd, ...) --- 2 >> >> At point 1, we are assured the content is intact. While in point 2, we would >> read pmd again to calculate the pte address. How we ensure this time the >> content is intact? Because pmd_none_or_trans_huge_or_clear_bad() ensures the >> pte is stable, so that the content won't be changed? > >Indeed pte_offset_map_lock() will take a non-atomic *pmd value in >pte_lockptr() before taking the pte spinlock. > >I believe the rule here is that if pmd_none_or_trans_huge_or_clear_bad() >finds the pmd 'stable' while holding the mmap_sem read-locked, then the >pmd cannot change while we are continuously holding the mmap_sem. > I have the same gut feeling as you. Then I have another question, why in page fault routines, we don't use pmd_read_atomic() to retrieve pmd value? Since we support concurrent page fault, and we just grap mmap_sem read-locked during page fault, the value could be corrupted too. For example in __handle_mm_fault() pmd_t orig_pmd = *vmf.pmd; barrier(); This is why the barrier() is here? >Hence the followup pte_offset_map_lock() and other iterators can look at >the value of the pmd without locking. (Individual pte entries still need >the pte-lock, because they might be faulted-in in parallel.) > >So the pmd use pattern in swapin_walk_pmd_entry() should be safe. > >I'm not 100% sure though - so I've added a few more Cc:s ... > >I've also cleaned up the pmd_read_atomic() some more to make it more >readable - see the patch below. > >Thanks, > > Ingo > >==================> >From: Ingo Molnar >Date: Wed, 25 Sep 2019 08:38:57 +0200 >Subject: [PATCH] x86/mm: Clean up the pmd_read_atomic() comments > >Fix spelling, consistent parenthesis and grammar - and also clarify >the language where needed. > >Cc: Andy Lutomirski >Cc: Borislav Petkov >Cc: Dave Hansen >Cc: H. Peter Anvin >Cc: Linus Torvalds >Cc: Peter Zijlstra >Cc: Rik van Riel >Cc: Thomas Gleixner >Cc: Wei Yang >Link: https://lkml.kernel.org/r/20190925014453.20236-1-richardw.yang@linux.intel.com >Signed-off-by: Ingo Molnar The following change looks good to me. Reviewed-by: Wei Yang >--- > arch/x86/include/asm/pgtable-3level.h | 44 ++++++++++++++++++----------------- > 1 file changed, 23 insertions(+), 21 deletions(-) > >diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h >index 1796462ff143..5afb5e0fe903 100644 >--- a/arch/x86/include/asm/pgtable-3level.h >+++ b/arch/x86/include/asm/pgtable-3level.h >@@ -36,39 +36,41 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte) > > #define pmd_read_atomic pmd_read_atomic > /* >- * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with >- * a "*pmdp" dereference done by gcc. Problem is, in certain places >- * where pte_offset_map_lock is called, concurrent page faults are >+ * pte_offset_map_lock() on 32-bit PAE kernels was reading the pmd_t with >+ * a "*pmdp" dereference done by GCC. Problem is, in certain places >+ * where pte_offset_map_lock() is called, concurrent page faults are > * allowed, if the mmap_sem is hold for reading. An example is mincore > * vs page faults vs MADV_DONTNEED. On the page fault side >- * pmd_populate rightfully does a set_64bit, but if we're reading the >+ * pmd_populate() rightfully does a set_64bit(), but if we're reading the > * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen >- * because gcc will not read the 64bit of the pmd atomically. To fix >- * this all places running pte_offset_map_lock() while holding the >+ * because GCC will not read the 64-bit value of the pmd atomically. >+ * >+ * To fix this all places running pte_offset_map_lock() while holding the > * mmap_sem in read mode, shall read the pmdp pointer using this >- * function to know if the pmd is null nor not, and in turn to know if >+ * function to know if the pmd is null or not, and in turn to know if > * they can run pte_offset_map_lock() or pmd_trans_huge() or other pmd > * operations. > * >- * Without THP if the mmap_sem is hold for reading, the pmd can only >- * transition from null to not null while pmd_read_atomic runs. So >+ * Without THP if the mmap_sem is held for reading, the pmd can only >+ * transition from null to not null while pmd_read_atomic() runs. So > * we can always return atomic pmd values with this function. > * >- * With THP if the mmap_sem is hold for reading, the pmd can become >+ * With THP if the mmap_sem is held for reading, the pmd can become > * trans_huge or none or point to a pte (and in turn become "stable") >- * at any time under pmd_read_atomic. We could read it really >- * atomically here with a atomic64_read for the THP enabled case (and >+ * at any time under pmd_read_atomic(). We could read it truly >+ * atomically here with an atomic64_read() for the THP enabled case (and > * it would be a whole lot simpler), but to avoid using cmpxchg8b we > * only return an atomic pmdval if the low part of the pmdval is later >- * found stable (i.e. pointing to a pte). And we're returning a none >- * pmdval if the low part of the pmd is none. In some cases the high >- * and low part of the pmdval returned may not be consistent if THP is >- * enabled (the low part may point to previously mapped hugepage, >- * while the high part may point to a more recently mapped hugepage), >- * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part >- * of the pmd to be read atomically to decide if the pmd is unstable >- * or not, with the only exception of when the low part of the pmd is >- * zero in which case we return a none pmd. >+ * found to be stable (i.e. pointing to a pte). We are also returning a >+ * 'none' (zero) pmdval if the low part of the pmd is zero. >+ * >+ * In some cases the high and low part of the pmdval returned may not be >+ * consistent if THP is enabled (the low part may point to previously >+ * mapped hugepage, while the high part may point to a more recently >+ * mapped hugepage), but pmd_none_or_trans_huge_or_clear_bad() only >+ * needs the low part of the pmd to be read atomically to decide if the >+ * pmd is unstable or not, with the only exception when the low part >+ * of the pmd is zero, in which case we return a 'none' pmd. > */ > static inline pmd_t pmd_read_atomic(pmd_t *pmdp) > { -- Wei Yang Help you, Help me