Date: Wed, 25 Sep 2019 08:55:14 +0200
From: Ingo Molnar
To: Wei Yang
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org,
    linux-kernel@vger.kernel.org, Peter Zijlstra, Andy Lutomirski,
    Linus Torvalds, Rik van Riel, Dave Hansen
Subject: [PATCH] x86/mm: Clean up the pmd_read_atomic() comments
Message-ID: <20190925065514.GA21740@gmail.com>
References: <20190925014453.20236-1-richardw.yang@linux.intel.com>
 <20190925015956.GA20295@richard>
In-Reply-To: <20190925015956.GA20295@richard>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Mailing-List: linux-kernel@vger.kernel.org

* Wei Yang wrote:

> To be honest, I have a question about how this works.
>
> As the comment says, we need to call pmd_read_atomic() before using
> pte_offset_map_lock() to avoid data corruption.

There's only a risk of data corruption if mmap_sem is held for reading.
If it's held for writing then the pagetable contents should be stable.

> For example, in function swapin_walk_pmd_entry():
>
>     pmd_none_or_trans_huge_or_clear_bad(pmd)
>         pmd_read_atomic(pmd)                  --- 1
>     pte_offset_map_lock(mm, pmd, ...)         --- 2
>
> At point 1 we are assured the content is intact. But at point 2 we read
> the pmd again to calculate the pte address. How do we ensure that the
> content is intact this time? Is it because
> pmd_none_or_trans_huge_or_clear_bad() ensures the pte is stable, so
> that the content won't be changed?

Indeed pte_offset_map_lock() will take a non-atomic *pmd value in
pte_lockptr() before taking the pte spinlock.

I believe the rule here is that if pmd_none_or_trans_huge_or_clear_bad()
finds the pmd 'stable' while holding the mmap_sem read-locked, then the
pmd cannot change while we are continuously holding the mmap_sem.

Hence the followup pte_offset_map_lock() and other iterators can look at
the value of the pmd without locking. (Individual pte entries still need
the pte-lock, because they might be faulted in in parallel.)

So the pmd use pattern in swapin_walk_pmd_entry() should be safe (a
minimal sketch of the pattern is included below, before the patch).

I'm not 100% sure though - so I've added a few more Cc:s ...

I've also cleaned up the pmd_read_atomic() comments some more to make
them more readable - see the patch below.

Thanks,

	Ingo
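For illustration, here is a minimal sketch of the access pattern discussed
above, under the assumption that the caller holds mmap_sem for reading. The
walker function and its arguments are hypothetical stand-ins, not the real
swapin_walk_pmd_entry():

#include <linux/mm.h>
#include <asm/pgtable.h>

/*
 * Hypothetical pte walker, modelled on the pattern discussed above.
 * The caller is assumed to hold down_read(&mm->mmap_sem).
 */
static int walk_ptes_in_pmd(struct mm_struct *mm, pmd_t *pmd,
			    unsigned long addr, unsigned long end)
{
	pte_t *start_pte, *pte;
	spinlock_t *ptl;

	/*
	 * Point 1: samples the pmd atomically (via pmd_read_atomic() on
	 * 32-bit PAE) and bails out if it is none, trans_huge or bad.
	 * If it is found 'stable' here, it cannot change under us while
	 * mmap_sem is continuously held for reading.
	 */
	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
		return 0;

	/*
	 * Point 2: re-reads *pmd non-atomically (in pte_lockptr() and for
	 * the pte address calculation), which is safe per the rule above.
	 */
	start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = *pte;

		/* ... examine ptent under the pte lock ... */
		(void)ptent;
	}
	pte_unmap_unlock(start_pte, ptl);

	return 0;
}

(The real swapin_walk_pmd_entry() takes and drops the pte lock per entry
rather than around the whole loop, but the pmd handling is the same shape.)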
==================>
From: Ingo Molnar
Date: Wed, 25 Sep 2019 08:38:57 +0200
Subject: [PATCH] x86/mm: Clean up the pmd_read_atomic() comments

Fix spelling, use consistent parentheses and grammar - and also clarify
the language where needed.

Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: Wei Yang
Link: https://lkml.kernel.org/r/20190925014453.20236-1-richardw.yang@linux.intel.com
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/pgtable-3level.h | 44 ++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 1796462ff143..5afb5e0fe903 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -36,39 +36,41 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 
 #define pmd_read_atomic pmd_read_atomic
 /*
- * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
- * a "*pmdp" dereference done by gcc. Problem is, in certain places
- * where pte_offset_map_lock is called, concurrent page faults are
+ * pte_offset_map_lock() on 32-bit PAE kernels was reading the pmd_t with
+ * a "*pmdp" dereference done by GCC. Problem is, in certain places
+ * where pte_offset_map_lock() is called, concurrent page faults are
  * allowed, if the mmap_sem is hold for reading. An example is mincore
  * vs page faults vs MADV_DONTNEED. On the page fault side
- * pmd_populate rightfully does a set_64bit, but if we're reading the
+ * pmd_populate() rightfully does a set_64bit(), but if we're reading the
  * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
- * because gcc will not read the 64bit of the pmd atomically. To fix
- * this all places running pte_offset_map_lock() while holding the
+ * because GCC will not read the 64-bit value of the pmd atomically.
+ *
+ * To fix this all places running pte_offset_map_lock() while holding the
  * mmap_sem in read mode, shall read the pmdp pointer using this
- * function to know if the pmd is null nor not, and in turn to know if
+ * function to know if the pmd is null or not, and in turn to know if
  * they can run pte_offset_map_lock() or pmd_trans_huge() or other pmd
  * operations.
  *
- * Without THP if the mmap_sem is hold for reading, the pmd can only
- * transition from null to not null while pmd_read_atomic runs. So
+ * Without THP if the mmap_sem is held for reading, the pmd can only
+ * transition from null to not null while pmd_read_atomic() runs. So
  * we can always return atomic pmd values with this function.
  *
- * With THP if the mmap_sem is hold for reading, the pmd can become
+ * With THP if the mmap_sem is held for reading, the pmd can become
  * trans_huge or none or point to a pte (and in turn become "stable")
- * at any time under pmd_read_atomic. We could read it really
- * atomically here with a atomic64_read for the THP enabled case (and
+ * at any time under pmd_read_atomic(). We could read it truly
+ * atomically here with an atomic64_read() for the THP enabled case (and
  * it would be a whole lot simpler), but to avoid using cmpxchg8b we
  * only return an atomic pmdval if the low part of the pmdval is later
- * found stable (i.e. pointing to a pte). And we're returning a none
- * pmdval if the low part of the pmd is none. In some cases the high
- * and low part of the pmdval returned may not be consistent if THP is
- * enabled (the low part may point to previously mapped hugepage,
- * while the high part may point to a more recently mapped hugepage),
- * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part
- * of the pmd to be read atomically to decide if the pmd is unstable
- * or not, with the only exception of when the low part of the pmd is
- * zero in which case we return a none pmd.
+ * found to be stable (i.e. pointing to a pte). We are also returning a
+ * 'none' (zero) pmdval if the low part of the pmd is zero.
+ *
+ * In some cases the high and low part of the pmdval returned may not be
+ * consistent if THP is enabled (the low part may point to previously
+ * mapped hugepage, while the high part may point to a more recently
+ * mapped hugepage), but pmd_none_or_trans_huge_or_clear_bad() only
+ * needs the low part of the pmd to be read atomically to decide if the
+ * pmd is unstable or not, with the only exception when the low part
+ * of the pmd is zero, in which case we return a 'none' pmd.
  */
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
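Since the hunk above only touches the comment, here is a sketch of the kind
of low/high split read that the comment describes - reconstructed for
illustration (and named accordingly), not quoted from the patch:

/*
 * Illustrative sketch of the 32-bit PAE read described above: read the
 * low 32 bits first and only read the high 32 bits if the low part is
 * non-zero (i.e. potentially pointing to a pte), avoiding cmpxchg8b.
 */
static inline pmd_t pmd_read_atomic_sketch(pmd_t *pmdp)
{
	pmdval_t ret;
	u32 *tmp = (u32 *)pmdp;

	ret = (pmdval_t)(*tmp);				/* low part */
	if (ret) {
		/*
		 * If the low part is null we must not read the high part
		 * or we can end up with a partial pmd.
		 */
		smp_rmb();
		ret |= ((pmdval_t)*(tmp + 1)) << 32;	/* high part */
	}

	return (pmd_t) { ret };
}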