From: Nadav Amit
To: Peter Zijlstra
CC: Minchan Kim, Ingo Molnar, "Stephen Rothwell", Andrew Morton,
    Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
    Linux-Next Mailing List, Linux Kernel Mailing List, Linus
Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree
Date: Tue, 15 Aug 2017 07:51:57 +0000
Message-ID: <1138ED5D-AA95-48D0-86D4-75F20DFE0E0B@vmware.com>
References: <20170811175326.36d546dc@canb.auug.org.au>
 <20170811093449.w5wttpulmwfykjzm@hirez.programming.kicks-ass.net>
 <20170811214556.322b3c4e@canb.auug.org.au>
 <20170811115607.p2vgqcp7w3wurhvw@gmail.com>
 <20170811140450.irhxa2bhdpmmhhpv@hirez.programming.kicks-ass.net>
 <20170813125019.ihqjud37ytgri7bn@hirez.programming.kicks-ass.net>
 <20170814031613.GD25427@bbox>
 <0F858068-D41D-46E3-B4A8-8A95B4EDB94F@vmware.com>
 <20170814193828.GN6524@worktop.programming.kicks-ass.net>
In-Reply-To: <20170814193828.GN6524@worktop.programming.kicks-ass.net>

Peter Zijlstra wrote:
> On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
>>>> So I'm not entirely clear about this yet.
>>>>
>>>> How about:
>>>>
>>>>
>>>>         CPU0                            CPU1
>>>>
>>>>         tlb_gather_mmu()
>>>>
>>>>         lock PTLn
>>>>         no mod
>>>>         unlock PTLn
>>>>
>>>>                                         tlb_gather_mmu()
>>>>
>>>>         lock PTLm
>>>>         mod
>>>>         include in tlb range
>>>>         unlock PTLm
>>>>
>>>>                                         lock PTLn
>>>>                                         mod
>>>>                                         unlock PTLn
>>>>
>>>>                                         tlb_finish_mmu()
>>>>                                           force = mm_tlb_flush_nested(tlb->mm);
>>>>                                           arch_tlb_finish_mmu(force);
>>>>
>>>>
>>>>         ... more ...
>>>>
>>>>         tlb_finish_mmu()
>>>>
>>>>
>>>>
>>>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>>>> true, right?
>>>
>>> No, because CPU 1 modified pte and added it into tlb range
>>> so regardless of nested, it will flush TLB so there is no stale
>>> TLB problem.
>
>> To clarify: the main problem that these patches address is when the first
>> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
>> PTE is already what I wanted - no flush is needed”.
>
> OK, that simplifies things.
>
>> For some reason (I would assume intentional), all the examples here first
>> “do not modify” the PTE, and then modify it - which is not an “interesting”
>> case.
>
> Depends on what you call 'interesting' :-) They are 'interesting' to
> make work from a memory ordering POV. And since I didn't get they were
> excluded from the set, I worried.
>
> In fact, if they were to be included, I couldn't make it work at all. So
> I'm really glad to hear we can disregard them.
>
>> However, based on what I understand on the memory barriers, I think
>> there is indeed a missing barrier before reading it in
>> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
>> before reading, would solve the problem with least impact on systems with
>> strong memory ordering.
>
> No, all is well. If, as you say, we're naturally constrained to the case
> where we only care about prior modification we can rely on the RCpc PTL
> locks.
>
> Consider:
>
>
>         CPU0                            CPU1
>
>                                         tlb_gather_mmu()
>
>         tlb_gather_mmu()
>           inc   --------.
>                        | (inc is constrained by RELEASE)
>         lock PTLn      |
>         mod            ^
>         unlock PTLn ----------------->  lock PTLn
>                        v                no mod
>                        |                unlock PTLn
>                        |
>                        |                lock PTLm
>                        |                mod
>                        |                include in tlb range
>                        |                unlock PTLm
>                        |
>         (read is constrained   |
>               by ACQUIRE)      |
>                        |                tlb_finish_mmu()
>                        `----               force = mm_tlb_flush_nested(tlb->mm);
>                                            arch_tlb_finish_mmu(force);
>
>
>                                 ... more ...
>
>         tlb_finish_mmu()
>
>
> Then CPU1's acquire of PTLn orders against CPU0's release of that same
> PTLn which guarantees we observe both its (prior) modified PTE and the
> mm->tlb_flush_pending increment from tlb_gather_mmu().
>
> So all we need for mm_tlb_flush_nested() to work is having acquired the
> right PTL at least once before calling it.
>
> At the same time, the decrements need to be after the TLB invalidate is
> complete, this ensures that _IF_ we observe the decrement, we must've
> also observed the corresponding invalidate.
>
> Something like the below is then sufficient.
>
> ---
> Subject: mm: Clarify tlb_flush_pending barriers
> From: Peter Zijlstra
> Date: Fri, 11 Aug 2017 16:04:50 +0200
>
> Better document the ordering around tlb_flush_pending.
>
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  include/linux/mm_types.h | 78 +++++++++++++++++++++++++++--------------------
>  1 file changed, 45 insertions(+), 33 deletions(-)
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
>  extern void tlb_finish_mmu(struct mmu_gather *tlb,
>                         unsigned long start, unsigned long end);
>
> -/*
> - * Memory barriers to keep this state in sync are graciously provided by
> - * the page table locks, outside of which no page table modifications happen.
> - * The barriers are used to ensure the order between tlb_flush_pending updates,
> - * which happen while the lock is not taken, and the PTE updates, which happen
> - * while the lock is taken, are serialized.
> - */
> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> -{
> -       /*
> -        * Must be called with PTL held; such that our PTL acquire will have
> -        * observed the store from set_tlb_flush_pending().
> -        */
> -       return atomic_read(&mm->tlb_flush_pending) > 0;
> -}
> -
> -/*
> - * Returns true if there are two above TLB batching threads in parallel.
> - */
> -static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> -{
> -       return atomic_read(&mm->tlb_flush_pending) > 1;
> -}
> -
>  static inline void init_tlb_flush_pending(struct mm_struct *mm)
>  {
>         atomic_set(&mm->tlb_flush_pending, 0);
> @@ -558,7 +534,6 @@ static inline void init_tlb_flush_pendin
>  static inline void inc_tlb_flush_pending(struct mm_struct *mm)
>  {
>         atomic_inc(&mm->tlb_flush_pending);
> -
>         /*
>          * The only time this value is relevant is when there are indeed pages
>          * to flush. And we'll only flush pages after changing them, which
> @@ -580,24 +555,61 @@ static inline void inc_tlb_flush_pending
>          * flush_tlb_range();
>          * atomic_dec(&mm->tlb_flush_pending);
>          *
> -        * So the =true store is constrained by the PTL unlock, and the =false
> -        * store is constrained by the TLB invalidate.
> +        * Where the increment if constrained by the PTL unlock, it thus
> +        * ensures that the increment is visible if the PTE modification is
> +        * visible. After all, if there is no PTE modification, nobody cares
> +        * about TLB flushes either.
> +        *
> +        * This very much relies on users (mm_tlb_flush_pending() and
> +        * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
> +        * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
> +        * locks (PPC) the unlock of one doesn't order against the lock of
> +        * another PTL.
> +        *
> +        * The decrement is ordered by the flush_tlb_range(), such that
> +        * mm_tlb_flush_pending() will not return false unless all flushes have
> +        * completed.
>          */
>  }
>
> -/* Clearing is done after a TLB flush, which also provides a barrier. */
>  static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>  {
>         /*
> -        * Guarantee that the tlb_flush_pending does not not leak into the
> -        * critical section, since we must order the PTE change and changes to
> -        * the pending TLB flush indication. We could have relied on TLB flush
> -        * as a memory barrier, but this behavior is not clearly documented.
> +        * See inc_tlb_flush_pending().
> +        *
> +        * This cannot be smp_mb__before_atomic() because smp_mb() simply does
> +        * not order against TLB invalidate completion, which is what we need.
> +        *
> +        * Therefore we must rely on tlb_flush_*() to guarantee order.
>          */
> -       smp_mb__before_atomic();
>         atomic_dec(&mm->tlb_flush_pending);
>  }
>
> +static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> +{
> +       /*
> +        * Must be called after having acquired the PTL; orders against that
> +        * PTLs release and therefore ensures that if we observe the modified
> +        * PTE we must also observe the increment from inc_tlb_flush_pending().
> +        *
> +        * That is, it only guarantees to return true if there is a flush
> +        * pending for _this_ PTL.
> +        */
> +       return atomic_read(&mm->tlb_flush_pending);
> +}
> +
> +static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> +{
> +       /*
> +        * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
> +        * for which there is a TLB flush pending in order to guarantee
> +        * we've seen both that PTE modification and the increment.
> +        *
> +        * (no requirement on actually still holding the PTL, that is irrelevant)
> +        */
> +       return atomic_read(&mm->tlb_flush_pending) > 1;
> +}
> +
>  struct vm_fault;
>
>  struct vm_special_mapping {

Thanks for the detailed explanation. I will pay more attention next time.
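
For my own understanding, here is a minimal user-space C11 sketch of the
argument above. It is a toy model, not kernel code: the pthread mutex stands
in for the PTL, a plain int for the PTE, and an atomic counter for
mm->tlb_flush_pending; every name below is invented for illustration. The
point it models is that the increment happens before the mutex unlock (a
release) and the reader takes the same mutex (an acquire) before looking at
the "PTE", so any reader that observes the modified PTE must also observe
the increment, with no extra barrier on the read side.

/* Toy model of the inc_tlb_flush_pending() / mm_tlb_flush_nested() ordering. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER; /* stands in for "PTLn" */
static int pte;                                         /* only written under ptl */
static atomic_int flush_pending;                        /* ~ mm->tlb_flush_pending */

static void *cpu0(void *arg)
{
        (void)arg;

        /* tlb_gather_mmu(): increment before taking and releasing the "PTL" */
        atomic_fetch_add_explicit(&flush_pending, 1, memory_order_relaxed);

        pthread_mutex_lock(&ptl);
        pte = 1;                        /* "mod" */
        pthread_mutex_unlock(&ptl);     /* RELEASE: orders both the inc and the mod */

        /* the flush and the decrement would follow here */
        return NULL;
}

static void *cpu1(void *arg)
{
        int saw_mod;

        (void)arg;

        pthread_mutex_lock(&ptl);       /* ACQUIRE of the same "PTL" */
        saw_mod = (pte == 1);           /* "no mod": only look at the PTE */
        pthread_mutex_unlock(&ptl);     /* no need to still hold it for the read */

        /*
         * If we observed cpu0's PTE modification, the release/acquire pair on
         * ptl guarantees we also observe its increment, even with a relaxed
         * read; this mirrors what mm_tlb_flush_nested() relies on.
         */
        if (saw_mod)
                assert(atomic_load_explicit(&flush_pending,
                                            memory_order_relaxed) >= 1);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("pte=%d pending=%d\n", pte, atomic_load(&flush_pending));
        return 0;
}

Assuming the model is faithful, the assert can never fire no matter how the
two threads interleave, which matches the claim that having acquired the
right PTL at least once before calling mm_tlb_flush_nested() is sufficient.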