From: Andy Lutomirski
Date: Fri, 8 Sep 2017 17:00:05 -0700
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf
To: Linus Torvalds
Cc: Andy Lutomirski, Borislav Petkov, Markus Trippelsdorf, Ingo Molnar,
    Thomas Gleixner, Peter Zijlstra, LKML, Ingo Molnar, Tom Lendacky
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 8, 2017 at 4:23 PM, Linus Torvalds wrote:
> On Fri, Sep 8, 2017 at 4:07 PM, Andy Lutomirski wrote:
>>
>> I *think* this is impossible because CPU A's mm_cpumask manipulations
>> are atomic and should therefore force out the streaming write buffers,
>> but maybe there's some other scenario where this matters.
>
> I don't think atomic memops do that.
>
> They enforce globally visible ordering, but since they happen in the
> cache and are not actually visible outside, that doesn't actually
> affect any streaming write buffers.
>
> Then, if somebody else requests a cacheline that we have exclusive
> ownership of, the write buffers just need to flush before we give up
> that cacheline.
>
> So a locked memory op is *not* serializing, it only enforces memory
> ordering. Big difference.
>
> Only fully serializing instructions will serialize with the write
> buffers, and they are expensive as hell (partly exactly _due_ to these
> kinds of issues).

I'm not convinced. The SDM says (Vol 3, 11.3, under WC):

  If the WC buffer is partially filled, the writes may be delayed until
  the next occurrence of a serializing event; such as, an SFENCE or
  MFENCE instruction, CPUID execution, a read or write to uncached
  memory, an interrupt occurrence, or a LOCK instruction execution.

Thanks, Intel, for defining "serializing event" differently here than
anywhere else in the whole manual.
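To make the two notions concrete, here's a minimal sketch (mine, not
from the thread; lock_barrier and wc_drain are made-up names) of the
distinction, in GCC-style x86-64 inline asm:

/* A LOCK-prefixed RMW enforces global memory ordering (Linus's point),
 * and the SDM text above also lists "a LOCK instruction execution"
 * among the events that may flush a partially filled WC buffer (my
 * point) -- but it is not an architecturally serializing instruction. */
static inline void lock_barrier(volatile unsigned int *p)
{
	__asm__ __volatile__("lock; addl $0, %0" : "+m" (*p) : : "memory");
}

/* SFENCE, by contrast, is explicitly documented to drain the WC
 * buffers, and is much cheaper than a fully serializing instruction
 * like CPUID. */
static inline void wc_drain(void)
{
	__asm__ __volatile__("sfence" : : : "memory");
}

Whether the LOCK flavor is enough for the lazy-flush case is exactly
the question.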
Anyhow, I can think of two cases where this is relevant.

1. The kernel wants to reclaim a page of normal memory, so it unmaps
the page and flushes. Another CPU has an entry for that page in its WC
buffer. I don't think we care whether the flush causes the WC write to
really hit RAM, because that's unobservable -- we just need to make
sure it is ordered, as seen by software, before the flush operation
completes. From the quote above, I think we're okay here.

2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
are really done writing to it.

Here I'm less convinced. The SDM quote certainly suggests to me that
we have a promise that the WC write has *started* before
flush_tlb_mm_range returns, but I'm not sure I believe that it's
guaranteed to have retired. That being said, I'm not sure that this is
observable either -- anything the kernel does that depends on the
writes being done presumably has to involve further IO to the same
device, and I suspect that

  WC write; LOCKed write to memory; observe that write on another CPU;
  IO on the other CPU

really does guarantee that everything hits the bus in order.

FWIW, I'm not sure that we ever had a guarantee that IO writes were
all fully done before flush_tlb_mm_range would return. Can't they
still be hanging out in the PCI bridge or whatever until someone
*reads* that device?

> So this change to delay invalidation does sound fairly scary..

With PCID, we're fundamentally delaying invalidation. I think the
worry is more that we're not guaranteeing that every CPU that could
have accessed the page being flushed has executed a real serializing
instruction.

If we want to force invalidation/serialization, I can see two
reasonable solutions:

1. Revert this behavior change: continue sending IPIs to lazy CPUs.
The problem is that this will totally wipe out the performance gain,
and that gain seemed to be substantial in at least some
microbenchmarks.

2. Get rid of lazy mode. With PCID at least, switching to init_mm
isn't so bad. I'd prefer to leave it as is except on the buggy AMD
CPUs, though, since the current code is nice and fast.
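Just to make (2) concrete, a rough sketch of the shape I have in mind.
This is illustrative, not a patch: this_cpu_is_lazy() is a made-up
helper standing in for the real lazy-mode bookkeeping, while
leave_mm(), smp_processor_id(), and local_flush_tlb() are the real
interfaces.

/* Flush IPI handler sketch: instead of letting a lazy CPU defer
 * invalidation, make it drop to init_mm so it holds no live user
 * translations at all. */
static void flush_tlb_func_sketch(void *info)
{
	if (this_cpu_is_lazy()) {	/* made-up helper */
		/*
		 * Switch to init_mm.  After this, no stale user
		 * translation can be used on this CPU, so the flusher
		 * doesn't need us to serialize anything else.
		 */
		leave_mm(smp_processor_id());
		return;
	}

	/* Non-lazy CPUs do the real invalidation. */
	local_flush_tlb();
}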