In-Reply-To: <8D582966-08B6-46F2-B12A-BC33F7EF0EB6@amacapital.net>
References: <20170908080536.ninspvplibd37fj2@pd.tnic>
 <20170908091614.nmdxjnukxowlsjja@pd.tnic>
 <20170908094815.GA278@x4>
 <20170908103513.npjmb2kcjt2zljb2@gmail.com>
 <20170908103906.GB278@x4>
 <20170908113039.GA285@x4>
 <20170908171633.GA279@x4>
 <20170908215656.qw66lgfsfgpoqrdm@pd.tnic>
 <8D582966-08B6-46F2-B12A-BC33F7EF0EB6@amacapital.net>
From: Andy Lutomirski
Date: Sat, 9 Sep 2017 10:49:35 -0700
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf
To: Linus Torvalds
Cc: Andy Lutomirski, Borislav Petkov, Markus Trippelsdorf, Ingo Molnar,
 Thomas Gleixner, Peter Zijlstra, LKML, Ingo Molnar, Tom Lendacky
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 8, 2017 at 6:39 PM, Andy Lutomirski wrote:
>
>
>> On Sep 8, 2017, at 6:05 PM, Linus Torvalds wrote:
>>
>>> On Fri, Sep 8, 2017 at 5:00 PM, Andy Lutomirski wrote:
>>>
>>> I'm not convinced.
>>> The SDM says (Vol 3, 11.3, under WC):
>>>
>>> If the WC buffer is partially filled, the writes may be delayed until
>>> the next occurrence of a serializing event; such as, an SFENCE or
>>> MFENCE instruction, CPUID execution, a read or write to uncached
>>> memory, an interrupt occurrence, or a LOCK instruction execution.
>>>
>>> Thanks, Intel, for defining "serializing event" differently here than
>>> anywhere else in the whole manual.
>>
>> Yeah, it's really badly defined. Ok, maybe a locked instruction does
>> actually wait for it.. It should be invisible to anything, regardless.
>>
>>> 1. The kernel wants to reclaim a page of normal memory, so it unmaps
>>> it and flushes. Another CPU has an entry for that page in its WC
>>> buffer. I don't think we care whether the flush causes the WC write
>>> to really hit RAM because it's unobservable -- we just need to make
>>> sure it is ordered, as seen by software, before the flush operation
>>> completes. From the quote above, I think we're okay here.
>>
>> Agreed.
>>
>>> 2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
>>> It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
>>> are really done writing to it. Here I'm less convinced. The SDM
>>> quote certainly suggests to me that we have a promise that the WC
>>> write has *started* before flush_tlb_mm_range returns, but I'm not
>>> sure I believe that it's guaranteed to have retired.
>>
>> If others have writable TLB entries, what keeps them from just
>> continuing to write for a long time afterwards?
>
> Whoever unmaps the resource by kicking out their drm fd? I admit I'm
> just trying to think of the worst case.
>
>>
>>> I'd prefer to leave it as is except on the buggy AMD CPUs, though,
>>> since the current code is nice and fast.
>>
>> So is there a patch to detect the 383 erratum and serialize for those?
>> I may have missed that part.
>>
>
> The patch is in my head. It's imaginarily attached to this email.

After contemplating the info from Boris and Markus, I think I need to
add a #3 to the list of reasons my patch could be problematic:

3. If a CPU frees a page table (or PUD or PMD or whatever), that CPU
will flush before the memory goes back to the system. If that flush
is deferred on a different CPU that has the pointer to the freed table
cached in its TLB, then that CPU can speculatively load complete
garbage into its TLB. I don't think this should be observable, but I
can easily imagine it triggering errata or weird ill-advised machine
checks.

Anyway, if I need to change the behavior back, I can do it in one of
two ways. I can just switch to init_mm instead of going lazy, which is
expensive, but not *that* expensive on CPUs with PCID. Or I can do it
the way we used to do it and send the flush IPI to lazy CPUs. The
latter only has a performance impact when a flush happens, but each
flush then costs much more.

--Andy
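
Illustration only, not part of the original message and not kernel
code: a toy user-space model of the trade-off between the two fallback
options described above. Every name here (toy_cpu, toy_stats,
toy_policy, toy_enter_idle, toy_flush, toy_simulate) is hypothetical.
The model assumes one CPU issues all flushes while the others sit idle,
and it only counts events: extra switches to init_mm under the first
option, and flush IPIs delivered to lazy CPUs under the second.

```c
#include <stdbool.h>

#define TOY_NCPUS 4

/* Hypothetical per-CPU state; this is not a kernel structure. */
struct toy_cpu {
    bool lazy;   /* idle, but the user mm's page tables are still loaded */
    bool has_mm; /* the user mm is loaded on this CPU */
};

/* Event counters for the two costs discussed in the message. */
struct toy_stats {
    unsigned long init_mm_switches; /* paid on every idle entry */
    unsigned long lazy_ipis;        /* paid on every flush */
};

enum toy_policy {
    TOY_SWITCH_TO_INIT_MM, /* never go lazy: drop the user mm at idle */
    TOY_IPI_LAZY_CPUS      /* go lazy, and let flushes interrupt us */
};

/* A CPU stops running the user task and goes idle. */
void toy_enter_idle(struct toy_cpu *cpu, enum toy_policy policy,
                    struct toy_stats *stats)
{
    if (policy == TOY_SWITCH_TO_INIT_MM) {
        cpu->lazy = false;
        cpu->has_mm = false;
        stats->init_mm_switches++;
    } else {
        cpu->lazy = true; /* user mm stays loaded */
    }
}

/* CPU 'self' flushes: every other lazy CPU with the mm loaded gets an IPI. */
void toy_flush(const struct toy_cpu cpus[], int self,
               struct toy_stats *stats)
{
    for (int i = 0; i < TOY_NCPUS; i++) {
        if (i != self && cpus[i].has_mm && cpus[i].lazy)
            stats->lazy_ipis++;
    }
}

/*
 * CPUs 1..TOY_NCPUS-1 each go idle 'idle_entries' times (returning to
 * the user task in between), then CPU 0 issues 'flushes' flushes.
 */
struct toy_stats toy_simulate(enum toy_policy policy,
                              unsigned idle_entries, unsigned flushes)
{
    struct toy_cpu cpus[TOY_NCPUS];
    struct toy_stats stats = { 0, 0 };

    for (int i = 0; i < TOY_NCPUS; i++) {
        cpus[i].lazy = false;
        cpus[i].has_mm = true;
    }

    for (unsigned n = 0; n < idle_entries; n++) {
        for (int i = 1; i < TOY_NCPUS; i++) {
            /* Back on the user task before the next idle entry. */
            cpus[i].lazy = false;
            cpus[i].has_mm = true;
            toy_enter_idle(&cpus[i], policy, &stats);
        }
    }

    for (unsigned n = 0; n < flushes; n++)
        toy_flush(cpus, 0, &stats);

    return stats;
}
```

Calling toy_simulate() with each policy and the same idle/flush counts
shows where the cost lands: the init_mm variant pays on every idle
entry and nothing per flush, while the IPI variant pays nothing at idle
and one IPI per lazy CPU per flush. The relative weight of one switch
versus one IPI is not modeled here.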