MIME-Version: 1.0
In-Reply-To: <8D582966-08B6-46F2-B12A-BC33F7EF0EB6@amacapital.net>
References: <alpine.DEB.2.20.1709080826110.1888@nanos> <20170908080536.ninspvplibd37fj2@pd.tnic>
 <20170908091614.nmdxjnukxowlsjja@pd.tnic> <20170908094815.GA278@x4>
 <20170908103513.npjmb2kcjt2zljb2@gmail.com> <20170908103906.GB278@x4>
 <20170908113039.GA285@x4> <CALCETrXkWrQjSqpTQw_vJwN_hRp6eWXEOcibM+57NmC+HWxGNg@mail.gmail.com>
 <20170908171633.GA279@x4> <CALCETrX6mFHWa1oZxS7DQUtnbqyrRRg53bvgGFv19ka8Sy_KcA@mail.gmail.com>
 <20170908215656.qw66lgfsfgpoqrdm@pd.tnic> <CALCETrUYGv_8GmWpRanYOuJfRfBijCMwMfCtUunKcZ3B0=8M8g@mail.gmail.com>
 <CA+55aFx91B5_eBvuWm2=SGmn_NOTRmooR=cKoP6Nj-DLeM2PMA@mail.gmail.com>
 <CALCETrWTh6YccfBgiTNE5C0FWBegH9LCqUTaj88s81LgUn1MsQ@mail.gmail.com>
 <CA+55aFzf1C+WXRtfkdwAbb0kniYJfkfm=rghouM3u8x7-ZJGMg@mail.gmail.com> <8D582966-08B6-46F2-B12A-BC33F7EF0EB6@amacapital.net>
From: Andy Lutomirski <luto@kernel.org>
Date: Sat, 9 Sep 2017 10:49:35 -0700
Message-ID: <CALCETrVDR-KXkSvhEQda=v0UUVXdS1kitV_MkD0mjHaf+qs4Zg@mail.gmail.com>
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>, Borislav Petkov <bp@alien8.de>,
        Markus Trippelsdorf <markus@trippelsdorf.de>,
        Ingo Molnar <mingo@kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>,
        LKML <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@redhat.com>,
        Tom Lendacky <thomas.lendacky@amd.com>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org

On Fri, Sep 8, 2017 at 6:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Sep 8, 2017, at 6:05 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>> On Fri, Sep 8, 2017 at 5:00 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> I'm not convinced.  The SDM says (Vol 3, 11.3, under WC):
>>>
>>> If the WC buffer is partially filled, the writes may be delayed until
>>> the next occurrence of a serializing event; such as, an SFENCE or
>>> MFENCE instruction, CPUID execution, a read or write to uncached
>>> memory, an interrupt occurrence, or a LOCK instruction execution.
>>>
>>> Thanks, Intel, for defining "serializing event" differently here than
>>> anywhere else in the whole manual.
>>
>> Yeah, it's really badly defined. Ok, maybe a locked instruction does
>> actually wait for it.. It should be invisible to anything, regardless.
>>
>>> 1. The kernel wants to reclaim a page of normal memory, so it unmaps
>>> it and flushes.  Another CPU has an entry for that page in its WC
>>> buffer.  I don't think we care whether the flush causes the WC write
>>> to really hit RAM because it's unobservable -- we just need to make
>>> sure it is ordered, as seen by software, before the flush operation
>>> completes.  From the quote above, I think we're okay here.
>>
>> Agreed.
>>
>>> 2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
>>> It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
>>> are really done writing to it.  Here I'm less convinced.  The SDM
>>> quote certainly suggests to me that we have a promise that the WC
>>> write has *started* before flush_tlb_mm_range returns, but I'm not
>>> sure I believe that it's guaranteed to have retired.
>>
>> If others have writable TLB entries, what keeps them from just
>> continuing to write for a long time afterwards?
>
> Whoever unmaps the resource by kicking out their drm fd?  I admit I'm just trying to think of the worst case.
>
>>
>>> I'd prefer to leave it as is except on the buggy AMD CPUs, though,
>>> since the current code is nice and fast.
>>
>> So is there a patch to detect the 383 erratum and serialize for those?
>> I may have missed that part.
>>
>
> The patch is in my head.  It's imaginarily attached to this email.

After contemplating the info from Boris and Markus, I think I need to
add a #3 to the list of reasons my patch could be problematic:

3. If a CPU frees a page table (or PUD or PMD or whatever), that CPU
will flush before the memory goes back to the system.  If that flush
is deferred on a different CPU that has the pointer to the freed table
cached in its TLB, then that CPU can speculatively load complete
garbage into its TLB.

I don't think this should be observable, but I can easily imagine it
triggering errata or weird ill-advised machine checks.

Anyway, if I need to change the behavior back, I can do it in one of two
ways.  I can just switch to init_mm instead of going lazy, which is
expensive, but not *that* expensive on CPUs with PCID.  Or I can do it
the way we used to do it and send the flush IPI to lazy CPUs.  The
latter costs nothing until a flush actually happens, but each flush
then takes a much bigger performance hit.
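
To make the tradeoff concrete, here is a toy user-space model of the
deferred-flush hazard from #3 and the two ways back.  It is only a
sketch, not kernel code: Cpu, enter_lazy, flush_mm and scenario are
made-up names, and the counters just stand in for where the cost lands
(a page-table switch on every lazy entry versus an extra flush on the
lazy CPU).

```python
# Toy model only; every name below is hypothetical, none of it is kernel API.
from dataclasses import dataclass

INIT_MM = "init_mm"


@dataclass
class Cpu:
    loaded_mm: str = INIT_MM  # page tables this CPU's TLB may still walk
    lazy: bool = False        # True while borrowing a user mm in lazy mode
    flushes: int = 0          # TLB flushes performed on this CPU
    switches: int = 0         # page-table switches performed on this CPU


def enter_lazy(cpu, switch_to_init_mm):
    """The CPU stops running user code for its current mm."""
    if switch_to_init_mm:
        # Option 1: drop the user mm immediately; paid on every lazy entry.
        cpu.loaded_mm = INIT_MM
        cpu.switches += 1
    else:
        # Stay on the user mm and just mark the CPU lazy.
        cpu.lazy = True


def flush_mm(cpus, mm, ipi_lazy_cpus):
    """Flush `mm`; return indices of CPUs left with unflushed entries."""
    stale = []
    for i, cpu in enumerate(cpus):
        if cpu.loaded_mm != mm:
            continue  # this CPU holds nothing for mm
        if cpu.lazy and not ipi_lazy_cpus:
            # Deferred flush: this CPU can still walk freed page tables.
            stale.append(i)
        else:
            # Option 2 (or a non-lazy CPU): the flush runs here, now.
            cpu.flushes += 1
    return stale


def scenario(switch_to_init_mm, ipi_lazy_cpus):
    """CPU 1 goes lazy on mm_A, then mm_A is flushed (say, from CPU 0)."""
    cpus = [Cpu(loaded_mm="mm_A"), Cpu(loaded_mm="mm_A")]
    enter_lazy(cpus[1], switch_to_init_mm)
    stale = flush_mm(cpus, "mm_A", ipi_lazy_cpus)
    return stale, cpus
```

In this model scenario(False, False) is the deferred case and leaves
CPU 1 stale, scenario(True, False) is the init_mm switch, and
scenario(False, True) is the IPI-to-lazy-CPUs approach.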

--Andy