From: Andy Lutomirski
Date: Tue, 27 Aug 2019 17:30:01 -0700
Subject: Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
To: Nadav Amit
Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra, Thomas Gleixner, Ingo Molnar
References: <20190823224635.15387-1-namit@vmware.com> <3989CBFF-F1C1-4F64-B8C4-DBFF80997857@vmware.com>
In-Reply-To: <3989CBFF-F1C1-4F64-B8C4-DBFF80997857@vmware.com>

On Tue, Aug 27, 2019 at 4:52 PM Nadav Amit wrote:
>
> > On Aug 27, 2019, at 4:18 PM, Andy Lutomirski wrote:
> >
> > On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit wrote:
> >>
> >> INVPCID is considerably slower than INVLPG of a single PTE, but it is
> >> currently used to flush PTEs in the user page-table when PTI is used.
> >>
> >> Instead, it is possible to defer TLB flushes until after the user
> >> page-tables are loaded. Preventing speculation over the TLB flushes
> >> should keep the whole thing safe. In some cases, deferring TLB flushes
> >> in such a way can result in more full TLB flushes, but arguably this
> >> behavior is oftentimes beneficial.
> >
> > I have a somewhat horrible suggestion.
> >
> > Would it make sense to refactor this so that it works for user *and*
> > kernel tables? In particular, if we flush a *kernel* mapping (vfree,
> > vunmap, set_memory_ro, etc.), we shouldn't need to send an IPI to a
> > task that is running user code to flush most kernel mappings or even
> > to free kernel pagetables. The same trick could be done if we treat
> > idle like user mode for this purpose.
> >
> > In code, this could mostly consist of changing all the "user" data
> > structures involved to something like struct deferred_flush_info and
> > having one for user and one for kernel.
> >
> > I think this is horrible because it will enable certain workloads to
> > work considerably faster with PTI on than with PTI off, and that would
> > be a barely excusable moral failing. :-p
> >
> > For what it's worth, other than register clobber issues, the whole
> > "switch CR3 for PTI" logic ought to be doable in C. I don't know a
> > priori whether that would end up being an improvement.
>
> I implemented (and have not yet sent) another TLB deferring mechanism. It
> is intended for user mappings and not kernel ones, but I think your
> suggestion shares a similar underlying rationale, and therefore similar
> challenges and solutions. Let me rephrase what you say to ensure we are on
> the same page.
>
> The basic idea is to use context tracking to check whether each CPU is in
> kernel or user mode. Accordingly, TLB flushes can be deferred, but I don't
> see that this solution is limited to PTI. There are two possible reasons,
> as I understand it, that you limit the discussion to PTI:
>
> 1. PTI provides clear boundaries between user and kernel mappings. I am
> not sure that privilege levels (and SMAP) do not do the same.
>
> 2. CR3 switching already imposes a memory barrier, which eliminates most
> of the cost of implementing such a scheme, which requires something
> similar to:
>
>	write new context (kernel/user)
>	mb();
>	if (need_flush)
>		flush;
>
> I do agree that PTI addresses (2), but there is another problem. A
> reasonable implementation would store in per-CPU state whether each CPU is
> in user or kernel mode, and the TLB shootdown initiator CPU would check
> that state to decide whether an IPI is needed. This means that pretty much
> every TLB shootdown would incur a cache miss per target CPU. This might
> cause performance regressions, at least in some cases.

We already more or less do this: we have mm_cpumask(), which is
particularly awful since it writes to a falsely-shared line on each
context switch.
For what it's worth, in some sense your patch series is reinventing the
tracking that is already in cpu_tlbstate -- when we do a flush on one mm
and some CPU is running another mm, we don't do an IPI shootdown --
instead we set flags so that it will be flushed the next time it's used.
Maybe we could actually refactor this so we only have one copy of the
code that handles all the various deferred-flush variants. Perhaps each
tracked mm context could have a user tlb_gen_id and a kernel tlb_gen_id.

I guess one thing that makes this nasty is that we need to flush the
kernel PCID for kernel *and* user invalidations. Sigh.

--Andy