Received: by 2002:a25:6193:0:0:0:0:0 with SMTP id v141csp593827ybb; Fri, 20 Mar 2020 04:50:15 -0700 (PDT) X-Google-Smtp-Source: ADFU+vvKB9qI5x4E5nvMnlyS+eRqCF7tpzs+/cTXNg/AzBUTVdRO49sucCnKX7KswU8xdFMOTHkB X-Received: by 2002:a9d:769a:: with SMTP id j26mr5156124otl.143.1584705015159; Fri, 20 Mar 2020 04:50:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1584705015; cv=none; d=google.com; s=arc-20160816; b=p0IA8L2ozv/DIwuyPY4jwKscry7yOyzyPwch4rmOkHckk7LRMO66BJNWOvzB37QTGK B36+jNZ2f3l4QykDeZAd3OrveMSvbQOSzu1DsRaM1FPbqIZMQy+qbQ7NPxdLV/7oV+Od p2DYVD0/+m59EQbt5RS12qbFLhLCaGiAbONI/vWuMmNWWGNDpIBBEiUow+iGAUDXGk4o eWloqOd73Ty1/E35FOFq+cUbPnvX+ArB3LP2A0Ul2k+3WLAuewV+QTiMA43E1GF8JEqB TX6REenH7J52NQDqXIRnUEUiT8loKfWfk2NDeDhHyHB1Q/ZToyotW35OzmyADpw/lIaO hghg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:message-id:date:references :in-reply-to:subject:cc:to:from; bh=l1vT4GcmExCzVWWJq4y3g1gfeYvyanDaDOgmHL7At8w=; b=goo6KvrpzlRDPqX01Y/9YgpjYpOWp7YDszO9sz7JQEOdQzk3NCtv9GPlYpRECDKRhO Wi5Ye/JzOTTIvoQi/hLgMPdxXX1C42EKuqtTmro9vfhG83Oqz89iCiYMqW7kDTBQwTxb zdCEWifxyxRHrSoJxo9YfuUWT4nr84AtFevnEIZ01tc9BRSPZyksh1Yj593kPO2IujKA buDKaV5EIvuVtvqw2KalUWK/1IvA+/ytlSYI20haKbHd6thM5NNsLTqBZqjydYpMXpl5 xFR7eeMOYv5ZPSCDJ6wsg/ocNjO5kVQ79vdzp00aXy7ujHeJvjm/uoh+eqcb9oGv8qV3 kuuQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id v130si2680635oib.115.2020.03.20.04.50.03; Fri, 20 Mar 2020 04:50:15 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727005AbgCTLtY (ORCPT + 99 others); Fri, 20 Mar 2020 07:49:24 -0400 Received: from Galois.linutronix.de ([193.142.43.55]:35417 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726814AbgCTLtX (ORCPT ); Fri, 20 Mar 2020 07:49:23 -0400 Received: from p5de0bf0b.dip0.t-ipconnect.de ([93.224.191.11] helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1jFG9N-0002wo-Oz; Fri, 20 Mar 2020 12:49:18 +0100 Received: by nanos.tec.linutronix.de (Postfix, from userid 1000) id 931B8100375; Fri, 20 Mar 2020 12:49:16 +0100 (CET) From: Thomas Gleixner To: "Singh\, Balbir" , "linux-kernel\@vger.kernel.org" Cc: "keescook\@chromium.org" , "Herrenschmidt\, Benjamin" , "x86\@kernel.org" Subject: Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch In-Reply-To: <97b2bffc16257e70b8aa98ee86622dc4178154c4.camel@amazon.com> References: <20200313220415.856-1-sblbir@amazon.com> <87imj19o13.fsf@nanos.tec.linutronix.de> <97b2bffc16257e70b8aa98ee86622dc4178154c4.camel@amazon.com> Date: Fri, 20 Mar 2020 12:49:16 +0100 Message-ID: <8736a3456r.fsf@nanos.tec.linutronix.de> MIME-Version: 1.0 Content-Type: text/plain X-Linutronix-Spam-Score: -1.0 X-Linutronix-Spam-Level: - X-Linutronix-Spam-Status: No , -1.0 points, 5.0 required, ALL_TRUSTED=-1,SHORTCIRCUIT=-0.0001 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Balbir, "Singh, Balbir" writes: > On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote: >> What's the point? The attack surface is the L1D content of the scheduled >> out task. If the malicious task schedules out, then why would you care? >> >> I might be missing something, but AFAICT this is beyond paranoia. >> > > I think there are two cases > > 1. Task with important data schedules out > 2. Malicious task schedules in > > These patches address 1, but call out case #2 The point is if the victim task schedules out, then there is no reason to flush L1D immediately in context switch. If that just schedules a kernel thread and then goes back to the task, then there is no point unless you do not even trust the kernel thread. >> > 3. There is a fallback software L1D load, similar to what L1TF does, but >> > we don't prefetch the TLB, is that sufficient? >> >> If we go there, then the KVM L1D flush code wants to move into general >> x86 code. > > OK.. we can definitely consider reusing code, but I think the KVM bits require > tlb prefetching, IIUC before cache flush to negate any bad translations > associated with an L1TF fault, but the code/comments are not clear on the need > to do so. I forgot the gory details by now, but having two entry points or a conditional and share the rest (page allocation etc.) is definitely better than two slightly different implementation which basically do the same thing. >> > +void enable_l1d_flush_for_task(struct task_struct *tsk) >> > +{ >> > + struct page *page; >> > + >> > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) >> > + goto done; >> > + >> > + mutex_lock(&l1d_flush_mutex); >> > + if (l1d_flush_pages) >> > + goto done; >> > + /* >> > + * These pages are never freed, we use the same >> > + * set of pages across multiple processes/contexts >> > + */ >> > + page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER); >> > + if (!page) >> > + return; >> > + >> > + l1d_flush_pages = page_address(page); >> > + /* I don't think we need to worry about KSM */ >> >> Why not? Even if it wouldn't be necessary why would we care as this is a >> once per boot operation in fully preemptible code. > > Not sure I understand your question, I was stating that even if KSM was > running, it would not impact us (with dedup), as we'd still be writing out 0s > to the cache line in the fallback case. I probably confused myself vs. the comment in the VMX code, but that mentions nested virt. Needs at least some consideration when we reuse that code. >> > void switch_mm(struct mm_struct *prev, struct mm_struct *next, >> > struct task_struct *tsk) >> > { >> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct >> > mm_struct *next, >> > trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0); >> > } >> > >> > + l1d_flush(next, tsk); >> >> This is really the wrong place. You want to do that: >> >> 1) Just before return to user space >> 2) When entering a guest >> >> and only when the previously running user space task was the one which >> requested this massive protection. >> > > Cases 1 and 2 are handled via > > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion) How so? SWAPGS mitigation does not flush L1D. It merily serializes SWAPGS. > 2. L1TF fault handling > > This mechanism allows for flushing not restricted to 1 or 2, the idea is to > immediately flush L1D for paranoid processes on mm switch. Why? To protect the victim task against the malicious kernel? The L1D content of the victim is endangered in the following case: victim out -> attacker in The attacker can either run in user space or in guest mode. So the flush is only interesting when the attacker actually goes back to user space or reenters the guest. The following is completely uninteresting: victim out -> kernel thread in/out -> victim in Even this is uninteresting: victim in -> attacker in (stays in kernel, e.g. waits for data) -> attacker out -> victim in So the point where you want to flush conditionally is when the attacker leaves kernel space either to user mode or guest mode. So if the victim schedules out it sets a per cpu request to flush L1D on the borders. And then you have on return to user: if (this_cpu_flush_l1d()) flush_l1d() and in kvm: if (this_cpu_flush_l1d() || L1TF_flush_L1D) flush_l1d() The request does: if (!this_cpu_read(l1d_flush_for_task)) this_cpu_write(l1d_flush_for_task, current) The check does: p = this_cpu_read(l1d_flush_for_task); if (p) { this_cpu_write(l1d_flush_for_task, NULL); return p != current; } return false; Hmm? Thanks, tglx