From: Thomas Gleixner
To: Tim Chen, Joel Fernandes, Julien Desfossez, Peter Zijlstra
Cc: Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan, Ingo Molnar,
 Paul Turner, Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
 Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld, Aaron Lu,
 Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, "Luck, Tony"
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4
References: <3c3c56c1-b8dc-652c-535e-74f6dcf45560@linux.intel.com>
 <20200212230705.GA25315@sinkpad>
 <29d43466-1e18-6b42-d4d0-20ccde20ff07@linux.intel.com>
 <20200221232057.GA19671@sinkpad>
 <20200317005521.GA8244@google.com>
Date: Tue, 17 Mar 2020 22:17:47 +0100
Message-ID:
 <87imj2bs04.fsf@nanos.tec.linutronix.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Tim,

Tim Chen writes:
>> However, I have the following questions, in particular there are 4 scenarios
>> where I feel the current patches do not resolve MDS/L1TF, would you guys
>> please share your thoughts?
>>
>> 1. HT1 is running either hostile guest or host code.
>>    HT2 is running an interrupt handler (victim).
>>
>>    In this case I see there is a possible MDS issue between HT1 and HT2.
>
> Core scheduling mitigates the userspace to userspace attacks via MDS between the HT.
> It does not prevent the userspace to kernel space attack. That will
> have to be mitigated via other means, e.g. redirecting interrupts to a core
> that don't run potentially unsafe code.

Which is in some cases simply impossible. Think multiqueue devices with
managed interrupts. You can't change the affinity of those. Neither can
you do that for the per-CPU timer interrupt.

>> 2. HT1 is executing hostile host code, and gets interrupted by a victim
>>    interrupt. HT2 is idle.
>
> Similar to above.

No. It's the same HT, so not similar at all.

>>    In this case, I see there is a possible MDS issue between interrupt and
>>    the host code on the same HT1.
>
> The cpu buffers are cleared before return to the hostile host code. So
> MDS shouldn't be an issue if interrupt handler and hostile code
> runs on the same HT thread.

OTOH, that's mostly correct. Aside from the "shouldn't" wording:

  MDS _is_ no issue in this case when the full mitigation is enabled.

Assuming that I have no less information about MDS than you have :)

>> 3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
>>    handler on the host.
>>
>>    In this case, I see there is a possible L1TF issue between HT1 and HT2.
>>    This issue does not happen if HT1 is running host code, since the host
>>    kernel takes care of inverting PTE bits.
>
> The interrupt handler will be run with PTE inverted. So I don't think
> there's a leak via L1TF in this scenario.

How so?

Host memory is attackable when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context.

HT1 is in guest mode and attacking (has control over PTEs). HT2 is
running in host mode and executes an interrupt handler. The host PTE
inversion does not matter in this scenario at all.

So HT1 can very well see data which is brought into the shared L1 by
HT2.

The only way to mitigate that, aside from disabling HT, is disabling EPT.

>> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
>>    hostile code on guest or host. HT2 is being forced idle. However, there is
>>    an overlap between HT1 starting to execute hostile code and HT2's victim
>>    process getting scheduled out.
>>    Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
>>    counter of the sibling being idled to detect that it is now idle.
>>    However we discussed today that looking at this data, it is not really an
>>    issue since it is such a small window.

If the victim HT is kicked out of execution with an IPI then the overlap
depends on the contexts:

        HT1 (attack)            HT2 (victim)

 A      idle -> user space      user space -> idle
 B      idle -> user space      guest -> idle
 C      idle -> guest           user space -> idle
 D      idle -> guest           guest -> idle

The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
host user mode or brings it immediately into VMEXIT when HT2 is in guest
mode.

  #A On return from handling the IPI HT2 immediately reschedules to idle.
     To have an overlap the return to user space on HT1 must be faster.
  #B Coming back from VMEXIT into schedule/idle might take slightly longer
     than #A.
  #C Similar to #A, but reentering guest mode in HT1 after sending the IPI
     will probably take longer.
  #D Similar to #C if you make the assumption that VMEXIT on HT2 and
     rescheduling into idle is not significantly slower than reaching
     VMENTER after sending the IPI.

In all cases the data exposed by a potential overlap shouldn't be that
interesting (e.g. scheduler state), but that obviously depends on what
the attacker is looking for.

But all of them are still problematic vs. interrupts / soft interrupts
which can happen on HT2 on the way to idle or while idling, i.e. #3 of
the original case list. #A and #B are only affected by MDS, #C and #D by
both MDS and L1TF (if EPT is in use).

>> My concern is now cases 1, 2 to which there does not seem a good solution,
>> short of disabling interrupts. For 3, we could still possibly do something on
>> the guest side, such as using shadow page tables. Any thoughts on all this?

#1 can be partially mitigated by changing interrupt affinities, which is
   not always possible and in the case of the local timer interrupt
   completely impossible. It's not only the timer interrupt itself: the
   timer callbacks which can run in the softirq on return from interrupt
   might be a valuable attack surface depending on the nature of the
   callbacks, the random entropy timer just being a random example.

#2 is a non-issue if MDS mitigation is on, i.e. buffers are flushed
   before returning to user space. It's pretty much a non-SMT case,
   i.e. same CPU user to kernel attack.

#3 can only be fully mitigated by disabling EPT.

#4 Assuming that my assumptions about transition times are correct,
   which I think they are, #4 is pretty much redirected to #1.

Hope that helps.

Thanks,

        tglx