Date: Tue, 25 Aug 2020 10:19:03 -0700
From: Sean Christopherson
To: Andy Lutomirski
Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
    Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel,
    Jan Kiszka, Tony W Wang-oc, "H. Peter Anvin", Asit Mallick,
    Gordon Tetlow, David Kaplan, Tony Luck
Subject: Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
Message-ID: <20200825171903.GA20660@sjchrist-ice>
References: <875z98jkof.fsf@nanos.tec.linutronix.de> <3babf003-6854-e50a-34ca-c87ce4169c77@citrix.com> <20200825043959.GF15046@sjchrist-ice>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote:
> On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson wrote:
> >
> > +Andy
> >
> > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > > And to help with coordination, here is something prepared (slightly)
> > > earlier.
> > >
> > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> > >
> > > This identifies the problems from software's perspective, along with
> > > proposing behaviour which ought to resolve the issues.
> > >
> > > It is still a work-in-progress. The #VE section still needs updating in
> > > light of the publication of the recent TDX spec.
> >
> > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> > something we (Linux) as the guest kernel actually want to handle gracefully
> > (where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap
> > would require one of two things:
> >
> >   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
> >      relevant pages, e.g. code or scratch data.
> >
> >   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> >
> > (a) is only possible if there's a fatal buggy guest kernel (or perhaps vBIOS).
> > (b) requires either a buggy or malicious host VMM.
> >
> > I ask because, if the answer is "no, panic at will", then we shouldn't need
> > to burn an IST for TDX #VE. Exceptions won't morph to #VE and hitting an
> > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
>
> Or malicious hypervisor action, and that's a problem.
>
> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> actual SYSCALL text or the first memory it accesses -- I don't have a
> TDX spec so I don't know the details).

You can thank our legal department :-)

> The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered.
> The microcode will write the IRET frame, with mostly user-controlled contents,
> wherever RSP points, and RSP is also user controlled. Calling this a "panic"
> is charitable -- it's really game over against an attacker who is moderately
> clever.
>
> The kernel can't do anything about this because it's game over before
> the kernel has had the chance to execute any instructions.

Hrm, I was thinking that SMAP=1 would give the necessary protections, but in
typing that out I realized userspace can throw in an RSP value that points at
kernel memory. Duh.

One thought would be to have the TDX module (thing that runs in SEAM and sits
between the VMM and the guest) provide a TDCALL (hypercall from guest to TDX
module) to the guest that would allow the guest to specify a very limited
number of GPAs that must never generate a #VE, e.g. go straight to guest
shutdown if a disallowed GPA would go pending. That seems doable from a TDX
perspective without incurring noticeable overhead (assuming the list of GPAs
is very small) and should be easy to support in the guest, e.g. make a
TDCALL/hypercall or two during boot to protect the SYSCALL page and its
scratch data.
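
To make the idea a bit more concrete, here is a rough sketch of the policy as a
toy model in plain C. It is purely illustrative: none of these names exist in
the TDX module or in Linux, the table size of 4 is an assumption standing in
for "very limited number", and the real interface (if any) could look quite
different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model only. Every identifier below is hypothetical and does not
 * correspond to a real TDX module or Linux interface.
 */
#define MAX_PROTECTED_GPAS 4                    /* assumed "very small" list */
#define GPA_PAGE_MASK      (~(uint64_t)0xfff)   /* 4 KiB page granularity */

enum pending_action {
	DELIVER_VE,      /* normal case: the guest gets a #VE */
	SHUTDOWN_GUEST   /* protected page: never #VE, kill the guest instead */
};

struct protected_gpas {
	uint64_t gpa[MAX_PROTECTED_GPAS];
	size_t count;
};

/*
 * Models the boot-time TDCALL: the guest registers one page that must
 * never generate a #VE. Fails once the small table is full.
 */
static bool protect_gpa(struct protected_gpas *p, uint64_t gpa)
{
	if (p->count >= MAX_PROTECTED_GPAS)
		return false;
	p->gpa[p->count++] = gpa & GPA_PAGE_MASK;
	return true;
}

/*
 * Models the module's decision when a GPA would go pending: a linear
 * scan is fine because the list is tiny.
 */
static enum pending_action on_gpa_pending(const struct protected_gpas *p,
					  uint64_t gpa)
{
	for (size_t i = 0; i < p->count; i++) {
		if (p->gpa[i] == (gpa & GPA_PAGE_MASK))
			return SHUTDOWN_GUEST;
	}
	return DELIVER_VE;
}
```

In this model the guest would call protect_gpa() once or twice during boot for
the SYSCALL page and its scratch data, and any later attempt by the VMM to make
one of those pages pending would end in shutdown rather than a #VE in the
SYSCALL gap.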