Date: Thu, 7 Dec 2023 01:54:15 +0300
From: "Kirill A. Shutemov"
To: Jeremi Piotrowski
Cc: "Reshetova, Elena", linux-kernel@vger.kernel.org, Borislav Petkov,
	Dave Hansen, "H. Peter Anvin", Ingo Molnar, Michael Kelley,
	Nikolay Borisov, Peter Zijlstra, Thomas Gleixner, Tom Lendacky,
	x86@kernel.org, "Cui, Dexuan", linux-hyperv@vger.kernel.org,
	stefan.bader@canonical.com, tim.gardner@canonical.com,
	roxana.nicolescu@canonical.com, cascardo@canonical.com,
	kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	sashal@kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init
Message-ID: <20231206225415.zxfm2ndpwsmthc6e@box.shutemov.name>
References: <20231122170106.270266-1-jpiotrowski@linux.microsoft.com>
	<9ab71fee-be9f-4afc-8098-ad9d6b667d46@linux.microsoft.com>
	<20231205105407.vp2rejqb5avoj7mx@box.shutemov.name>
	<0c4e33f0-6207-448d-a692-e81391089bea@linux.microsoft.com>
In-Reply-To: <0c4e33f0-6207-448d-a692-e81391089bea@linux.microsoft.com>
On Wed, Dec 06, 2023 at 06:49:11PM +0100, Jeremi Piotrowski wrote:
> On 05/12/2023 11:54, Kirill A. Shutemov wrote:
> > On Mon, Dec 04, 2023 at 08:07:38PM +0100, Jeremi Piotrowski wrote:
> >> On 04/12/2023 10:17, Reshetova, Elena wrote:
> >>>> Check for additional CPUID bits to identify TDX guests running with
> >>>> Trust Domain (TD) partitioning enabled. TD partitioning is like
> >>>> nested virtualization inside the Trust Domain, so there is an L1 TD
> >>>> VM(M) and there can be L2 TD VM(s).
> >>>>
> >>>> In this arrangement we are not guaranteed that TDX_CPUID_LEAF_ID is
> >>>> visible to Linux running as an L2 TD VM. This is because a majority
> >>>> of TDX facilities are controlled by the L1 VMM, and the L2 TDX guest
> >>>> needs to use TD-partitioning-aware mechanisms for what's left. So
> >>>> currently such guests do not have X86_FEATURE_TDX_GUEST set.
> >>>
> >>> Back to this concrete patch. Why can't the L1 VMM emulate the correct
> >>> value of TDX_CPUID_LEAF_ID for the L2 VM? It can do this per the TDX
> >>> partitioning architecture. How do you currently handle this and other
> >>> CPUID calls in L1? Per the spec, all CPUID calls from L2 cause an
> >>> L2 --> L1 exit, so what do you do in L1?
> >>
> >> The disclaimer here is that I don't have access to the paravisor (L1)
> >> code. But to the best of my knowledge, L1 handles CPUID calls by
> >> calling into the TDX module or by synthesizing a response itself.
> >> TDX_CPUID_LEAF_ID is not provided to the L2 guest in order to
> >> discriminate a guest that is solely responsible for every TDX
> >> mechanism (running at L1) from one running at L2 that has to
> >> cooperate with L1. More below.
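For reference, the probe in question looks roughly like this. This is a
minimal standalone sketch, not kernel code: leaf 0x21 (TDX_CPUID_LEAF_ID)
and the "IntelTDX    " signature are the real ABI, while the function name
and standalone form are just for illustration. Whether L2 ever sees this
leaf is precisely what the L1 VMM controls.

#include <cpuid.h>
#include <stdbool.h>
#include <string.h>

#define TDX_CPUID_LEAF_ID	0x21

static bool probe_tdx_guest(void)
{
	unsigned int eax, sig[3];

	/* Leaf 0x21 may not exist; check the highest basic leaf first. */
	if (__get_cpuid_max(0, NULL) < TDX_CPUID_LEAF_ID)
		return false;

	/* The signature "IntelTDX    " is laid out across EBX, EDX, ECX. */
	__cpuid_count(TDX_CPUID_LEAF_ID, 0, eax, sig[0], sig[2], sig[1]);
	(void)eax;	/* EAX of this leaf is not part of the signature */

	return memcmp("IntelTDX    ", sig, sizeof(sig)) == 0;
}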
> >>>
> >>> Given that you do that simple emulation, you already end up with TDX
> >>> guest code being activated. Next you can check which features you
> >>> won't be able to provide in L1 and create simple emulation calls for
> >>> the TDG calls that must be supported and cannot return an error. The
> >>> biggest TDG call (TDVMCALL) is already a direct call into the L0 VMM,
> >>> so this part doesn't require L1 VMM support.
> >>
> >> I don't see anything in the TD-partitioning spec that gives the TDX
> >> guest a way to detect whether it's running at L2 or L1, or to check
> >> whether TDVMCALLs go to L0 or L1. So in any case this requires an
> >> extra CPUID call to establish the environment. Given that, exposing
> >> TDX_CPUID_LEAF_ID to the guest doesn't help.
> >>
> >> I'll give some examples of where the idea of emulating a TDX
> >> environment without attempting L1-L2 cooperation breaks down.
> >>
> >> hlt: if the guest issues a hlt TDVMCALL it goes to L0, but if it
> >> issues a classic hlt it traps to L1. The hlt should definitely go to
> >> L1 so that L1 has a chance to do housekeeping.
> >
> > Why would L2 issue a HLT TDVMCALL? It only happens in response to #VE,
> > but with partitioning enabled, #VEs are routed to L1 anyway.
>
> What about tdx_safe_halt? When X86_FEATURE_TDX_GUEST is defined I see
> "using TDX aware idle routing" in dmesg.

Yeah. I forgot about this one. My bad. :/

I think it makes a case for more fine-grained control over where a
TDVMCALL is routed: to L1 or to L0. I think the TDX module can do that.

BTW, what kind of housekeeping do you do in L1 for the HLT case?
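For the record, the two idle paths we are talking about look roughly like
this (condensed from arch/x86/coco/tdx/tdx.c as I read it; the struct and
helper names follow the kernel of this timeframe, but treat it as a
sketch, not compile-exact code):

#define EXIT_REASON_HLT		12

static void tdx_idle(void)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = EXIT_REASON_HLT,	/* operation for the VMM to emulate */
		.r12 = irqs_disabled(),
	};

	/*
	 * TDVMCALL: with TD partitioning this exits straight to the L0
	 * VMM and bypasses the L1 paravisor entirely.
	 */
	__tdx_hypercall(&args);
}

static void plain_idle(void)
{
	/* A classic HLT, by contrast, traps to L1. */
	native_safe_halt();
}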
> >> map gpa: say the guest uses the MAP_GPA TDVMCALL. This goes to L0,
> >> not L1, which is the actual entity that needs to have a say in
> >> performing the conversion. L1 can't act on the request if L0 were to
> >> forward it, because of the CoCo threat model. So L1 and L2 get out
> >> of sync. The only safe approach is for L2 to use a different
> >> mechanism to trap to L1 explicitly.
> >
> > Hm? L1 is always in the loop on share<->private conversion. I don't
> > know why you would need MAP_GPA for that.
> >
> > You can't rely on MAP_GPA anyway. It is optional (unfortunately).
> > Conversion doesn't require a MAP_GPA call.
>
> I'm sorry, I don't quite follow. I'm reading tdx_enc_status_changed():
> - TDVMCALL_MAP_GPA is issued for all transitions
> - TDX_ACCEPT_PAGE is issued for shared->private transitions

I am talking about the TDX architecture. It doesn't require a MAP_GPA
call. Just setting the shared bit and touching the page will do the
conversion. MAP_GPA is "being nice" on the guest's behalf.

Linux does MAP_GPA all the time. Or tries to. I once had a bug where I
converted a page by mistake this way. It was a pain to debug.

My point is that if you *must* catch all conversions in L1, MAP_GPA is
not a reliable way to do it.

> This doesn't work in partitioning when TDVMCALLs go to L0:
> TDVMCALL_MAP_GPA bypasses L1, and TDX_ACCEPT_PAGE is L1's
> responsibility.
>
> If you want to see how this is currently supported, take a look at
> arch/x86/hyperv/ivm.c. All memory starts as private, and there is a
> hypercall to notify the paravisor for both TDX (when partitioning) and
> SNP (when VMPL). This guarantees that all page conversions go through
> L1.

But L1 is in the loop on page conversion anyway, and it has to manage
the aliases with TDG.MEM.PAGE.ATTR.RD/WR. Why do you need MAP_GPA for
that?
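To make the split concrete, here is the conversion path being discussed,
condensed from tdx_enc_status_changed() and its helpers in
arch/x86/coco/tdx/tdx.c (error handling and the page-accept loop are
elided, so treat this as a sketch):

#define TDVMCALL_MAP_GPA	0x10001

static bool enc_status_changed(phys_addr_t start, phys_addr_t end, bool enc)
{
	if (!enc) {
		/* Going shared: set the shared (decrypted) bit in the GPAs. */
		start |= cc_mkdec(0);
		end   |= cc_mkdec(0);
	}

	/*
	 * MAP_GPA is a TDVMCALL: when TDVMCALLs are routed to L0, the L1
	 * paravisor never sees this notification.
	 */
	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
		return false;

	/*
	 * shared->private additionally requires accepting the pages
	 * (TDG.MEM.PAGE.ACCEPT) -- a TDCALL that, with partitioning,
	 * is L1's responsibility.
	 */
	if (enc)
		return tdx_accept_memory(start, end);

	return true;
}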
> >> Having a paravisor is required to support a TPM, and having TDVMCALLs
> >> go to L0 is required to make performance viable for real workloads.
> >>
> >>> Until we really see what breaks with this approach, I don't think it
> >>> is worth taking on the complexity of supporting different L1
> >>> hypervisors' views on partitioning.
> >>
> >> I'm not asking you to support different L1 hypervisors' views on
> >> partitioning. I want to clean up the code (by fixing assumptions that
> >> no longer hold) for the model I'm describing: one the kernel already
> >> supports, that has a working implementation, and that has actual
> >> users. This is also a model that Intel intentionally created the
> >> TD-partitioning spec to support.
> >>
> >> So let's work together to make X86_FEATURE_TDX_GUEST match reality.
> >
> > I think the right direction is to make the TDX architecture good
> > enough without that. If we need more hooks in the TDX module to give
> > the required control to L1, let's do that. (I don't see the need so
> > far.)
>
> I'm not the right person to propose changes to the TDX module; I barely
> know anything about TDX. The team that develops the paravisor
> collaborates with Intel on it and was also consulted in the
> TD-partitioning design.

One possible change I mentioned above: make TDVMCALL exit to L1 for some
TDVMCALL leafs (or something along those lines).

I would like to keep it transparent for an enlightened TDX Linux guest.
It should not care whether it runs as L1 or as L2 in your environment.

> I'm also not sure what kind of changes you envision. Everything is
> supported by the kernel already, and the paravisor ABI is meant to stay
> vendor-independent.
>
> What I'm trying to accomplish is better integration with the
> non-partitioning side of TDX, so that users don't see "Memory
> Encryption Features active: AMD SEV" when running on Intel TDX with a
> paravisor.

This part is cosmetics and doesn't make much difference.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov