From: "Kirill A. Shutemov"
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@intel.com, luto@kernel.org, peterz@infradead.org
Cc: sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com,
	ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com,
	hpa@zytor.com, jgross@suse.com, jmattson@google.com, joro@8bytes.org,
	jpoimboe@redhat.com, knsathya@kernel.org, pbonzini@redhat.com,
	sdeep@vmware.com, seanjc@google.com, tony.luck@intel.com,
	vkuznets@redhat.com, wanpengli@tencent.com, thomas.lendacky@amd.com,
	brijesh.singh@amd.com, x86@kernel.org, linux-kernel@vger.kernel.org,
	"Kirill A .
Shutemov"
Subject: [PATCHv4 30/30] Documentation/x86: Document TDX kernel architecture
Date: Thu, 24 Feb 2022 18:56:30 +0300
Message-Id: <20220224155630.52734-31-kirill.shutemov@linux.intel.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220224155630.52734-1-kirill.shutemov@linux.intel.com>
References: <20220224155630.52734-1-kirill.shutemov@linux.intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,
	DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE
	autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Kuppuswamy Sathyanarayanan

Document the TDX guest architecture details like #VE support, shared
memory, etc.

Signed-off-by: Kuppuswamy Sathyanarayanan
Signed-off-by: Kirill A. Shutemov
---
 Documentation/x86/index.rst |   1 +
 Documentation/x86/tdx.rst   | 196 ++++++++++++++++++++++++++++++++++++
 2 files changed, 197 insertions(+)
 create mode 100644 Documentation/x86/tdx.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index f498f1d36cd3..382e53ca850a 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -24,6 +24,7 @@ x86-specific Documentation
    intel-iommu
    intel_txt
    amd-memory-encryption
+   tdx
    pti
    mds
    microcode
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
new file mode 100644
index 000000000000..a0b603ac49ca
--- /dev/null
+++ b/Documentation/x86/tdx.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Intel Trust Domain Extensions (TDX)
+=====================================
+
+Intel's Trust Domain Extensions (TDX) protect confidential guest VMs
+from the host and physical attacks by isolating the guest register
+state and by encrypting the guest memory. In TDX, a special TDX module
+sits between the host and the guest. It runs in a special mode and
+manages the guest/host separation.
+
+Since the host cannot directly access guest registers or memory, much
+normal functionality of a hypervisor (such as trapping MMIO, some MSRs,
+some CPUIDs, and some other instructions) has to be moved into the
+guest. This is implemented using a Virtualization Exception (#VE) that
+is handled by the guest kernel. Some #VEs are handled entirely inside the
+guest kernel, but some require the hypervisor (VMM) to be involved. The TD
+hypercall mechanism allows TD guests to call TDX module or hypervisor
+functions.
+
+#VE Exceptions:
+===============
+
+In TDX guests, #VE exceptions are delivered to the guest in the following
+scenarios:
+
+* Execution of certain instructions (see list below)
+* Certain MSR accesses
+* CPUID usage (only for certain leaves)
+* Shared memory access (including MMIO)
+
+#VE due to instruction execution
+---------------------------------
+
+Intel TDX disallows execution of certain instructions in non-root
+mode. Execution of these instructions leads to a #VE or a #GP.
+
+The details are listed below.
+
+Instructions that can cause a #VE:
+
+* String I/O (INS, OUTS), IN, OUT
+* HLT
+* MONITOR, MWAIT
+* WBINVD, INVD
+* VMCALL
+
+Instructions that can cause a #GP:
+
+* All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
+  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
+* ENCLS, ENCLV
+* GETSEC
+* RSM
+* ENQCMD
+
+#VE due to MSR access
+----------------------
+
+In a TDX guest, MSR access behavior falls into three categories:
+
+* Natively supported (also called "context switched MSR")
+  No special handling is required for these MSRs in TDX guests.
+* #GP triggered
+  Disallowed MSR reads/writes lead to a #GP.
+* #VE triggered
+  All MSRs that are neither natively supported nor disallowed
+  (#GP triggered) trigger a #VE. Access to these MSRs must be
+  emulated using TDCALL.
+
+See the Intel TDX Module Specification, section "MSR Virtualization", for the
+complete list of MSRs that fall under the categories above.
+
+#VE due to CPUID instruction
+----------------------------
+
+In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized by
+the TDX module while some trigger #VE. Whether a leaf/sub-leaf triggers #VE
+is defined in the TDX spec.
+
+During TD initialization (using TDH.MNG.INIT), the VMM configures whether
+feature bits in specific leaves/sub-leaves are exposed to the TD guest.
+
+#VE on Memory Accesses
+----------------------
+
+A TD guest is in control of whether its memory accesses are treated as
+private or shared. It selects the behavior with a bit in its page table
+entries.
+
+#VE on Shared Pages
+-------------------
+
+Access to shared mappings can cause a #VE. The hypervisor controls whether
+access to a shared mapping causes a #VE, so the guest must take care to only
+reference shared pages where it can safely handle a #VE and avoid nested #VEs.
+
+The content of shared mappings is not trusted since shared memory is writable
+by the hypervisor. Shared mappings are never used for sensitive memory content
+like stacks or kernel text, only for I/O buffers and MMIO regions. The kernel
+will not encounter shared mappings in sensitive contexts like syscall entry
+or NMIs.
+
+#VE on Private Pages
+--------------------
+
+Some accesses to private mappings may cause #VEs. Before a mapping is
+accepted (that is, while it is in the SEPT_PENDING state), a reference
+causes a #VE. After acceptance, references typically succeed.
+
+The hypervisor can cause a private page reference to fail if it chooses
+to move an accepted page to a "blocked" state. However, if it does
+this, page access will not generate a #VE. It will, instead, cause a
+"TD Exit" where the hypervisor is required to handle the exception.
+
+Linux #VE handler
+-----------------
+
+Both user and kernel #VEs are handled by the tdx_handle_virt_exception()
+handler. If a #VE is handled successfully, the instruction pointer is advanced
+to complete the handling process. Otherwise, the #VE is treated as a regular
+exception and handled via fixup handlers.
+
+In TD guests, the #VE nesting problem (a #VE triggered before the current one
+is handled, also known as the syscall gap issue) is addressed by the TDX
+module ensuring that interrupts, including NMIs, are blocked. The hardware
+blocks interrupts starting with #VE delivery until TDGETVEINFO is called.
+
+The kernel must avoid triggering #VE in entry paths: do not touch TD-shared
+memory, including MMIO regions, and do not use MSRs, instructions, or
+CPUID leaves that might generate #VE.
+
+MMIO handling:
+==============
+
+In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
+mapping which will cause a VMEXIT on access, and then the VMM emulates the
+access. That's not possible in TDX guests because a VMEXIT would expose the
+register state to the host. TDX guests don't trust the host and can't have
+their state exposed to the host.
+
+In TDX, MMIO regions are instead configured to trigger a #VE
+exception in the guest. The guest #VE handler then emulates the MMIO
+instructions inside the guest and converts them into a controlled TDCALL
+to the host, rather than completely exposing the state to the host.
+
+MMIO addresses on x86 are just special physical addresses. They can be
+accessed with any instruction that accesses memory. However, the
+kernel's instruction decoding method is limited. It is only designed
+to decode instructions like those generated by io.h macros.
+
+MMIO access via other means (like structure overlays) may result in
+MMIO_DECODE_FAILED and an oops.
+
+Shared memory:
+==============
+
+Intel TDX doesn't allow the VMM to access guest private memory. Any
+memory that is required for communication with the VMM must be shared
+explicitly by setting the bit in the page table entry. The shared bit
+can be enumerated with TDX_GET_INFO.
+
+After setting the shared bit, the conversion must be completed with the
+MapGPA hypercall. The call informs the VMM about the conversion between
+private/shared mappings.
+
+set_memory_decrypted() converts a range of pages to shared.
+set_memory_encrypted() converts memory back to private.
+
+Device drivers are the primary users of shared memory, but there's no
+need to touch every driver. DMA buffers and ioremap()'ed regions are
+converted to shared automatically.
+
+TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
+converted to shared on boot.
+
+For coherent DMA allocations, the DMA buffer gets converted at
+allocation time. See force_dma_unencrypted() for details.
+
+References
+==========
+
+More details about the TDX module (and its response to MSR, memory access,
+I/O, CPUID, etc.) can be found at:
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
+
+More details about the TDX hypercall and TDX module call ABI can be found
+at:
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
+
+More details about TDVF requirements can be found at:
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
-- 
2.34.1
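
As a rough illustration of the "Linux #VE handler" section in the patch above, the success/failure flow might be modelled in plain user-space C as follows. This is only a sketch and not the kernel implementation: every name here (ve_cause_sketch, ve_info_sketch, regs_sketch, emulate_sketch(), handle_ve_sketch()) is hypothetical, and the assumption that the instruction pointer is advanced by the length of the faulting instruction is not stated in the document itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical #VE causes, one per scenario listed under "#VE Exceptions". */
enum ve_cause_sketch {
	VE_CAUSE_INSTRUCTION,	/* e.g. HLT or port I/O */
	VE_CAUSE_MSR,
	VE_CAUSE_CPUID,
	VE_CAUSE_SHARED_MEMORY,	/* includes MMIO */
	VE_CAUSE_UNKNOWN,	/* anything the handler cannot emulate */
};

/* Hypothetical stand-ins for the #VE information and the saved registers. */
struct ve_info_sketch {
	enum ve_cause_sketch cause;
	uint32_t instr_len;	/* assumed: length of the faulting instruction */
};

struct regs_sketch {
	uint64_t ip;
};

/* Placeholder emulation: succeeds for every known cause, fails otherwise. */
static bool emulate_sketch(const struct ve_info_sketch *ve)
{
	return ve->cause != VE_CAUSE_UNKNOWN;
}

/*
 * Returns true and advances the instruction pointer when the #VE was
 * handled. Returns false and leaves the registers untouched otherwise, so
 * the caller can treat the event as a regular exception (fixup handlers).
 */
bool handle_ve_sketch(struct regs_sketch *regs, const struct ve_info_sketch *ve)
{
	assert(regs != NULL && ve != NULL);

	if (!emulate_sketch(ve))
		return false;

	regs->ip += ve->instr_len;
	return true;
}
```

In this model the instruction pointer moves only on the success path, which is what keeps an emulated instruction from being executed a second time when the guest resumes.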