From: "Kirill A. Shutemov"
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
    dave.hansen@intel.com, luto@kernel.org, peterz@infradead.org
Cc: sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com,
    ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com,
    hpa@zytor.com, jgross@suse.com, jmattson@google.com, joro@8bytes.org,
    jpoimboe@redhat.com, knsathya@kernel.org, pbonzini@redhat.com,
    sdeep@vmware.com, seanjc@google.com, tony.luck@intel.com,
    vkuznets@redhat.com, wanpengli@tencent.com, x86@kernel.org,
    linux-kernel@vger.kernel.org, "Kirill A. Shutemov"
Subject: [PATCHv3 12/32] x86/tdx: Handle in-kernel MMIO
Date: Fri, 18 Feb 2022 19:16:58 +0300
Message-Id: <20220218161718.67148-13-kirill.shutemov@linux.intel.com>
In-Reply-To: <20220218161718.67148-1-kirill.shutemov@linux.intel.com>
References: <20220218161718.67148-1-kirill.shutemov@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.

To emulate an instruction an emulator needs two things:

  - R/W access to the register file to read/modify instruction arguments
    and see RIP of the faulted instruction.

  - Read access to memory where the instruction is placed to see what to
    emulate. In this case it is guest kernel text.

Neither is available to the VMM in a TDX environment:

  - Register file is never exposed to VMM. When a TD exits to the module,
    it saves registers into the state-save area allocated for that TD.
    The module then scrubs these registers before returning execution
    control to the VMM, to help prevent leakage of TD state.

  - Memory is encrypted with a TD-private key. The CPU disallows software
    other than the TDX module and TDs from making memory accesses using
    the private key.

In TDX the MMIO regions are instead configured to trigger a #VE
exception in the guest. The guest #VE handler then emulates the MMIO
instruction inside the guest and converts it into a controlled hypercall
to the host.

MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.

readX()/writeX() helpers limit the range of instructions which can
trigger MMIO. This makes MMIO instruction emulation feasible. Raw access
to an MMIO region allows the compiler to generate whatever instruction it
wants. Supporting all possible instructions is a task of a different
scope.

MMIO access with anything other than helpers from io.h may result in
MMIO_DECODE_FAILED and an oops.

AMD SEV has the same limitations to MMIO handling.

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. With a conservative overhead estimation of 5 bytes per call
site (CALL instruction), it leads to bloating code by 600k.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests. Right now, that's
limited only to virtio and some x86-specific drivers.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.

This will be implemented in the future, removing the bulk of MMIO #VEs.

Co-developed-by: Kuppuswamy Sathyanarayanan
Signed-off-by: Kuppuswamy Sathyanarayanan
Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Kirill A. Shutemov
---
 arch/x86/coco/tdx.c | 110 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 83cbc94b30d0..74ab7c5a767d 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -8,11 +8,17 @@
 #include
 #include
 #include
+#include
+#include
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
 #define TDX_GET_VEINFO			3
 
+/* MMIO direction */
+#define EPT_READ	0
+#define EPT_WRITE	1
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
@@ -184,6 +190,108 @@ static bool handle_cpuid(struct pt_regs *regs)
 	return true;
 }
 
+static bool mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_EPT_VIOLATION,
+		.r12 = size,
+		.r13 = EPT_READ,
+		.r14 = addr,
+		.r15 = *val,
+	};
+
+	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+		return false;
+	*val = args.r11;
+	return true;
+}
+
+static bool mmio_write(int size, unsigned long addr, unsigned long val)
+{
+	return !_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE,
+			       addr, val);
+}
+
+static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	char buffer[MAX_INSN_SIZE];
+	unsigned long *reg, val;
+	struct insn insn = {};
+	enum mmio_type mmio;
+	int size, extend_size;
+	u8 extend_val = 0;
+
+	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+		return false;
+
+	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+		return false;
+
+	mmio = insn_decode_mmio(&insn, &size);
+	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+		return false;
+
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg = insn_get_modrm_reg_ptr(&insn, regs);
+		if (!reg)
+			return false;
+	}
+
+	ve->instr_len = insn.length;
+
+	switch (mmio) {
+	case MMIO_WRITE:
+		memcpy(&val, reg, size);
+		return mmio_write(size, ve->gpa, val);
+	case MMIO_WRITE_IMM:
+		val = insn.immediate.value;
+		return mmio_write(size, ve->gpa, val);
+	case MMIO_READ:
+	case MMIO_READ_ZERO_EXTEND:
+	case MMIO_READ_SIGN_EXTEND:
+		break;
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		return false;
+	default:
+		BUG();
+	}
+
+	/* Handle reads */
+	if (!mmio_read(size, ve->gpa, &val))
+		return false;
+
+	switch (mmio) {
+	case MMIO_READ:
+		/* Zero-extend for 32-bit operation */
+		extend_size = size == 4 ? sizeof(*reg) : 0;
+		break;
+	case MMIO_READ_ZERO_EXTEND:
+		/* Zero extend based on operand size */
+		extend_size = insn.opnd_bytes;
+		break;
+	case MMIO_READ_SIGN_EXTEND:
+		/* Sign extend based on operand size */
+		extend_size = insn.opnd_bytes;
+		if (size == 1 && val & BIT(7))
+			extend_val = 0xFF;
+		else if (size > 1 && val & BIT(15))
+			extend_val = 0xFF;
+		break;
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		return false;
+	default:
+		BUG();
+	}
+
+	if (extend_size)
+		memset(reg, extend_val, extend_size);
+	memcpy(reg, &val, size);
+	return true;
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -237,6 +345,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 		return write_msr(regs);
 	case EXIT_REASON_CPUID:
 		return handle_cpuid(regs);
+	case EXIT_REASON_EPT_VIOLATION:
+		return handle_mmio(regs, ve);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return false;
-- 
2.34.1
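
The register write-back at the end of handle_mmio() is the least obvious
part of the patch, so here is a minimal userspace sketch of just that
step. It is not kernel code and not part of the patch: 'enum read_kind'
and 'writeback()' are hypothetical names made up for illustration,
'opnd_bytes' stands in for insn.opnd_bytes, and the byte-wise
memset()/memcpy() sequence assumes a little-endian host, as on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the read cases of the kernel's enum mmio_type. */
enum read_kind { READ_PLAIN, READ_ZERO_EXTEND, READ_SIGN_EXTEND };

/*
 * Merge an MMIO read result of 'size' bytes into a 64-bit register image,
 * following the memset() + memcpy() sequence used by handle_mmio() in the
 * patch. Returns the new register value. Assumes a little-endian host.
 */
static uint64_t writeback(uint64_t reg, uint64_t val, int size,
                          int opnd_bytes, enum read_kind kind)
{
    int extend_size = 0;
    uint8_t extend_val = 0;

    assert(size >= 1 && size <= 8);
    assert(opnd_bytes >= 0 && opnd_bytes <= 8);

    switch (kind) {
    case READ_PLAIN:
        /* A 32-bit load clears the whole register first; others do not. */
        extend_size = (size == 4) ? (int)sizeof(reg) : 0;
        break;
    case READ_ZERO_EXTEND:
        /* Fill the destination operand with zeroes. */
        extend_size = opnd_bytes;
        break;
    case READ_SIGN_EXTEND:
        /* Fill with 0xFF when the top bit of the loaded value is set. */
        extend_size = opnd_bytes;
        if (size == 1 && (val & 0x80))
            extend_val = 0xFF;
        else if (size > 1 && (val & 0x8000))
            extend_val = 0xFF;
        break;
    }

    /* Pre-fill the extension bytes, then drop the low 'size' bytes in. */
    if (extend_size)
        memset(&reg, extend_val, (size_t)extend_size);
    memcpy(&reg, &val, (size_t)size);
    return reg;
}
```

Starting from a register image of 0x1122334455667788, a 1-byte plain read
of 0xAB yields 0x11223344556677AB (upper bytes preserved), while a 4-byte
plain read of 0xDEADBEEF yields 0x00000000DEADBEEF. A sign-extending
1-byte read of 0x80 with an 8-byte operand yields 0xFFFFFFFFFFFFFF80.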