From: "Kirill A. Shutemov"
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@intel.com, luto@kernel.org, peterz@infradead.org
Cc: sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com,
	ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com,
	hpa@zytor.com, jgross@suse.com, jmattson@google.com, joro@8bytes.org,
	jpoimboe@redhat.com, knsathya@kernel.org, pbonzini@redhat.com,
	sdeep@vmware.com, seanjc@google.com, tony.luck@intel.com,
	vkuznets@redhat.com, wanpengli@tencent.com, thomas.lendacky@amd.com,
	brijesh.singh@amd.com, x86@kernel.org, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov", Dave Hansen
Subject: [PATCHv8 11/30] x86/tdx: Handle in-kernel MMIO
Date: Wed, 6 Apr 2022 02:29:20 +0300
Message-Id: <20220405232939.73860-12-kirill.shutemov@linux.intel.com>
In-Reply-To: <20220405232939.73860-1-kirill.shutemov@linux.intel.com>
References: <20220405232939.73860-1-kirill.shutemov@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for a TDX VM.

To emulate an instruction an emulator needs two things:

  - R/W access to the register file to read/modify instruction arguments
    and see RIP of the faulted instruction.

  - Read access to memory where the instruction is placed to see what to
    emulate. In this case it is guest kernel text.

Neither is available to the VMM in a TDX environment:

  - The register file is never exposed to the VMM. When a TD exits to the
    module, it saves registers into the state-save area allocated for
    that TD. The module then scrubs these registers before returning
    execution control to the VMM, to help prevent leakage of TD state.

  - TDX does not allow guests to execute from shared memory. All executed
    instructions are in TD-private memory. Because that memory is private
    to the TD, the VMM has no way to access it and so no way to read the
    instruction to decode and emulate it.

In TDX the MMIO regions are instead configured by the VMM to trigger a
#VE exception in the guest.

Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.

This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use arbitrary devices in their deployments.

In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.

This approach is functional for all in-kernel MMIO users, current and
future, and does so with a minimal amount of code and kernel image bloat.

This patch handles only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.

Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO accesses are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.

The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.

MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much broader set of
instructions which cannot practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.

This means that TDX guests *must* use the io.h helpers to access MMIO.

This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place.
Similar to the port I/O case, it is theoretically possible to
paravirtualize MMIO accesses. Like the exception-based approach offered
here, a fully paravirtualized approach would be limited to MMIO users
that leverage common infrastructure like the io.h macros.

However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimate of 5 bytes per call site (CALL instruction), that
bloats the code by about 600k bytes.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical enough to justify the effort. Right now, that's limited only to
virtio.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.

This approach will be adopted in the future, removing the bulk of MMIO
#VEs. The #VE-based MMIO will remain to serve non-virtio use cases.

Co-developed-by: Kuppuswamy Sathyanarayanan
Signed-off-by: Kuppuswamy Sathyanarayanan
Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Kirill A. Shutemov
Reviewed-by: Dave Hansen
Reviewed-by: Thomas Gleixner
---
 arch/x86/coco/tdx/tdx.c | 121 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 121 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 50c3b97d6db7..ab10bc73a7c5 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,11 +8,17 @@
 #include
 #include
 #include
+#include
+#include

 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
 #define TDX_GET_VEINFO			3

+/* MMIO direction */
+#define EPT_READ	0
+#define EPT_WRITE	1
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -222,6 +228,119 @@ static bool handle_cpuid(struct pt_regs *regs)
 	return true;
 }

+static bool mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
+		.r12 = size,
+		.r13 = EPT_READ,
+		.r14 = addr,
+		.r15 = *val,
+	};
+
+	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+		return false;
+	*val = args.r11;
+	return true;
+}
+
+static bool mmio_write(int size, unsigned long addr, unsigned long val)
+{
+	return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
+			       EPT_WRITE, addr, val);
+}
+
+static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	char buffer[MAX_INSN_SIZE];
+	unsigned long *reg, val;
+	struct insn insn = {};
+	enum mmio_type mmio;
+	int size, extend_size;
+	u8 extend_val = 0;
+
+	/* Only in-kernel MMIO is supported */
+	if (WARN_ON_ONCE(user_mode(regs)))
+		return false;
+
+	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+		return false;
+
+	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+		return false;
+
+	mmio = insn_decode_mmio(&insn, &size);
+	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+		return false;
+
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg = insn_get_modrm_reg_ptr(&insn, regs);
+		if (!reg)
+			return false;
+	}
+
+	ve->instr_len = insn.length;
+
+	/* Handle writes first */
+	switch (mmio) {
+	case MMIO_WRITE:
+		memcpy(&val, reg, size);
+		return mmio_write(size, ve->gpa, val);
+	case MMIO_WRITE_IMM:
+		val = insn.immediate.value;
+		return mmio_write(size, ve->gpa, val);
+	case MMIO_READ:
+	case MMIO_READ_ZERO_EXTEND:
+	case MMIO_READ_SIGN_EXTEND:
+		/* Reads are handled below */
+		break;
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		/*
+		 * MMIO was accessed with an instruction that could not be
+		 * decoded or handled properly. It was likely not using io.h
+		 * helpers or accessed MMIO accidentally.
+		 */
+		return false;
+	default:
+		WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
+		return false;
+	}
+
+	/* Handle reads */
+	if (!mmio_read(size, ve->gpa, &val))
+		return false;
+
+	switch (mmio) {
+	case MMIO_READ:
+		/* Zero-extend for 32-bit operation */
+		extend_size = size == 4 ? sizeof(*reg) : 0;
+		break;
+	case MMIO_READ_ZERO_EXTEND:
+		/* Zero extend based on operand size */
+		extend_size = insn.opnd_bytes;
+		break;
+	case MMIO_READ_SIGN_EXTEND:
+		/* Sign extend based on operand size */
+		extend_size = insn.opnd_bytes;
+		if (size == 1 && val & BIT(7))
+			extend_val = 0xFF;
+		else if (size > 1 && val & BIT(15))
+			extend_val = 0xFF;
+		break;
+	default:
+		/* All other cases have to be covered by the first switch() */
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	if (extend_size)
+		memset(reg, extend_val, extend_size);
+	memcpy(reg, &val, size);
+	return true;
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -276,6 +395,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 		return write_msr(regs);
 	case EXIT_REASON_CPUID:
 		return handle_cpuid(regs);
+	case EXIT_REASON_EPT_VIOLATION:
+		return handle_mmio(regs, ve);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return false;
-- 
2.35.1
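
Note on the read path: handle_mmio() finishes a read by optionally
filling the destination register with an extension byte (memset) and
then copying the 'size' bytes returned by the host over the low end
(memcpy). The following is a minimal userspace sketch of that tail, not
kernel code. The names finish_mmio_read() and enum read_kind are
hypothetical stand-ins for the logic and the enum mmio_type read values
in the patch, and the sketch assumes a little-endian machine, as on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the MMIO_READ* values of enum mmio_type. */
enum read_kind { READ_PLAIN, READ_ZERO_EXTEND, READ_SIGN_EXTEND };

/*
 * Userspace model of the end of handle_mmio(): merge a 'size'-byte value
 * into a 64-bit destination register. 'opnd_bytes' plays the role of
 * insn.opnd_bytes. Assumes little-endian byte order (illustration only).
 */
static void finish_mmio_read(uint64_t *reg, uint64_t val, int size,
			     enum read_kind kind, int opnd_bytes)
{
	int extend_size = 0;
	uint8_t extend_val = 0;

	assert(size >= 1 && size <= (int)sizeof(*reg));

	switch (kind) {
	case READ_PLAIN:
		/* Only a 32-bit load clears the rest of the register. */
		extend_size = size == 4 ? (int)sizeof(*reg) : 0;
		break;
	case READ_ZERO_EXTEND:
		/* Fill the operand with zero bytes before the copy. */
		extend_size = opnd_bytes;
		break;
	case READ_SIGN_EXTEND:
		/* Fill with 0xFF when the sign bit of the source is set. */
		extend_size = opnd_bytes;
		if (size == 1 && (val & (1u << 7)))
			extend_val = 0xFF;
		else if (size > 1 && (val & (1u << 15)))
			extend_val = 0xFF;
		break;
	}

	if (extend_size)
		memset(reg, extend_val, extend_size);
	/* On little-endian, the low 'size' bytes of val land at the low end. */
	memcpy(reg, &val, size);
}
```

Under these assumptions, starting from reg = 0x1122334455667788, a plain
1-byte read of 0xAB leaves 0x11223344556677AB (upper bytes preserved), a
plain 4-byte read of 0xDEADBEEF leaves 0x00000000DEADBEEF, and a
sign-extending 1-byte read of 0x80 with an 8-byte operand leaves
0xFFFFFFFFFFFFFF80.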