Shutemov" , tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, luto@kernel.org, peterz@infradead.org Cc: sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com, ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com, hpa@zytor.com, jgross@suse.com, jmattson@google.com, joro@8bytes.org, jpoimboe@redhat.com, knsathya@kernel.org, pbonzini@redhat.com, sdeep@vmware.com, seanjc@google.com, tony.luck@intel.com, vkuznets@redhat.com, wanpengli@tencent.com, thomas.lendacky@amd.com, brijesh.singh@amd.com, x86@kernel.org, linux-kernel@vger.kernel.org References: <20220224155630.52734-1-kirill.shutemov@linux.intel.com> <20220224155630.52734-12-kirill.shutemov@linux.intel.com> From: Dave Hansen Subject: Re: [PATCHv4 11/30] x86/tdx: Handle in-kernel MMIO In-Reply-To: <20220224155630.52734-12-kirill.shutemov@linux.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A, RCVD_IN_DNSWL_HI,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE, SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2/24/22 07:56, Kirill A. Shutemov wrote: > In non-TDX VMs, MMIO is implemented by providing the guest a mapping > which will cause a VMEXIT on access and then the VMM emulating the > instruction that caused the VMEXIT. That's not possible for TDX VM. > > To emulate an instruction an emulator needs two things: > > - R/W access to the register file to read/modify instruction arguments > and see RIP of the faulted instruction. > > - Read access to memory where instruction is placed to see what to > emulate. In this case it is guest kernel text. > > Both of them are not available to VMM in TDX environment: > > - Register file is never exposed to VMM. When a TD exits to the module, > it saves registers into the state-save area allocated for that TD. > The module then scrubs these registers before returning execution > control to the VMM, to help prevent leakage of TD state. > > - Memory is encrypted a TD-private key. The CPU disallows software > other than the TDX module and TDs from making memory accesses using > the private key. > > In TDX the MMIO regions are instead configured to trigger a #VE > exception in the guest. The guest #VE handler then emulates the MMIO > instruction inside the guest and converts it into a controlled hypercall > to the host. Nit on the changelog: This never really comes out and explicitly says what *this* patch does. It never transitions into imperative voice. Maybe: In TDX, MMIO regions are configured by ____ to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall. I found this next transition jarring. Maybe add a section title: === Limitations of this approach === > MMIO addresses can be used with any CPU instruction that accesses > memory. Address only MMIO accesses done via io.h helpers, such as > 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. > readX()/writeX() helpers limit the range of instructions which can trigger > MMIO. It makes MMIO instruction emulation feasible. 
> === Potential alternative approaches ===
>
> == Paravirtualizing all MMIO ==
>
> An alternative to letting MMIO induce a #VE exception is to avoid
> the #VE in the first place. Similar to the port I/O case, it is
> theoretically possible to paravirtualize MMIO accesses.
>
> Like the exception-based approach offered here, a fully paravirtualized
> approach would be limited to MMIO users that leverage common
> infrastructure like the io.h macros.
>
> However, any paravirtual approach would be patching approximately
> 120k call sites. With a conservative overhead estimation of 5 bytes per
> call site (CALL instruction), it leads to bloating code by 600k.

There's one important detail missing there: any paravirtual approach
would need to replace a bare memory access instruction with (at least)
a function call.
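Roughly, that per-call-site trade-off looks like this. Hypothetical
sketch only, not code from this series; tdx_hcall_mmio_write() is a
made-up helper name standing in for the MMIO hypercall:

	#include <linux/io.h>
	#include <asm/cpufeature.h>

	/* Hypothetical, does not exist: MMIO-write hypercall wrapper. */
	extern void tdx_hcall_mmio_write(int size, unsigned long addr,
					 unsigned long val);

	/*
	 * Hypothetical paravirtual accessor: every former inline MOV
	 * becomes a CALL to something like this.
	 */
	void pv_mmio_writel(u32 val, volatile void __iomem *addr)
	{
		if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
			tdx_hcall_mmio_write(4, (unsigned long)addr, val);
		else
			writel(val, addr);	/* native path: a single MOV */
	}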
> Many drivers will never be used in the TDX environment and the bloat
> cannot be justified.
>
> == Patching TDX drivers ==
>
> Rather than touching the entire kernel, it might also be possible to
> just go after drivers that use MMIO in TDX guests. Right now, that's
> limited only to virtio and some x86-specific drivers.
>
> All virtio MMIO appears to be done through a single function, which
> makes virtio eminently easy to patch. This will be implemented in the
> future, removing the bulk of MMIO #VEs.

Given what is written here, this sounds like a great solution,
especially compared to all the instruction decoding nastiness. What's
wrong with it?

> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index fd78b81a951d..15519e498679 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -8,11 +8,17 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO		1
>  #define TDX_GET_VEINFO		3
>
> +/* MMIO direction */
> +#define EPT_READ	0
> +#define EPT_WRITE	1
> +
>  static struct {
>  	unsigned int gpa_width;
>  	unsigned long attributes;
> @@ -184,6 +190,108 @@ static bool handle_cpuid(struct pt_regs *regs)
>  	return true;
>  }
>
> +static bool mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> +	struct tdx_hypercall_args args = {
> +		.r10 = TDX_HYPERCALL_STANDARD,
> +		.r11 = EXIT_REASON_EPT_VIOLATION,
> +		.r12 = size,
> +		.r13 = EPT_READ,
> +		.r14 = addr,
> +		.r15 = *val,
> +	};
> +
> +	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
> +		return false;
> +	*val = args.r11;
> +	return true;
> +}
> +
> +static bool mmio_write(int size, unsigned long addr, unsigned long val)
> +{
> +	return !_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE,
> +			       addr, val);
> +}
> +
> +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	char buffer[MAX_INSN_SIZE];
> +	unsigned long *reg, val;
> +	struct insn insn = {};
> +	enum mmio_type mmio;
> +	int size, extend_size;
> +	u8 extend_val = 0;
> +
> +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> +		return false;
> +
> +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> +		return false;
> +
> +	mmio = insn_decode_mmio(&insn, &size);
> +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> +		return false;
> +
> +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> +		reg = insn_get_modrm_reg_ptr(&insn, regs);
> +		if (!reg)
> +			return false;
> +	}
> +
> +	ve->instr_len = insn.length;
> +
> +	switch (mmio) {
> +	case MMIO_WRITE:
> +		memcpy(&val, reg, size);
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_WRITE_IMM:
> +		val = insn.immediate.value;
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_READ:
> +	case MMIO_READ_ZERO_EXTEND:
> +	case MMIO_READ_SIGN_EXTEND:
> +		break;
> +	case MMIO_MOVS:
> +	case MMIO_DECODE_FAILED:
> +		return false;
> +	default:
> +		BUG();
> +	}

Given the huge description above, it's borderline criminal to not
discuss what could lead to this BUG(). It could literally be some minor
tweak in the compiler that caused a non-io.h-using MMIO access to be
converted over to an instruction that can't be decoded.

Could we spend a few lines of comments to help out the future poor sod
that sees "kernel bug at foo.c:1234"? Maybe:

	/*
	 * MMIO was accessed with an instruction that could not
	 * be decoded. It was likely not using io.h helpers or
	 * accessed MMIO accidentally.
	 */
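For the record, here is the kind of access that gets you there.
Hypothetical driver code, not from this patch:

	#include <linux/io.h>

	/* Hypothetical: a "raw" MMIO access that bypasses io.h. */
	void bump_doorbell(void __iomem *mmio)
	{
		/*
		 * The compiler may emit a read-modify-write such as
		 * "addl $1, (%rdi)" for this.  insn_decode_mmio() has
		 * no case for that, so the guest trips the
		 * MMIO_DECODE_FAILED/WARN_ON_ONCE() path above instead
		 * of being cleanly emulated.
		 */
		*(volatile u32 *)mmio += 1;

		/* The io.h form stays decodable: one MOV out, one MOV in. */
		writel(readl(mmio) + 1, mmio);
	}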
> +	/* Handle reads */
> +	if (!mmio_read(size, ve->gpa, &val))
> +		return false;
> +
> +	switch (mmio) {
> +	case MMIO_READ:
> +		/* Zero-extend for 32-bit operation */
> +		extend_size = size == 4 ? sizeof(*reg) : 0;
> +		break;
> +	case MMIO_READ_ZERO_EXTEND:
> +		/* Zero extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		break;
> +	case MMIO_READ_SIGN_EXTEND:
> +		/* Sign extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		if (size == 1 && val & BIT(7))
> +			extend_val = 0xFF;
> +		else if (size > 1 && val & BIT(15))
> +			extend_val = 0xFF;
> +		break;
> +	case MMIO_MOVS:
> +	case MMIO_DECODE_FAILED:
> +		return false;
> +	default:
> +		BUG();
> +	}
> +
> +	if (extend_size)
> +		memset(reg, extend_val, extend_size);
> +	memcpy(reg, &val, size);
> +	return true;
> +}
> +
>  void tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -237,6 +345,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  		return write_msr(regs);
>  	case EXIT_REASON_CPUID:
>  		return handle_cpuid(regs);
> +	case EXIT_REASON_EPT_VIOLATION:
> +		return handle_mmio(regs, ve);
>  	default:
>  		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>  		return false;
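FWIW, to make the extend logic above concrete, here is a throwaway
userspace demo (obviously not kernel code) of what the
memset()+memcpy() pair produces for a movsbq-style 1-byte read that
returned 0x80:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		uint64_t reg = 0xdeadbeefdeadbeef;  /* stale register value */
		uint64_t val = 0x80;                /* 1-byte MMIO read result */
		int size = 1, extend_size = 8;      /* insn.opnd_bytes for a 64-bit dest */
		uint8_t extend_val = (val & 0x80) ? 0xFF : 0x00;

		memset(&reg, extend_val, extend_size);  /* fill with the sign bits */
		memcpy(&reg, &val, size);               /* low byte from the device */

		/* prints 0xffffffffffffff80 */
		printf("%#llx\n", (unsigned long long)reg);
		return 0;
	}

A zero-extending read takes the same path with extend_val == 0, leaving
the upper bytes clear.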