Date: Sat, 21 Jan 2023 00:12:08 +0000
From: Sean Christopherson
To: David Matlack
Cc: "Huang, Kai", "sean.j.christopherson@intel.com", "Shahar, Sagi",
    "isaku.yamahata@gmail.com", "Aktas, Erdem", "kvm@vger.kernel.org",
    "pbonzini@redhat.com", "zhi.wang.linux@gmail.com",
    "linux-kernel@vger.kernel.org", "Yamahata, Isaku"
Subject: Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
Message-ID:
References: <20230117214414.00003229@gmail.com>
 <02b0e551647beed9ec3a2fefd3b659eb52c4846c.camel@intel.com>
 <1c71eda35e03372f29162c6a5286f5b4d1e1d7e1.camel@intel.com>
In-Reply-To:

On Fri, Jan 20, 2023, David Matlack wrote:
> On Thu, Jan 19, 2023 at 4:16 PM Sean Christopherson wrote:
> >
> > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > > > The least invasive idea I have is to expand the TDP MMU's concept of "frozen" SPTEs
> > > > and freeze (a.k.a. lock) the SPTE (KVM's mirror) until the corresponding S-EPT
> > > > update completes.

Doh, the proposed patches already do a freeze+unfreeze.  Sorry, I obviously
didn't read the most recent changelog and haven't looked at the code in a few
versions.
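For context, the freeze+unfreeze pattern being discussed looks roughly like the
self-contained userspace sketch below (C11 atomics; the sentinel value and
helper name are made up purely for illustration, this is not the actual TDP MMU
code):

/*
 * Illustrative only: CAS the mirror SPTE to a magic "frozen" value so that
 * concurrent walkers know the entry is in-flux and must retry/bail, do the
 * slow S-EPT update, then publish the final value.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FROZEN_SPTE	((uint64_t)0xdead << 3)	/* hypothetical sentinel */

static bool frozen_update(_Atomic uint64_t *sptep, uint64_t old_spte,
			  uint64_t new_spte)
{
	/* Freeze: losing the race here means the caller must retry/bail. */
	if (!atomic_compare_exchange_strong(sptep, &old_spte, FROZEN_SPTE))
		return false;

	/*
	 * The slow S-EPT update (SEAMCALL) would go here; any walker that
	 * observes FROZEN_SPTE treats the entry as in-flux and retries.
	 */

	/* Unfreeze: publish the final value. */
	atomic_store(sptep, new_spte);
	return true;
}

int main(void)
{
	_Atomic uint64_t spte = 0;
	bool ok = frozen_update(&spte, 0, 0x1007);

	printf("updated: %d, spte = %#llx\n", ok,
	       (unsigned long long)atomic_load(&spte));
	return 0;
}

The property that matters is that the S-EPT update happens while no other
walker can consume a half-baked SPTE, which is why the affected flows need the
yield/retry behavior discussed below.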
> > > This will introduce another "having-to-wait while SPTE is frozen" problem I
> > > think, which IIUC means (one way is) you have to do some loop and retry, perhaps
> > > similar to yield_safe.
> >
> > Yes, but because the TDP MMU already freezes SPTEs (just for a shorter duration),
> > I'm 99% sure all of the affected flows already know how to yield/bail when necessary.
> >
> > The problem with the zero-step mitigation is that it could (theoretically) cause
> > a "busy" error on literally any access, which makes it infeasible for KVM to have
> > sane behavior.  E.g. freezing SPTEs to avoid the ordering issues isn't necessary
> > when holding mmu_lock for write, whereas the zero-step madness brings everything
> > into play.
>
> (I'm still ramping up on TDX so apologies in advance if the following
> is totally off base.)
>
> The complexity, and to a lesser extent the memory overhead, of
> mirroring Secure EPT tables with the TDP MMU makes me wonder if it is
> really worth it.  Especially since the initial TDX support has so many
> constraints that would seem to allow a simpler implementation: all
> private memory is pinned, no live migration support, no test/clear
> young notifiers, etc.
>
> For the initial version of KVM TDX support, what if we implemented the
> Secure EPT management entirely off to the side? i.e. not on top of the
> TDP MMU.  For example, write TDX-specific routines for:
>
> - Fully populating the Secure EPT tree some time during VM creation.

Fully populating S-EPT is likely not a viable option for performance reasons.
The number of SEAMCALLs needed for a large VM would be quite enormous, and to
avoid faults entirely the backing memory would also need to be pre-allocated.

> - Tearing down the Secure EPT tree during VM destruction.
> - Support for unmapping/mapping specific regions of the Secure EPT
>   tree for private<->shared conversions.
>
> With that in place, KVM never would need to handle a fault on a Secure
> EPT mapping.  Any fault (e.g. due to an in-progress private<->shared
> conversion) can just return back to the guest to retry the memory
> access until the operation is complete.

I don't think this will work.  Or at least, it's not that simple.  TDX and SNP
both require the host to support implicit conversions from shared => private,
i.e. KVM needs to be able to handle faults on private memory at any time.  KVM
could rely on the attributes xarray to know if KVM should resume the guest or
exit to userspace, but at some point KVM needs to know whether or not an entry
has been installed.  We could work around that by adding a hook to "set
attributes" to populate S-EPT entries, but I'm not sure that would yield a
simpler implementation without sacrificing performance.

Thinking more about this, I do believe there is a simpler approach than freezing
S-EPT entries.  If we tweak the TDP MMU rules for TDX to require mmu_lock be held
for write when overwriting a present SPTE (with a live VM), then the problematic
cases go away.  As you point out, all of the restrictions on TDX private memory
mean we're most of the way there.  Explicit unmap flows already take mmu_lock for
write, zapping/splitting for dirty logging isn't a thing (yet), and the S-EPT
root is fixed, i.e. kvm_tdp_mmu_put_root() should never free S-EPT tables until
the VM is dead.

In other words, the only problematic flow is kvm_tdp_mmu_map().  Specifically,
KVM will overwrite a present SPTE only to demote/promote an existing entry, i.e.
to split a hugepage or collapse into a hugepage.  And I believe it's not too
difficult to ensure KVM never does in-place promotion/demotion:

 - Private memory doesn't rely on host userspace page tables, i.e. KVM won't try
   to promote to a hugepage because a hugepage was created in the host.

 - Private memory doesn't (yet) support page migration, so again KVM won't try
   to promote because a hugepage is suddenly possible.

 - Shadow paging isn't compatible, i.e. disallow_lpage won't change dynamically
   except for when there are mixed attributes, and toggling private/shared in
   the attributes requires zapping with mmu_lock held for write.

 - The idiotic non-coherent DMA + guest MTRR interaction can be disallowed
   (which I assume is already the case for TDX).

 - Disallow toggling the NX hugepage mitigation when TDX is allowed.

Races to install leaf SPTEs will yield a "loser" either on the spurious check in
tdp_mmu_map_handle_target_level():

	if (new_spte == iter->old_spte)
		ret = RET_PF_SPURIOUS;

or on the CMPXCHG in tdp_mmu_set_spte_atomic().

So I _think_ the race I described can't actually happen in practice, and if it
does happen, then we have a bug somewhere else.  Though that raises the question
of why the proposed patches do the freeze+unfreeze.  If it's just the VM destroy
case, then that should be easy enough to special case.

Intel folks,

Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
code?  Is there something I'm forgetting/missing, or is it possible we can go
with a simpler implementation?
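P.S. In case it helps frame the question, here is a standalone toy of the
"loser on the CMPXCHG" behavior described above (C11 atomics, simplified and
made-up return values/names, not the kernel code):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

enum { RET_PF_FIXED, RET_PF_SPURIOUS, RET_PF_RETRY };

static int try_install_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
			    uint64_t new_spte)
{
	/* Spurious: someone else already installed exactly what we wanted. */
	if (new_spte == old_spte)
		return RET_PF_SPURIOUS;

	/* Loser on the CMPXCHG: the SPTE changed underneath us, so retry. */
	if (!atomic_compare_exchange_strong(sptep, &old_spte, new_spte))
		return RET_PF_RETRY;

	return RET_PF_FIXED;
}

int main(void)
{
	_Atomic uint64_t spte = 0;
	uint64_t snapshot = 0;	/* both racing faults saw an empty SPTE */

	/* Prints 0 (FIXED) then 2 (RETRY): exactly one installer wins. */
	printf("first:  %d\n", try_install_spte(&spte, snapshot, 0x1007));
	printf("second: %d\n", try_install_spte(&spte, snapshot, 0x2007));
	return 0;
}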