Date: Thu, 3 Mar 2022 21:19:43 +0000
From: Mingwei Zhang
To: Sean Christopherson
Cc: Paolo Bonzini, Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
    Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
    David Hildenbrand, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    David Matlack, Ben Gardon
Subject: Re: [PATCH v3 15/28] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
References: <20220226001546.360188-1-seanjc@google.com> <20220226001546.360188-16-seanjc@google.com>
In-Reply-To: <20220226001546.360188-16-seanjc@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Add a dedicated helper for zapping a TDP MMU root, and use it in the three
> flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs
> are zapped (zapping an entire root is safe if and only if it cannot be in
> use by any vCPU). Because a TLB flush is never required, unconditionally
> pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.
>
> Opportunistically document why KVM must not yield when zapping roots that
> are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has
> reached zero, and further harden the flow to detect improper KVM behavior
> with respect to roots that are supposed to be unreachable.
>
> In addition to hardening zapping of roots, isolating zapping of roots
> will allow future simplification of zap_gfn_range() by having it zap only
> leaf SPTEs, and by removing its tricky "zap all" heuristic. By having
> all paths that truly need to free _all_ SPs flow through the dedicated
> root zapper, the generic zapper can be freed of those concerns.
>
> Signed-off-by: Sean Christopherson
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 98 +++++++++++++++++++++++++++++++-------
>  1 file changed, 82 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 87706e9cc6f3..c5df9a552470 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -56,10 +56,6 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	rcu_barrier();
>  }
>
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> -			  bool shared);
> -
>  static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
>  	free_page((unsigned long)sp->spt);
> @@ -82,6 +78,9 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  	tdp_mmu_free_sp(sp);
>  }
>
> +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +			     bool shared);
> +
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared)
>  {
> @@ -104,7 +103,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	 * intermediate paging structures, that may be zapped, as such entries
>  	 * are associated with the ASID on both VMX and SVM.
>  	 */
> -	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> +	tdp_mmu_zap_root(kvm, root, shared);
>
>  	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
> @@ -751,6 +750,76 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
>  	return iter->yielded;
>  }
>
> +static inline gfn_t tdp_mmu_max_gfn_host(void)
> +{
> +	/*
> +	 * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
> +	 * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
> +	 * and so KVM will never install a SPTE for such addresses.
> +	 */
> +	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> +}
> +
> +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +			     bool shared)
> +{
> +	bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
> +	struct tdp_iter iter;
> +
> +	gfn_t end = tdp_mmu_max_gfn_host();
> +	gfn_t start = 0;
> +
> +	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * No need to try to step down in the iterator when zapping an entire
> +	 * root, zapping an upper-level SPTE will recurse on its children.
> +	 */
> +	for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
> +retry:
> +		/*
> +		 * Yielding isn't allowed when zapping an unreachable root as
> +		 * the root won't be processed by mmu_notifier callbacks. When
> +		 * handling an unmap/release mmu_notifier command, KVM must
> +		 * drop all references to relevant pages prior to completing
> +		 * the callback. Dropping mmu_lock can result in zapping SPTEs
> +		 * for an unreachable root after a relevant callback completes,
> +		 * which leads to use-after-free as zapping a SPTE triggers
> +		 * "writeback" of dirty/accessed bits to the SPTE's associated
> +		 * struct page.
> +		 */

I have a quick question here: when the root is unreachable, we can't yield;
I understand that after reading the comment above. However, what if there
are so many SPTEs to zap that yielding would normally be required? In that
case, I guess we will get an RCU warning, which is unavoidable, right?
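To make sure I am reading this right, the case I am imagining is roughly the
following (an illustrative sketch only, condensed from the hunk above, not
code I am proposing):

	/*
	 * Sketch of the concern: for an unreachable root the loop never
	 * calls tdp_mmu_iter_cond_resched(), so neither RCU nor mmu_lock is
	 * ever dropped and the entire root is torn down inside a single RCU
	 * read-side critical section, however large the root is.
	 */
	rcu_read_lock();
	for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
		if (!is_shadow_present_pte(iter.old_spte))
			continue;
		tdp_mmu_set_spte(kvm, &iter, 0);  /* possibly a huge number of SPTEs */
	}
	rcu_read_unlock();  /* only reached once the whole walk is done */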
> +		if (!root_is_unreachable &&
> +		    tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		if (!shared) {
> +			tdp_mmu_set_spte(kvm, &iter, 0);
> +		} else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
> +			/*
> +			 * cmpxchg() shouldn't fail if the root is unreachable.
> +			 * Retry so as not to leak the page and its children.
> +			 */
> +			WARN_ONCE(root_is_unreachable,
> +				  "Contended TDP MMU SPTE in unreachable root.");
> +			goto retry;
> +		}
> +
> +		/*
> +		 * WARN if the root is invalid and is unreachable, all SPTEs
> +		 * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
> +		 * and inserting new SPTEs under an invalid root is a KVM bug.
> +		 */
> +		WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>  bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
>  	u64 old_spte;
> @@ -799,8 +868,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  gfn_t start, gfn_t end, bool can_yield, bool flush,
>  			  bool shared)
>  {
> -	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> -	bool zap_all = (start == 0 && end >= max_gfn_host);
> +	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
>  	struct tdp_iter iter;
>
>  	/*
> @@ -809,12 +877,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  	 */
>  	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
>
> -	/*
> -	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> -	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> -	 * and so KVM will never install a SPTE for such addresses.
> -	 */
> -	end = min(end, max_gfn_host);
> +	end = min(end, tdp_mmu_max_gfn_host());
>
>  	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>
> @@ -874,6 +937,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
>
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  {
> +	struct kvm_mmu_page *root;
>  	int i;
>
>  	/*
> @@ -881,8 +945,10 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  	 * is being destroyed or the userspace VMM has exited. In both cases,
>  	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>  	 */
> -	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -		(void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		for_each_tdp_mmu_root_yield_safe(kvm, root, i, false)
> +			tdp_mmu_zap_root(kvm, root, false);
> +	}
>  }
>
>  /*
> @@ -908,7 +974,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  	 * will still flush on yield, but that's a minor performance
>  	 * blip and not a functional issue.
>  	 */
> -	(void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
> +	tdp_mmu_zap_root(kvm, root, true);
>
>  	/*
>  	 * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
> --
> 2.35.1.574.g5d30c73bfb-goog
>