From: Ben Gardon
Date: Mon, 28 Mar 2022 11:07:40 -0700
Subject: Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
To: David Matlack
Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson, Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid
References: <20220321224358.1305530-1-bgardon@google.com> <20220321224358.1305530-10-bgardon@google.com>

On Mon, Mar 28, 2022 at 10:45 AM David Matlack wrote:
>
> On Mon, Mar 21, 2022 at 03:43:58PM -0700, Ben Gardon wrote:
> > When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> > mapping memory in the relevant memslot. This is very slow. Doing the zaps
> > under the mmu read lock requires a TLB flush for every zap and the
> > zapping causes a storm of EPT/NPT violations.
> >
> > Instead of zapping, replace the split large pages with large page
> > mappings directly. While this sort of operation has historically only
> > been done in the vCPU page fault handler context, refactorings earlier
> > in this series and the relative simplicity of the TDP MMU make it
> > possible here as well.
> >
> > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> > of memory per vCPU, this reduces the time required to disable dirty
> > logging from over 45 seconds to just over 1 second. It also avoids
> > provoking page faults, improving vCPU performance while disabling
> > dirty logging.
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          |  4 +-
> >  arch/x86/kvm/mmu/mmu_internal.h |  6 +++
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 73 ++++++++++++++++++++++++++++++++-
> >  3 files changed, 79 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f98111f8f8b..a99c23ef90b6 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -100,7 +100,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> >   */
> >  bool tdp_enabled = false;
> >
> > -static int max_huge_page_level __read_mostly;
> > +int max_huge_page_level;
> >  static int tdp_root_level __read_mostly;
> >  static int max_tdp_level __read_mostly;
> >
> > @@ -4486,7 +4486,7 @@ static inline bool boot_cpu_is_amd(void)
> >   * the direct page table on host, use as much mmu features as
> >   * possible, however, kvm currently does not do execution-protection.
> >   */
> > -static void
> > +void
> >  build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> >  				int shadow_root_level)
> >  {
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 1bff453f7cbe..6c08a5731fcb 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void
> > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > +				int shadow_root_level);
> > +
> > +extern int max_huge_page_level __read_mostly;
> > +
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index af60922906ef..eb8929e394ec 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >  		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> >  }
> >
> > +static bool try_promote_lpage(struct kvm *kvm,
> > +			      const struct kvm_memory_slot *slot,
> > +			      struct tdp_iter *iter)
>
> Use "huge_page" instead of "lpage" to be consistent with eager page
> splitting and the rest of the Linux kernel. Some of the old KVM methods
> still use "lpage" and "large page", but we're slowly moving away from
> that.

Ah, good catch. Paolo, if you want me to send a v2 to address all these
comments, I can. Otherwise I'll just reply to the questions below.

> > +{
> > +	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > +	struct rsvd_bits_validate shadow_zero_check;
> > +	bool map_writable;
> > +	kvm_pfn_t pfn;
> > +	u64 new_spte;
> > +	u64 mt_mask;
> > +
> > +	/*
> > +	 * If addresses are being invalidated, don't do in-place promotion to
> > +	 * avoid accidentally mapping an invalidated address.
> > +	 */
> > +	if (unlikely(kvm->mmu_notifier_count))
> > +		return false;
>
> Why is this necessary? Seeing this makes me wonder if we need a similar
> check for eager page splitting.

This is needed here, but not in the page splitting case, because we are
potentially mapping new memory. If a page is split for dirty logging, but
then the backing transparent huge page is split for some reason, we could
race with the THP split. Since we're mapping the entire huge page, this
could wind up mapping more memory than it should. Checking the MMU
notifier count prevents that.

It's not needed in the splitting case because the memory in question is
already mapped. We're essentially trying to do what the page fault
handler does, since we know that's safe and it's what replaces the zapped
page with a huge page. The page fault handler checks for MMU notifiers,
so we need to as well.
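(Purely as an illustration of the window being closed here, not code from
the patch; the helper name below is invented.)

/*
 * Roughly, the race the mmu_notifier_count check guards against:
 *
 *   promotion path                        THP split / MMU notifier
 *   --------------                        -------------------------
 *   look up pfn for iter->gfn
 *                                         invalidation begins,
 *                                         backing huge page is split
 *   install huge SPTE covering the
 *   whole 2M/1G range               <--   now maps more than is valid
 *
 * Refusing to promote while an invalidation is in flight, as the vCPU
 * fault handler does, keeps the stale huge mapping from being installed.
 */
static bool invalidation_in_progress(struct kvm *kvm)
{
	return !!kvm->mmu_notifier_count;
}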
> > + */ > > + if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn, > > + pfn, PG_LEVEL_NUM)) > > + return false; > > + > > + build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level); > > + > > + /* > > + * In some cases, a vCPU pointer is required to get the MT mask, > > + * however in most cases it can be generated without one. If a > > + * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail. > > + * In that case, bail on in-place promotion. > > + */ > > + if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn, > > + kvm_is_mmio_pfn(pfn), > > + &mt_mask))) > > + return false; > > + > > + __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true, > > + map_writable, mt_mask, &shadow_zero_check, &new_spte); > > + > > + if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte)) > > + return true; > > + > > + /* Re-read the SPTE as it must have been changed by another thread. */ > > + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); > > Huge page promotion could be retried in this case. That's true, but retries always get complicated since we need to guarantee forward progress and then you get into counting retries and it adds complexity. Given how rare this race should be, I'm inclined to just let it fall back to zapping the spte. > > > + > > + return false; > > +} > > + > > /* > > * Clear leaf entries which could be replaced by large mappings, for > > * GFNs within the slot. > > @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm, > > if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) > > continue; > > > > - if (!is_shadow_present_pte(iter.old_spte) || > > - !is_last_spte(iter.old_spte, iter.level)) > > + if (iter.level > max_huge_page_level || > > + iter.gfn < slot->base_gfn || > > + iter.gfn >= slot->base_gfn + slot->npages) > > I feel like I've been seeing this "does slot contain gfn" calculation a > lot in recent commits. It's probably time to create a helper function. > No need to do this clean up as part of your series though, unless you > want to :). > > > + continue; > > + > > + if (!is_shadow_present_pte(iter.old_spte)) > > + continue; > > + > > + /* Try to promote the constitutent pages to an lpage. */ > > + if (!is_last_spte(iter.old_spte, iter.level) && > > + try_promote_lpage(kvm, slot, &iter)) > > continue; > > If iter.old_spte is not a leaf, the only loop would always continue to > the next SPTE. Now we try to promote it and if that fails we run through > the rest of the loop. This seems broken. For example, in the next line > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN > of the TDP MMU page table?) and treat that as the PFN backing this GFN, > which is wrong. > > In the worst case we end up zapping an SPTE that we didn't need to, but > we should still fix up this code. > > > > > pfn = spte_to_pfn(iter.old_spte); > > -- > > 2.35.1.894.gb6a874cedc-goog > >