Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 68E2CC433F5 for ; Thu, 25 Nov 2021 04:21:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237524AbhKYEYJ (ORCPT ); Wed, 24 Nov 2021 23:24:09 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:24733 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238200AbhKYEWI (ORCPT ); Wed, 24 Nov 2021 23:22:08 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1637813937; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=1fqU39dED2qlRuJQ5y9Cv0wjsRIVNN7ti/VU1j4R6ig=; b=bcl/piL9rAXh7iifkrKJ9EFaFqFj9DBDyj0P7jXT4VaVS4XPiisfglu2MsNh3STzz0XIKM 6OJWcBUoRpH+baI5jiiPYOQhMJvy2kUhluEoHmg3OSmo2wUsSafdYuIebHxSB+4LRMso1L Go10Aq/6+kp30orhMLa1VCG5NZw6Ah0= Received: from mail-pf1-f198.google.com (mail-pf1-f198.google.com [209.85.210.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-125-btJZ2PupMuGEi-L3tdHx7g-1; Wed, 24 Nov 2021 23:18:56 -0500 X-MC-Unique: btJZ2PupMuGEi-L3tdHx7g-1 Received: by mail-pf1-f198.google.com with SMTP id q82-20020a627555000000b004a4f8cadb6fso2776435pfc.20 for ; Wed, 24 Nov 2021 20:18:56 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=1fqU39dED2qlRuJQ5y9Cv0wjsRIVNN7ti/VU1j4R6ig=; b=YTmwHD0nO40OaXRVZJQoxWWOg7MbaYkzmnCMAEo+8xRkYKvs+X2jJNDWtc53TtPUws R5x1V78IDoRz7qesTTyUzdhgjLB9UBvj1wdxcpGUlDtZPebW4/BpfyciH8cXAqB4JvQH Q4DTjIAXJxsmyoQVQIhA4U0y4+zX2+GXWcXmPDL5/7LiJ5fLQF/LxvPLXdRrme6Qoqiy gyzyG4hBGsPLl7TlEzSyz59WRvrMsl127O1s7r3SiN/Xh4xD+003dVP2TtRepyE5NZ73 s1j2fZoB7bqaj2SFTTX7LEaT0vNLXMbTyUojY8PxcXRmspUVoljHW4Fzm6HHX289Vlty eyWg== X-Gm-Message-State: AOAM530pDnsawfqIAfd3i2uOME5eeBkUwHPFDxpw5S0KE+HuspcnVVxp 4bUxSPfaNQG1FMIRFYfYiHRVY+sYRx8Xyp4pTDeYkQnT3O7+RUz2kO8k0S7Elwrb3I7RyedReH+ MWCih4+33ncqTlpJOyQ25pkMv X-Received: by 2002:a63:7c2:: with SMTP id 185mr13916242pgh.406.1637813934781; Wed, 24 Nov 2021 20:18:54 -0800 (PST) X-Google-Smtp-Source: ABdhPJwpSDP+7QIYtdhR5abUmEScVhI5NGYueeIt2u1f+jPPHQ9xVyw9uKsmDeflcrB8465BVYWbQQ== X-Received: by 2002:a63:7c2:: with SMTP id 185mr13916219pgh.406.1637813934436; Wed, 24 Nov 2021 20:18:54 -0800 (PST) Received: from xz-m1.local ([94.177.118.150]) by smtp.gmail.com with ESMTPSA id t4sm1316242pfq.163.2021.11.24.20.18.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 24 Nov 2021 20:18:53 -0800 (PST) Date: Thu, 25 Nov 2021 12:18:45 +0800 From: Peter Xu To: Ben Gardon Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini , Sean Christopherson , Peter Shier , David Matlack , Mingwei Zhang , Yulei Zhang , Wanpeng Li , Xiao Guangrong , Kai Huang , Keqian Zhu , David Hildenbrand Subject: Re: [PATCH 15/15] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Message-ID: References: <20211115234603.2908381-1-bgardon@google.com> <20211115234603.2908381-16-bgardon@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20211115234603.2908381-16-bgardon@google.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Ben, On Mon, Nov 15, 2021 at 03:46:03PM -0800, Ben Gardon wrote: > When disabling dirty logging, the TDP MMU currently zaps each leaf entry > mapping memory in the relevant memslot. This is very slow. Doing the zaps > under the mmu read lock requires a TLB flush for every zap and the > zapping causes a storm of ETP/NPT violations. > > Instead of zapping, replace the split large pages with large page > mappings directly. While this sort of operation has historically only > been done in the vCPU page fault handler context, refactorings earlier > in this series and the relative simplicity of the TDP MMU make it > possible here as well. Thanks for this patch, it looks very useful. I've got a few comments below, but before that I've also got one off-topic question too; it'll be great if you can help answer. When I was looking into how the old code recovers the huge pages I found that we'll leave the full-zero pgtable page there until the next page fault, then I _think_ it'll be released only until the __handle_changed_spte() when we're dropping the old spte (handle_removed_tdp_mmu_page). As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC current thread should have exclusive ownership of this orphaned and abandoned pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the atomic operations and REMOVED_SPTE tricks to protect from concurrent access? Since that's cmpxchg-ed out of the old pgtable, what can be accessing it besides the current thread? > > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G > of memory per vCPU, this reduces the time required to disable dirty > logging from over 45 seconds to just over 1 second. It also avoids > provoking page faults, improving vCPU performance while disabling > dirty logging. > > > Signed-off-by: Ben Gardon > --- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/mmu/mmu_internal.h | 4 ++ > arch/x86/kvm/mmu/tdp_mmu.c | 69 ++++++++++++++++++++++++++++++++- > 3 files changed, 72 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index ef7a84422463..add724aa9e8c 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void) > * the direct page table on host, use as much mmu features as > * possible, however, kvm currently does not do execution-protection. > */ > -static void > +void > build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check, > int shadow_root_level) > { > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > index 6563cce9c438..84d439432acf 100644 > --- a/arch/x86/kvm/mmu/mmu_internal.h > +++ b/arch/x86/kvm/mmu/mmu_internal.h > @@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); > void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); > void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); > > +void > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check, > + int shadow_root_level); > + > #endif /* __KVM_X86_MMU_INTERNAL_H */ > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 43c7834b4f0a..b15c8cd11cf9 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm, > clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot); > } > > +static void try_promote_lpage(struct kvm *kvm, > + const struct kvm_memory_slot *slot, > + struct tdp_iter *iter) > +{ > + struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep); > + struct rsvd_bits_validate shadow_zero_check; > + /* > + * Since the TDP MMU doesn't manage nested PTs, there's no need to > + * write protect for a nested VM when PML is in use. > + */ > + bool ad_need_write_protect = false; Shall we just pass in "false" in make_spte() and just move the comment there? > + bool map_writable; > + kvm_pfn_t pfn; > + u64 new_spte; > + u64 mt_mask; > + > + /* > + * If addresses are being invalidated, don't do in-place promotion to > + * avoid accidentally mapping an invalidated address. > + */ > + if (unlikely(kvm->mmu_notifier_count)) > + return; > + > + pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true, > + &map_writable, NULL); Should we better check pfn validity and bail out otherwise? E.g. for atomic I think we can also get KVM_PFN_ERR_FAULT when fast-gup failed somehow. > + > + /* > + * Can't reconstitute an lpage if the consituent pages can't be > + * mapped higher. > + */ > + if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn, > + pfn, PG_LEVEL_NUM)) > + return; > + > + build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level); > + > + /* > + * In some cases, a vCPU pointer is required to get the MT mask, > + * however in most cases it can be generated without one. If a > + * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail. > + * In that case, bail on in-place promotion. > + */ > + if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn, > + kvm_is_mmio_pfn(pfn), > + &mt_mask))) > + return; > + > + make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true, > + map_writable, ad_need_write_protect, mt_mask, > + &shadow_zero_check, &new_spte); > + > + tdp_mmu_set_spte_atomic(kvm, iter, new_spte); > + > + /* > + * Re-read the SPTE to avoid recursing into one of the removed child > + * page tables. > + */ > + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); Is this redundant since it seems the iterator logic handles this already, I'm reading try_step_down() here: /* * Reread the SPTE before stepping down to avoid traversing into page * tables that are no longer linked from this entry. */ iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); The rest looks good to me, thanks. > +} > + > /* > * Clear leaf entries which could be replaced by large mappings, for > * GFNs within the slot. > @@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm, > if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) > continue; > > - if (!is_shadow_present_pte(iter.old_spte) || > - !is_last_spte(iter.old_spte, iter.level)) > + if (!is_shadow_present_pte(iter.old_spte)) > + continue; > + > + /* Try to promote the constitutent pages to an lpage. */ > + if (!is_last_spte(iter.old_spte, iter.level)) { > + try_promote_lpage(kvm, slot, &iter); > continue; > + } > > pfn = spte_to_pfn(iter.old_spte); > if (kvm_is_reserved_pfn(pfn) || > -- > 2.34.0.rc1.387.gb447b232ab-goog > -- Peter Xu