Received: by 2002:a05:6a10:2726:0:0:0:0 with SMTP id ib38csp5611698pxb; Mon, 28 Mar 2022 14:58:04 -0700 (PDT) X-Google-Smtp-Source: ABdhPJw1bSTmLqDoHFERaPMluKJQJajXRMC8u+fAGFsFPVPdJdbVp2KMg4Io1zyxIRz35ANPMeRB X-Received: by 2002:a17:903:244b:b0:154:2cb2:86d with SMTP id l11-20020a170903244b00b001542cb2086dmr27145243pls.123.1648504683809; Mon, 28 Mar 2022 14:58:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1648504683; cv=none; d=google.com; s=arc-20160816; b=fGoBxgrdvWqTe5wwGPG88HH0sqbay7HlS5H11PJnLHZxHVf9EXn+Si7MeFkNaHlgQW YhUqxa1WEhGU5BHwSemTRJ5IIMeyIPA7e66X0OkN42vH7jbyN54IlKT16AqQFr0XSsy5 VYiB1DMeapu2ZT8sSB2+tpuFmV/B5gjHlDV5mQ5dHZ4J+o/6LW2E4zyDeMCuNDYL86Xd 7KK96yPIvlAZBSkbfjI9uKgpsit76Fg670cxSjoo95BjgYl2vSNpTPnHRfuIYLC9gbnU A2pRBybCELueRCxDwRv0Q/H6kHlvD3yEO5EFuiLvQPfQFVtucEOpHEd2Le47q65zX2ln GRng== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=1vLAv1KOlCRxpDZItVyZbMpKfcQXoowxC7+HEkoBdy4=; b=DLSMrm9PTtXbPy+v+hMW0jZZenLGFJYFC9kMvFU6VcN26waoei+23hWDvtu8uzRv9Z fOEGCwRUTWa9gtd2bb0KEGKec4t7YGGrlYpy2zlI0quSxRukB1d6VEQmw/okn982occQ 4OBM9KH5SknC3jZFs0ltQU8+4WVJX3/nHea3R+J4NpHP6cb0oi8hQZ8SnYIvEp0gQnCf 2maSnWOIDXlOG9FCAdU7x99z/ekJGDsGxAABixEKqiENtCIBoM8VLOWh53MFzVSruVzb q2VSSuWi/lqq32dWF/DxyWno3vziOj2eoQEFCSRB4H4+V8MU5JnqRuzxvuKpGhFK22eX Y+lA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=o30vEfNB; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id s17-20020a63a311000000b0038228eaa3fdsi14705247pge.648.2022.03.28.14.58.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 28 Mar 2022 14:58:03 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=o30vEfNB; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id EFD3228E38; Mon, 28 Mar 2022 14:24:11 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245166AbiC1SWn (ORCPT + 99 others); Mon, 28 Mar 2022 14:22:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34702 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S244743AbiC1SWm (ORCPT ); Mon, 28 Mar 2022 14:22:42 -0400 Received: from mail-lf1-x12c.google.com (mail-lf1-x12c.google.com [IPv6:2a00:1450:4864:20::12c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0A39328E08 for ; Mon, 28 Mar 2022 11:21:00 -0700 (PDT) Received: by mail-lf1-x12c.google.com with SMTP id z12so12438857lfu.10 for ; Mon, 28 Mar 2022 11:20:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=1vLAv1KOlCRxpDZItVyZbMpKfcQXoowxC7+HEkoBdy4=; b=o30vEfNBFnnK30LBoZ01uwe+RsC8PrLrIPZyVcX3SB0+mOjLfVPG5xbF3NovYExpPI EEg5269QO2Y16wVVrAPmiihiVOftHZvuYQ8m6P7rJ3uLLZRtzJ4VnZ4PDBFedIRGFMzE JYR+5M/tKvQReQ5kuzCbYE6w7QMRxTiaAl1woRfEMmTSdr4gmrZ/tCzVVFVzOfRzubTO AFCyc+PK4iIIFptdO6tJfYBTLctLWuFXh39583b+1IqywifxH8O6t7eE4oqaoQpfptCt WygHvoIgohWd2+CEbmoJu+zg584vgqan6f+4GDVe5lYSPmVyNWyToXgUmwS/5Ve9t621 fAsw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=1vLAv1KOlCRxpDZItVyZbMpKfcQXoowxC7+HEkoBdy4=; b=Fb+Z3AOj8fGZK+H+cq32Fb2CXty6ocEpsh2HOFT+dJoZ13x4mfbIXhEzxrhJEHPRUv n+NL+3pr/pqj6T6AqxbpBQMBPzNbId8Gq1eng4nNEgqsx8JdRsb3ps30CHotu4A35Zhu FVKRtBoVi/zdUuq1tzykyshF0jb1265cIINzeeoT/j8IfYkTehHroi5mRwKg9+0If9ht 4IJkrmTsnlWQMU9EOKNjDD1RV+RljUymHCgzwWfsGTRUtWX3ffmwQEcN9KqgohsFR1cm 0zuImjbnw1ABHWL1CAezd0I3/qKZJNzHIBp0tdDnWtyLwz4pYjCF1nAG4jwDs2S/Go3A KqpA== X-Gm-Message-State: AOAM532sUjM2qV025OPR6FxZPZy9MdTaBF3T/QorjYOoCK74K40y7N9n iXM2SyIL8d4mQ64UhI5raMP/j4fqGdPE2nKd7uAehw== X-Received: by 2002:a05:6512:3341:b0:433:b033:bd22 with SMTP id y1-20020a056512334100b00433b033bd22mr21026170lfd.190.1648491656532; Mon, 28 Mar 2022 11:20:56 -0700 (PDT) MIME-Version: 1.0 References: <20220321224358.1305530-1-bgardon@google.com> <20220321224358.1305530-10-bgardon@google.com> In-Reply-To: From: David Matlack Date: Mon, 28 Mar 2022 11:20:30 -0700 Message-ID: Subject: Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging To: Ben Gardon Cc: LKML , kvm , Paolo Bonzini , Peter Xu , Sean Christopherson , Jim Mattson , David Dunn , Jing Zhang , Junaid Shahid Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.5 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE, USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 28, 2022 at 11:07 AM Ben Gardon wrote: > > On Mon, Mar 28, 2022 at 10:45 AM David Matlack wrote: > > > > On Mon, Mar 21, 2022 at 03:43:58PM -0700, Ben Gardon wrote: > > > When disabling dirty logging, the TDP MMU currently zaps each leaf entry > > > mapping memory in the relevant memslot. This is very slow. Doing the zaps > > > under the mmu read lock requires a TLB flush for every zap and the > > > zapping causes a storm of ETP/NPT violations. > > > > > > Instead of zapping, replace the split large pages with large page > > > mappings directly. While this sort of operation has historically only > > > been done in the vCPU page fault handler context, refactorings earlier > > > in this series and the relative simplicity of the TDP MMU make it > > > possible here as well. > > > > > > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G > > > of memory per vCPU, this reduces the time required to disable dirty > > > logging from over 45 seconds to just over 1 second. It also avoids > > > provoking page faults, improving vCPU performance while disabling > > > dirty logging. > > > > > > Signed-off-by: Ben Gardon > > > --- > > > arch/x86/kvm/mmu/mmu.c | 4 +- > > > arch/x86/kvm/mmu/mmu_internal.h | 6 +++ > > > arch/x86/kvm/mmu/tdp_mmu.c | 73 ++++++++++++++++++++++++++++++++- > > > 3 files changed, 79 insertions(+), 4 deletions(-) > > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > > index 6f98111f8f8b..a99c23ef90b6 100644 > > > --- a/arch/x86/kvm/mmu/mmu.c > > > +++ b/arch/x86/kvm/mmu/mmu.c > > > @@ -100,7 +100,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644); > > > */ > > > bool tdp_enabled = false; > > > > > > -static int max_huge_page_level __read_mostly; > > > +int max_huge_page_level; > > > static int tdp_root_level __read_mostly; > > > static int max_tdp_level __read_mostly; > > > > > > @@ -4486,7 +4486,7 @@ static inline bool boot_cpu_is_amd(void) > > > * the direct page table on host, use as much mmu features as > > > * possible, however, kvm currently does not do execution-protection. > > > */ > > > -static void > > > +void > > > build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check, > > > int shadow_root_level) > > > { > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > > > index 1bff453f7cbe..6c08a5731fcb 100644 > > > --- a/arch/x86/kvm/mmu/mmu_internal.h > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h > > > @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); > > > void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); > > > void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); > > > > > > +void > > > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check, > > > + int shadow_root_level); > > > + > > > +extern int max_huge_page_level __read_mostly; > > > + > > > #endif /* __KVM_X86_MMU_INTERNAL_H */ > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > > > index af60922906ef..eb8929e394ec 100644 > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > > > @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm, > > > clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot); > > > } > > > > > > +static bool try_promote_lpage(struct kvm *kvm, > > > + const struct kvm_memory_slot *slot, > > > + struct tdp_iter *iter) > > > > Use "huge_page" instead of "lpage" to be consistent with eager page > > splitting and the rest of the Linux kernel. Some of the old KVM methods > > still use "lpage" and "large page", but we're slowly moving away from > > that. > > Ah good catch. Paolo, if you want me to send a v2 to address all these > comments, I can. Otherwise I'll just reply to the questions below. > > > > > > +{ > > > + struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep); > > > + struct rsvd_bits_validate shadow_zero_check; > > > + bool map_writable; > > > + kvm_pfn_t pfn; > > > + u64 new_spte; > > > + u64 mt_mask; > > > + > > > + /* > > > + * If addresses are being invalidated, don't do in-place promotion to > > > + * avoid accidentally mapping an invalidated address. > > > + */ > > > + if (unlikely(kvm->mmu_notifier_count)) > > > + return false; > > > > Why is this necessary? Seeing this makes me wonder if we need a similar > > check for eager page splitting. > > This is needed here, but not in the page splitting case, because we > are potentially mapping new memory. > If a page is split for dirt logging, but then the backing transparent > huge page is split for some reason, we could race with the THP split. > Since we're mapping the entire huge page, this could wind up mapping > more memory than it should. Checking the MMU notifier count prevents > that. It's not needed in the splitting case because the memory in > question is already mapped. We're essentially trying to do what the > page fault handler does, since we know that's safe and it's what > replaces the zapped page with a huge page. The page fault handler > checks for MMU notifiers, so we need to as well. > > > > > + > > > + if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn || > > > + iter->gfn >= slot->base_gfn + slot->npages) > > > + return false; > > > + > > > + pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true, > > > + &map_writable, NULL); > > > + if (is_error_noslot_pfn(pfn)) > > > + return false; > > > + > > > + /* > > > + * Can't reconstitute an lpage if the consituent pages can't be > > > + * mapped higher. > > > + */ > > > + if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn, > > > + pfn, PG_LEVEL_NUM)) > > > + return false; > > > + > > > + build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level); > > > + > > > + /* > > > + * In some cases, a vCPU pointer is required to get the MT mask, > > > + * however in most cases it can be generated without one. If a > > > + * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail. > > > + * In that case, bail on in-place promotion. > > > + */ > > > + if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn, > > > + kvm_is_mmio_pfn(pfn), > > > + &mt_mask))) > > > + return false; > > > + > > > + __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true, > > > + map_writable, mt_mask, &shadow_zero_check, &new_spte); > > > + > > > + if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte)) > > > + return true; > > > + > > > + /* Re-read the SPTE as it must have been changed by another thread. */ > > > + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); > > > > Huge page promotion could be retried in this case. > > That's true, but retries always get complicated since we need to > guarantee forward progress and then you get into counting retries and > it adds complexity. There's plenty of unbounding retrying in tdp_mmu.c (search for "goto retry"). I think that' fine though. I can't imagine a scenario where a thread is blocked retrying more than a few times. > Given how rare this race should be, I'm inclined > to just let it fall back to zapping the spte. I think that's fine too. Although it'd be pretty easy to plumb by checking for -EBUSY from tdp_mmu_set_spte_atomic(). Maybe just leave a comment explaining why we don't care about going through the effort of retrying. > > > > > > + > > > + return false; > > > +} > > > + > > > /* > > > * Clear leaf entries which could be replaced by large mappings, for > > > * GFNs within the slot. > > > @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm, > > > if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) > > > continue; > > > > > > - if (!is_shadow_present_pte(iter.old_spte) || > > > - !is_last_spte(iter.old_spte, iter.level)) > > > + if (iter.level > max_huge_page_level || > > > + iter.gfn < slot->base_gfn || > > > + iter.gfn >= slot->base_gfn + slot->npages) > > > > I feel like I've been seeing this "does slot contain gfn" calculation a > > lot in recent commits. It's probably time to create a helper function. > > No need to do this clean up as part of your series though, unless you > > want to :). > > > > > + continue; > > > + > > > + if (!is_shadow_present_pte(iter.old_spte)) > > > + continue; > > > + > > > + /* Try to promote the constitutent pages to an lpage. */ > > > + if (!is_last_spte(iter.old_spte, iter.level) && > > > + try_promote_lpage(kvm, slot, &iter)) > > > continue; > > > > If iter.old_spte is not a leaf, the only loop would always continue to > > the next SPTE. Now we try to promote it and if that fails we run through > > the rest of the loop. This seems broken. For example, in the next line > > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN > > of the TDP MMU page table?) and treat that as the PFN backing this GFN, > > which is wrong. > > > > In the worst case we end up zapping an SPTE that we didn't need to, but > > we should still fix up this code. > > > > > > > > pfn = spte_to_pfn(iter.old_spte); > > > -- > > > 2.35.1.894.gb6a874cedc-goog > > >