From: Ben Gardon
Date: Thu, 12 Aug 2021 11:29:35 -0700
Subject: Re: [PATCH v2] KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML
In-Reply-To: <20210812181815.3378104-1-seanjc@google.com>
References: <20210812181815.3378104-1-seanjc@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Aug 12, 2021 at 11:18 AM Sean Christopherson wrote:
>
> Add yet another spinlock for the TDP MMU and take it when marking indirect
> shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
> nested TDP, KVM may encounter shadow pages for the TDP entries managed by
> L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
> is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
> misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
> which runs with mmu_lock held for read, not write.
>
> Lack of a critical section manifests most visibly as an underflow of
> unsync_children in clear_unsync_child_bit() due to unsync_children being
> corrupted when multiple CPUs write it without a critical section and
> without atomic operations. But underflow is the best case scenario. The
> worst case scenario is that unsync_children prematurely hits '0' and
> leads to guest memory corruption due to KVM neglecting to properly sync
> shadow pages.
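
The lost-update mechanics may not be obvious to everyone following
along, so here is a minimal userspace sketch of the same bug class.
To be clear, this is not KVM code: the plain counter stands in for
sp->unsync_children, and the threads stand in for vCPUs taking TDP MMU
page faults with mmu_lock held only for read.

/* Build with: gcc -O2 -pthread race.c */
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static volatile int unsync_children;	/* plain int, like the real field */

static void *vcpu_thread(void *arg)
{
	int i;

	for (i = 0; i < ITERS; i++)
		unsync_children++;	/* load + add + store, not atomic */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, vcpu_thread, NULL);
	pthread_create(&t2, NULL, vcpu_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/*
	 * Routinely prints less than 2 * ITERS: increments are lost when
	 * the read-modify-write sequences interleave. The same interleaving
	 * on unsync_children is what underflows the count in
	 * clear_unsync_child_bit() or lets it hit '0' prematurely.
	 */
	printf("expected %d, got %d\n", 2 * ITERS, unsync_children);
	return 0;
}
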
> Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
> would functionally be ok. Usurping the lock could degrade performance when
> building upper level page tables on different vCPUs, especially since the
> unsync flow could hold the lock for a comparatively long time depending on
> the number of indirect shadow pages and the depth of the paging tree.
>
> For simplicity, take the lock for all MMUs, even though KVM could fairly
> easily know that mmu_lock is held for write. If mmu_lock is held for
> write, there cannot be contention for the inner spinlock, and marking
> shadow pages unsync across multiple vCPUs will be slow enough that
> bouncing the kvm_arch cacheline should be in the noise.
>
> Note, even though L2 could theoretically be given access to its own EPT
> entries, a nested MMU must hold mmu_lock for write and thus cannot race
> against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
> be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
> that is running with the TDP MMU enabled. Holding mmu_lock for read also
> prevents the indirect shadow page from being freed. But as above, keep
> it simple and always take the lock.
>
> Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
> effectively disable unsync behavior for nested TDP. Write protecting leaf
> shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
> VMMs typically don't modify TDP entries, but the same may not hold true for
> non-standard use cases and/or VMMs that are migrating physical pages (from
> L1's perspective).
>
> Alternative #2, the unsync logic could be made thread safe. In theory,
> simply converting all relevant kvm_mmu_page fields to atomics and using
> atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
> would be required, (b) the code churn would be substantial, and (c) legacy
> shadow paging would incur additional atomic operations in performance
> sensitive paths for no benefit (to legacy shadow paging).
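
Agreed that alternative #2 isn't worth pursuing. To put some code behind
points (b) and (c): even the narrow slice of that conversion covering
only the child count would look roughly like the below. This is an
illustrative sketch, not proposed code; the full conversion would also
need atomic bitops on unsync_child_bitmap and an audit of every reader
of the converted fields.

	/* In struct kvm_mmu_page: */
	atomic_t unsync_children;	/* was: unsigned int */

	static void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
	{
		/*
		 * atomic_dec_return() keeps the underflow check meaningful
		 * even with concurrent updaters.
		 */
		WARN_ON(atomic_dec_return(&sp->unsync_children) < 0);
		/* Atomic clear_bit() instead of non-atomic __clear_bit(). */
		clear_bit(idx, sp->unsync_child_bitmap);
	}

And every legacy shadow paging user would eat those atomics even though
only the mmu_lock-for-read TDP MMU path can actually race, so (c) alone
is a good reason to prefer the spinlock.
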
> Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
> Cc: stable@vger.kernel.org
> Cc: Ben Gardon
> Signed-off-by: Sean Christopherson
> ---
>  Documentation/virt/kvm/locking.rst |  8 ++++----
>  arch/x86/include/asm/kvm_host.h    |  7 +++++++
>  arch/x86/kvm/mmu/mmu.c             | 28 ++++++++++++++++++++++++++++
>  3 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
> index 8138201efb09..5d27da356836 100644
> --- a/Documentation/virt/kvm/locking.rst
> +++ b/Documentation/virt/kvm/locking.rst
> @@ -31,10 +31,10 @@ On x86:
>
>  - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
>
> -- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock is
> -  taken inside kvm->arch.mmu_lock, and cannot be taken without already
> -  holding kvm->arch.mmu_lock (typically with ``read_lock``, otherwise
> -  there's no need to take kvm->arch.tdp_mmu_pages_lock at all).
> +- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and
> +  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
> +  cannot be taken without already holding kvm->arch.mmu_lock (typically with
> +  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
>
>  Everything else is a leaf: no other lock is taken inside the critical
>  sections.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 20daaf67a5bf..cf32b87b6bd3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1036,6 +1036,13 @@ struct kvm_arch {
>  	struct list_head lpage_disallowed_mmu_pages;
>  	struct kvm_page_track_notifier_node mmu_sp_tracker;
>  	struct kvm_page_track_notifier_head track_notifier_head;
> +	/*
> +	 * Protects marking pages unsync during page faults, as TDP MMU page
> +	 * faults only take mmu_lock for read. For simplicity, the unsync
> +	 * pages lock is always taken when marking pages unsync regardless of
> +	 * whether mmu_lock is held for read or write.
> +	 */
> +	spinlock_t mmu_unsync_pages_lock;
>
>  	struct list_head assigned_dev_head;
>  	struct iommu_domain *iommu_domain;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a272ccbddfa1..cef526dac730 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2596,6 +2596,7 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
>  {
>  	struct kvm_mmu_page *sp;
> +	bool locked = false;
>
>  	/*
>  	 * Force write-protection if the page is being tracked. Note, the page

It might also be worth adding a note about how it's safe to use
for_each_gfn_indirect_valid_sp under the MMU read lock because
mmu_page_hash is only modified with the lock held for write.

> @@ -2618,9 +2619,34 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
>  		if (sp->unsync)
>  			continue;
>
> +		/*
> +		 * TDP MMU page faults require an additional spinlock as they
> +		 * run with mmu_lock held for read, not write, and the unsync
> +		 * logic is not thread safe. Take the spinlock regardless of
> +		 * the MMU type to avoid extra conditionals/parameters, there's
> +		 * no meaningful penalty if mmu_lock is held for write.
> +		 */
> +		if (!locked) {
> +			locked = true;
> +			spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
> +
> +			/*
> +			 * Recheck after taking the spinlock, a different vCPU
> +			 * may have since marked the page unsync. A false
> +			 * positive on the unprotected check above is not
> +			 * possible as clearing sp->unsync _must_ hold mmu_lock
> +			 * for write, i.e. unsync cannot transition from 0->1
> +			 * while this CPU holds mmu_lock for read (or write).
> +			 */
> +			if (READ_ONCE(sp->unsync))
> +				continue;
> +		}
> +
>  		WARN_ON(sp->role.level != PG_LEVEL_4K);
>  		kvm_unsync_page(vcpu, sp);
>  	}
> +	if (locked)
> +		spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
>
>  	/*
>  	 * We need to ensure that the marking of unsync pages is visible
> @@ -5604,6 +5630,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>  {
>  	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
>
> +	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
> +
>  	if (!kvm_mmu_init_tdp_mmu(kvm))
>  		/*
>  		 * No smp_load/store wrappers needed here as we are in
> --
> 2.33.0.rc1.237.g0d66db33f3-goog
>
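
One more thought on the comment I suggested above: here's a strawman for
the wording, sitting on the hash walk in mmu_try_to_unsync_pages() (take
it or leave it, the exact phrasing isn't important):

	/*
	 * Walking mmu_page_hash with mmu_lock held for read (the TDP MMU
	 * page fault path) is safe because entries are only added to and
	 * removed from the hash with mmu_lock held for write, so the walk
	 * cannot observe the hash in a half-updated state.
	 */
	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
		...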