2023-04-21 21:53:43

by Sean Christopherson

Subject: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.

When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPU's roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).

In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.

Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
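
To make the difference concrete, a toy userspace model (invented names,
not KVM code) of why an unloaded shadow MMU root survives but a
pre-patch TDP MMU root does not:

#include <stdbool.h>
#include <stdio.h>

struct root {
        int refcount;
        bool in_vm_table;       /* shadow MMU: per-VM hash table */
        bool freed;
};

static void put_root(struct root *r)
{
        if (--r->refcount)
                return;
        /* Shadow MMU: page stays in the per-VM table, reusable later. */
        if (r->in_vm_table)
                return;
        /* TDP MMU (pre-patch): last put frees the root "immediately". */
        r->freed = true;
}

int main(void)
{
        struct root shadow = { .refcount = 1, .in_vm_table = true };
        struct root tdp = { .refcount = 1 };

        /* vCPU unloads its roots, e.g. while emulating SMI+RSM. */
        put_root(&shadow);
        put_root(&tdp);

        printf("shadow freed=%d, tdp freed=%d\n", shadow.freed, tdp.freed);
        return 0;
}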

The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).

The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.

Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):

1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode

And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
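
Purely to illustrate the counting, a throwaway snippet (simplified role
bits, not KVM's actual union kvm_mmu_page_role; assumes, per the list
above, that SMM and guest_mode are mutually exclusive):

#include <stdio.h>

int main(void)
{
        int n = 0;

        for (int level = 4; level <= 5; level++)
                for (int smm = 0; smm <= 1; smm++)
                        for (int guest_mode = 0; guest_mode <= 1; guest_mode++) {
                                if (smm && guest_mode)
                                        continue;       /* no L2 from SMM */
                                printf("%d-level %sSMM %sguest_mode\n", level,
                                       smm ? "" : "!", guest_mode ? "" : "!");
                                n++;
                        }

        printf("total roots: %d\n", n); /* 6 */
        return 0;
}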

Reported-by: Jeremi Piotrowski <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Cc: Ben Gardon <[email protected]>
Cc: David Matlack <[email protected]>
Cc: [email protected]
Tested-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---

David, I dropped your review as adding RCU protection is trivial from a code
perspective, but at the same time quite subtle.

I did keep Jeremi's tag, because protecting the walk with RCU during
invalidation isn't really relevant to what Jeremi was testing (performance and
core functionality).

arch/x86/kvm/mmu/tdp_mmu.c | 106 ++++++++++++++-----------------------
1 file changed, 41 insertions(+), 65 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b2fca11b91ff..9925f8e39ea6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -40,7 +40,17 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
- /* Also waits for any queued work items. */
+ /*
+ * Invalidate all roots, which besides the obvious, schedules all roots
+ * for zapping and thus puts the TDP MMU's reference to each root, i.e.
+ * ultimately frees all roots.
+ */
+ kvm_tdp_mmu_invalidate_all_roots(kvm);
+
+ /*
+ * Destroying a workqueue also first flushes the workqueue, i.e. no
+ * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
+ */
destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);

WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -116,16 +126,6 @@ static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root
queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
}

-static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
-{
- union kvm_mmu_page_role role = page->role;
- role.invalid = true;
-
- /* No need to use cmpxchg, only the invalid bit can change. */
- role.word = xchg(&page->role.word, role.word);
- return role.invalid;
-}
-
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -134,45 +134,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;

- WARN_ON(!is_tdp_mmu_page(root));
-
/*
- * The root now has refcount=0. It is valid, but readers already
- * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
- * rejects it. This remains true for the rest of the execution
- * of this function, because readers visit valid roots only
- * (except for tdp_mmu_zap_root_work(), which however
- * does not acquire any reference itself).
- *
- * Even though there are flows that need to visit all roots for
- * correctness, they all take mmu_lock for write, so they cannot yet
- * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
- * since the root still has refcount=0.
- *
- * However, tdp_mmu_zap_root can yield, and writers do not expect to
- * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
- * So the root temporarily gets an extra reference, going to refcount=1
- * while staying invalid. Readers still cannot acquire any reference;
- * but writers are now allowed to run if tdp_mmu_zap_root yields and
- * they might take an extra reference if they themselves yield.
- * Therefore, when the reference is given back by the worker,
- * there is no guarantee that the refcount is still 1. If not, whoever
- * puts the last reference will free the page, but they will not have to
- * zap the root because a root cannot go from invalid to valid.
+ * The TDP MMU itself holds a reference to each root until the root is
+ * explicitly invalidated, i.e. the final reference should never be
+ * put for a valid root.
*/
- if (!kvm_tdp_root_mark_invalid(root)) {
- refcount_set(&root->tdp_mmu_root_count, 1);
-
- /*
- * Zapping the root in a worker is not just "nice to have";
- * it is required because kvm_tdp_mmu_invalidate_all_roots()
- * skips already-invalid roots. If kvm_tdp_mmu_put_root() did
- * not add the root to the workqueue, kvm_tdp_mmu_zap_all_fast()
- * might return with some roots not zapped yet.
- */
- tdp_mmu_schedule_zap_root(kvm, root);
- return;
- }
+ KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
@@ -320,7 +287,14 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
root = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_sp(root, NULL, 0, role);

- refcount_set(&root->tdp_mmu_root_count, 1);
+ /*
+ * TDP MMU roots are kept until they are explicitly invalidated, either
+ * by a memslot update or by the destruction of the VM. Initialize the
+ * refcount to two; one reference for the vCPU, and one reference for
+ * the TDP MMU itself, which is held until the root is invalidated and
+ * is ultimately put by tdp_mmu_zap_root_work().
+ */
+ refcount_set(&root->tdp_mmu_root_count, 2);

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
@@ -946,32 +920,34 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
/*
* Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
* is about to be zapped, e.g. in response to a memslots update. The actual
- * zapping is performed asynchronously, so a reference is taken on all roots.
- * Using a separate workqueue makes it easy to ensure that the destruction is
- * performed before the "fast zap" completes, without keeping a separate list
- * of invalidated roots; the list is effectively the list of work items in
- * the workqueue.
+ * zapping is performed asynchronously. Using a separate workqueue makes it
+ * easy to ensure that the destruction is performed before the "fast zap"
+ * completes, without keeping a separate list of invalidated roots; the list is
+ * effectively the list of work items in the workqueue.
*
- * Get a reference even if the root is already invalid, the asynchronous worker
- * assumes it was gifted a reference to the root it processes. Because mmu_lock
- * is held for write, it should be impossible to observe a root with zero refcount,
- * i.e. the list of roots cannot be stale.
- *
- * This has essentially the same effect for the TDP MMU
- * as updating mmu_valid_gen does for the shadow MMU.
+ * Note, the asynchronous worker is gifted the TDP MMU's reference.
+ * See kvm_tdp_mmu_get_vcpu_root_hpa().
*/
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;

- lockdep_assert_held_write(&kvm->mmu_lock);
- list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
- if (!root->role.invalid &&
- !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
+ /*
+ * Note! mmu_lock isn't held when destroying the VM! There can't be
+ * other references to @kvm, i.e. nothing else can invalidate roots,
+ * but walking the list of roots does need to be guarded against roots
+ * being deleted by the asynchronous zap worker.
+ */
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!root->role.invalid) {
root->role.invalid = true;
tdp_mmu_schedule_zap_root(kvm, root);
}
}
+
+ rcu_read_unlock();
}

/*

base-commit: 62cf1e941a1169a5e8016fd8683d4d888ab51e01
--
2.40.0.634.g4ca3ef3211-goog
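
For reference, the root lifecycle the patch establishes, as a toy
userspace model (invented names, not KVM code; tryget() stands in for
the combined "valid and alive" check that gates reuse of a root):

#include <stdbool.h>
#include <stdio.h>

struct root {
        int refcount;
        bool invalid;
        bool freed;
};

static bool tryget(struct root *r)
{
        if (r->invalid || !r->refcount)
                return false;
        r->refcount++;
        return true;
}

static void put(struct root *r)
{
        if (!--r->refcount)
                r->freed = true;
}

int main(void)
{
        /* One reference for the vCPU, one gifted to the TDP MMU. */
        struct root r = { .refcount = 2 };

        put(&r);                                /* vCPU unloads roots     */
        printf("reusable=%d\n", tryget(&r));    /* 1: root survived       */
        put(&r);                                /* vCPU unloads again     */

        r.invalid = true;                       /* memslot update, etc.   */
        put(&r);                                /* zap worker's final put */
        printf("freed=%d\n", r.freed);          /* 1 */
        return 0;
}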


2023-04-21 22:13:51

by Sean Christopherson

Subject: Re: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

On Fri, 21 Apr 2023 14:49:46 -0700, Sean Christopherson wrote:
> Preserve TDP MMU roots until they are explicitly invalidated by gifting
> the TDP MMU itself a reference to a root when it is allocated. Keeping a
> reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
> performance, and can potentially even soft-hang a vCPU, if a vCPU
> frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
>
> When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
> and RSM, KVM unloads all of the vCPU's roots (KVM keeps a small per-vCPU
> cache of previous roots). Unloading roots is a simple way to ensure KVM
> flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
> when allocating a "new" root (from the vCPU's perspective).
>
> [...]

Replaced v1 with this version in kvm-x86 mmu. I'll omit this from the initial
pull request for 6.4 and submit it separately later on, assuming syzbot doesn't
find more holes in my logic.

David (or anyone else), feel free to provide feedback/reviews, I'll squash
any trivial changes (tags, comment tweaks, etc.) as needed. I immediately
pushed this to -next purely to get testing on the (hopefully) fixed version.

[1/1] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
https://github.com/kvm-x86/linux/commit/bf4166af027e

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes

2023-04-21 23:14:41

by David Matlack

Subject: Re: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

On Fri, Apr 21, 2023 at 2:49 PM Sean Christopherson <[email protected]> wrote:
>
> void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> {
> struct kvm_mmu_page *root;
>
> - lockdep_assert_held_write(&kvm->mmu_lock);
> - list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> - if (!root->role.invalid &&
> - !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
> + /*
> + * Note! mmu_lock isn't held when destroying the VM! There can't be
> + * other references to @kvm, i.e. nothing else can invalidate roots,
> + * but walking the list of roots does need to be guarded against roots
> + * being deleted by the asynchronous zap worker.
> + */
> + rcu_read_lock();
> +
> + list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {

I see that roots are removed from the list with list_del_rcu(), so I
agree this should be safe.

KVM could, alternatively, acquire the mmu_lock in
kvm_mmu_uninit_tdp_mmu(), which would let us keep the lockdep
assertion and drop the rcu_read_lock() + comment. That might be worth
it in case someone accidentally adds a call to
kvm_tdp_mmu_invalidate_all_roots() without mmu_lock outside of VM
teardown. kvm_mmu_uninit_tdp_mmu() is not a particularly performance
sensitive path and adding the mmu_lock wouldn't add much overhead
anyway (it would block for at most a few milliseconds waiting for the
async work to reschedule).
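
The list_del_rcu() property at play, in miniature (a single-threaded
stand-in for what RCU guarantees across threads; not the kernel's
implementation):

#include <stdio.h>

struct node {
        int id;
        struct node *next;
};

/*
 * Models list_del_rcu(): unlink @n from the list, but leave n->next
 * intact so a reader already standing on @n can keep walking forward.
 */
static void del_rcu(struct node *prev, struct node *n)
{
        prev->next = n->next;
        /* Deliberately NOT n->next = NULL; the node stays walkable
         * until a grace period elapses and it is actually freed. */
}

int main(void)
{
        struct node c = { 3, NULL };
        struct node b = { 2, &c };
        struct node a = { 1, &b };
        struct node *reader = &b;       /* reader paused on b */

        del_rcu(&a, &b);                /* writer deletes b   */

        printf("reader continues to: %d\n", reader->next->id);  /* 3 */
        printf("new walk from head:  %d\n", a.next->id);        /* 3 */
        return 0;
}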

2023-04-22 01:59:26

by Sean Christopherson

Subject: Re: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

On Fri, Apr 21, 2023, David Matlack wrote:
> On Fri, Apr 21, 2023 at 2:49 PM Sean Christopherson <[email protected]> wrote:
> >
> > void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> > {
> > struct kvm_mmu_page *root;
> >
> > - lockdep_assert_held_write(&kvm->mmu_lock);
> > - list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> > - if (!root->role.invalid &&
> > - !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
> > + /*
> > + * Note! mmu_lock isn't held when destroying the VM! There can't be
> > + * other references to @kvm, i.e. nothing else can invalidate roots,
> > + * but walking the list of roots does need to be guarded against roots
> > + * being deleted by the asynchronous zap worker.
> > + */
> > + rcu_read_lock();
> > +
> > + list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
>
> I see that roots are removed from the list with list_del_rcu(), so I
> agree this should be safe.
>
> KVM could, alternatively, acquire the mmu_lock in
> kvm_mmu_uninit_tdp_mmu(), which would let us keep the lockdep
> assertion and drop the rcu_read_lock() + comment. That might be worth
> it in case someone accidentally adds a call to
> kvm_tdp_mmu_invalidate_all_roots() without mmu_lock outside of VM
> teardown. kvm_mmu_uninit_tdp_mmu() is not a particularly performance
> sensitive path and adding the mmu_lock wouldn't add much overhead
> anyway (it would block for at most a few milliseconds waiting for the
> async work to reschedule).

Heh, I actually started to ping you off-list to specifically discuss this option,
but then decided that not waiting those few milliseconds might be worthwhile for
some use cases. I also couldn't quite convince myself that it would only be a few
milliseconds, e.g. if the worker is zapping a fully populated 5-level root, there
are no other tasks scheduled on _its_ CPU, and CONFIG_PREEMPTION=n (which neuters
rwlock_needbreak()).

The other reason I opted for not taking mmu_lock is that, with the persistent roots
approach, I don't think it's actually strictly necessary for kvm_mmu_zap_all_fast()
to invalidate roots while holding mmu_lock for write. Holding slots_lock ensures
that only a single task can be doing invalidations, versus the old model where
putting the last reference to a root could happen just about anywhere. And
allocating a new root and zapping from mmu_notifiers requires holding mmu_lock for
write, so I _think_ we could get away with holding mmu_lock for read. Maybe.

It's largely a moot point since kvm_mmu_zap_all_fast() needs to hold mmu_lock for
write anyways to play nice with the shadow MMU, i.e. I don't expect us to ever
want to pursue a change in this area. But at the same time I was struggling to
write a comment explaining why the VM destruction path "had" to take mmu_lock.

2023-04-24 23:58:21

by David Matlack

Subject: Re: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

On Fri, Apr 21, 2023 at 6:56 PM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Apr 21, 2023, David Matlack wrote:
> > On Fri, Apr 21, 2023 at 2:49 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> > > {
> > > struct kvm_mmu_page *root;
> > >
> > > - lockdep_assert_held_write(&kvm->mmu_lock);
> > > - list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> > > - if (!root->role.invalid &&
> > > - !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
> > > + /*
> > > + * Note! mmu_lock isn't held when destroying the VM! There can't be
> > > + * other references to @kvm, i.e. nothing else can invalidate roots,
> > > + * but walking the list of roots does need to be guarded against roots
> > > + * being deleted by the asynchronous zap worker.
> > > + */
> > > + rcu_read_lock();
> > > +
> > > + list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
> >
> > I see that roots are removed from the list with list_del_rcu(), so I
> > agree this should be safe.
> >
> > KVM could, alternatively, acquire the mmu_lock in
> > kvm_mmu_uninit_tdp_mmu(), which would let us keep the lockdep
> > assertion and drop the rcu_read_lock() + comment. That might be worth
> > it in case someone accidentally adds a call to
> > kvm_tdp_mmu_invalidate_all_roots() without mmu_lock outside of VM
> > teardown. kvm_mmu_uninit_tdp_mmu() is not a particularly performance
> > sensitive path and adding the mmu_lock wouldn't add much overhead
> > anyway (it would block for at most a few milliseconds waiting for the
> > async work to reschedule).
>
> Heh, I actually started to ping you off-list to specifically discuss this option,
> but then decided that not waiting those few milliseconds might be worthwhile for
> some use cases. I also couldn't quite convince myself that it would only be a few
> milliseconds, e.g. if the worker is zapping a fully populated 5-level root, there
> are no other tasks scheduled on _its_ CPU, and CONFIG_PREEMPTION=n (which neuters
> rwlock_needbreak()).

Good point. At some point we're going to have to fix that.

>
> The other reason I opted for not taking mmu_lock is that, with the persistent roots
> approach, I don't think it's actually strictly necessary for kvm_mmu_zap_all_fast()
> to invalidate roots while holding mmu_lock for write. Holding slots_lock ensures
> that only a single task can be doing invalidations, versus the old model where
> putting the last reference to a root could happen just about anywhere. And
> allocating a new root and zapping from mmu_notifiers requires holding mmu_lock for
> write, so I _think_ we could get away with holding mmu_lock for read. Maybe.
>
> It's largely a moot point since kvm_mmu_zap_all_fast() needs to hold mmu_lock for
> write anyways to play nice with the shadow MMU, i.e. I don't expect us to ever
> want to pursue a change in this area. But at the same time I was struggling to
> write a comment explaining why the VM destruction path "had" to take mmu_lock.

Yeah, probably because it really isn't necessary :). It'd be nice to keep
around the lockdep assertion though for the other (and future)
callers. The cleanest options I can think of are:

1. Pass in a bool "vm_teardown" to kvm_tdp_mmu_invalidate_all_roots() and
use that to gate the lockdep assertion.
2. Take the mmu_lock for read in kvm_mmu_uninit_tdp_mmu() and pass
down bool shared to kvm_tdp_mmu_invalidate_all_roots().

Both would satisfy your concern of not blocking teardown on the async
worker and my concern of keeping the lockdep check. I think I prefer
(1) since, as you point out, taking the mmu_lock at all is
unnecessary.
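
A rough sketch of what option (1) might look like (hypothetical
signature, nothing anyone has posted; the loop body is lifted from the
v2 patch):

void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm, bool vm_teardown)
{
        struct kvm_mmu_page *root;

        /* VM teardown is the only caller that can't hold mmu_lock. */
        if (!vm_teardown)
                lockdep_assert_held_write(&kvm->mmu_lock);

        rcu_read_lock();
        list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
                if (!root->role.invalid) {
                        root->role.invalid = true;
                        tdp_mmu_schedule_zap_root(kvm, root);
                }
        }
        rcu_read_unlock();
}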

2023-04-25 00:47:03

by Sean Christopherson

Subject: Re: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

On Mon, Apr 24, 2023, David Matlack wrote:
> It'd be nice to keep around the lockdep assertion though for the other (and
> future) callers. The cleanest options I can think of are:
>
> 1. Pass in a bool "vm_teardown" to kvm_tdp_mmu_invalidate_all_roots() and
> use that to gate the lockdep assertion.
> 2. Take the mmu_lock for read in kvm_mmu_uninit_tdp_mmu() and pass
> down bool shared to kvm_tdp_mmu_invalidate_all_roots().
>
> Both would satisfy your concern of not blocking teardown on the async
> worker and my concern of keeping the lockdep check. I think I prefer
> (1) since, as you point out, taking the mmu_lock at all is
> unnecessary.

Hmm, another option:

3. Refactor the code so that kvm_arch_init_vm() doesn't call
kvm_tdp_mmu_invalidate_all_roots() when VM creation fails, and then the lockdep
assertion can be skipped when users_count==0 without hitting the false positive.

I like (2) the least. Not sure I prefer (1) versus (3). I dislike passing bools
just to ignore lockdep, but reworking code for a "never hit in practice" edge case
is arguably worse :-/

2023-04-25 22:16:51

by David Matlack

Subject: Re: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

On Mon, Apr 24, 2023 at 05:36:37PM -0700, Sean Christopherson wrote:
> On Mon, Apr 24, 2023, David Matlack wrote:
> > It'd be nice to keep around the lockdep assertion though for the other (and
> > future) callers. The cleanest options I can think of are:
> >
> > 1. Pass in a bool "vm_teardown" to kvm_tdp_mmu_invalidate_all_roots() and
> > use that to gate the lockdep assertion.
> > 2. Take the mmu_lock for read in kvm_mmu_uninit_tdp_mmu() and pass
> > down bool shared to kvm_tdp_mmu_invalidate_all_roots().
> >
> > Both would satisfy your concern of not blocking teardown on the async
> > worker and my concern of keeping the lockdep check. I think I prefer
> > (1) since, as you point out, taking the mmu_lock at all is
> > unnecessary.
>
> Hmm, another option:
>
> 3. Refactor the code so that kvm_arch_init_vm() doesn't call
> kvm_tdp_mmu_invalidate_all_roots() when VM creation fails, and then the lockdep
> assertion can be skipped when users_count==0 without hitting the false positive.
>
> I like (2) the least. Not sure I prefer (1) versus (3). I dislike passing bools
> just to ignore lockdep, but reworking code for a "never hit in practice" edge case
> is arguably worse :-/

Agree (2) is the worst option. (3) seems potentially brittle (likely to
trigger a false-positive lockdep warning if the code ever gets
refactored back).

How about throwing some underscores at the problem?

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 649e1773baf1..3e00afc31c71 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -38,6 +38,8 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
return true;
}

+static void __kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
+
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
/*
@@ -45,7 +47,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
* for zapping and thus puts the TDP MMU's reference to each root, i.e.
* ultimately frees all roots.
*/
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ __kvm_tdp_mmu_invalidate_all_roots(kvm);

/*
* Destroying a workqueue also first flushes the workqueue, i.e. no
@@ -1004,7 +1006,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* Note, the asynchronous worker is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_get_vcpu_root_hpa().
*/
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+static void __kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;

@@ -1026,6 +1028,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
rcu_read_unlock();
}

+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ __kvm_tdp_mmu_invalidate_all_roots(kvm);
+}
+
/*
* Installs a last-level SPTE to handle a TDP page fault.
* (NPT/EPT violation/misconfiguration)

2023-04-26 02:00:05

by Sean Christopherson

Subject: Re: [PATCH v2] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

On Tue, Apr 25, 2023, David Matlack wrote:
> On Mon, Apr 24, 2023 at 05:36:37PM -0700, Sean Christopherson wrote:
> > On Mon, Apr 24, 2023, David Matlack wrote:
> > > It'd be nice to keep around the lockdep assertion though for the other (and
> > > future) callers. The cleanest options I can think of are:
> > >
> > > 1. Pass in a bool "vm_teardown" to kvm_tdp_mmu_invalidate_all_roots() and
> > > use that to gate the lockdep assertion.
> > > 2. Take the mmu_lock for read in kvm_mmu_uninit_tdp_mmu() and pass
> > > down bool shared to kvm_tdp_mmu_invalidate_all_roots().
> > >
> > > Both would satisfy your concern of not blocking teardown on the async
> > > worker and my concern of keeping the lockdep check. I think I prefer
> > > (1) since, as you point out, taking the mmu_lock at all is
> > > unnecessary.
> >
> > Hmm, another option:
> >
> > 3. Refactor the code so that kvm_arch_init_vm() doesn't call
> > kvm_tdp_mmu_invalidate_all_roots() when VM creation fails, and then the lockdep
> > assertion can be skipped when users_count==0 without hitting the false positive.
> >
> > I like (2) the least. Not sure I prefer (1) versus (3). I dislike passing bools
> > just to ignore lockdep, but reworking code for a "never hit in practice" edge case
> > is arguably worse :-/
>
> Agree (2) is the worst option. (3) seems potentially brittle (likely to
> trigger a false-positive lockdep warning if the code ever gets
> refactored back).
>
> How about throwing some underscores at the problem?

LOL, now we're speaking my language.

I think I have a better option though. The false positives on users_count can be
suppressed by gating the assert on kvm->created_vcpus. If KVM_CREATE_VM fails then
it's impossible for the VM to have created vCPUs. I like this option in particular
because it captures why it's safe for the KVM_CREATE_VM error path to run without
mmu_lock (no vCPUs == no roots).

I'll manually test this against the error path tomorrow:

        if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
            refcount_read(&kvm->users_count) && kvm->created_vcpus)
                lockdep_assert_held_write(&kvm->mmu_lock);
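
For illustration, how that check might slot into
kvm_tdp_mmu_invalidate_all_roots() together with the v2 body (a sketch,
untested):

void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
        struct kvm_mmu_page *root;

        /*
         * mmu_lock must be held for write unless the VM is being torn
         * down (users_count == 0) or KVM_CREATE_VM failed before any
         * vCPU was created (no vCPUs == no roots to race with).
         */
        if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
            refcount_read(&kvm->users_count) && kvm->created_vcpus)
                lockdep_assert_held_write(&kvm->mmu_lock);

        rcu_read_lock();
        list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
                if (!root->role.invalid) {
                        root->role.invalid = true;
                        tdp_mmu_schedule_zap_root(kvm, root);
                }
        }
        rcu_read_unlock();
}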