2021-07-13 14:21:55

by Maxim Levitsky

Subject: [PATCH v2 0/8] My AVIC patch queue

Hi!

This is a series of bugfixes for AVIC dynamic inhibition, written while
trying to fix as many bugs as possible in this area and to make the
AVIC+SynIC conditional enablement work.

* Patches 1-4 address a possible mismatch between the AVIC inhibit state
and the AVIC enable state across all vCPUs.

Since the AVIC state is changed via a request, there is a window during
which the two states differ, which can lead to various warnings and
errors.

An earlier attempt fixed this by changing the AVIC enable state on the
current vCPU immediately when the AVIC inhibit request is created.
While that fixes the common case, it actually buries the issue deeper:
on all vCPUs other than the current one, the two states can still
mismatch until KVM_REQ_APICV_UPDATE is processed on each of them.

My take on this is to instead fix the places where the mismatch causes
the issues, and then drop the special case of toggling the AVIC right
away in kvm_request_apicv_update.



V2: I rewrote the commit description of the patch that touches AVIC
inhibition in the nested case.

* Patches 5-6 fix a race condition which can cause a lost write from a
guest to the APIC when the APIC write races AVIC un-inhibition, and add
a warning to catch this problem if it re-emerges.

V2: I re-implemented this with a mutex.

* Patch 7 fixes yet another issue I found in the AVIC inhibit code:

Currently avic_vcpu_load/avic_vcpu_put are called on userspace
entry/exit from KVM (aka kvm_vcpu_get/kvm_vcpu_put), and these
functions update the "is running" bit in the AVIC physical ID remap
table and update the target vCPU in the iommu code.

However, neither function does anything when AVIC is inhibited, so the
"is running" bit is left set during exit to userspace. This shouldn't
be a big issue, as the caller doesn't use the AVIC while it is
inhibited, but it is still inconsistent and can trigger a warning in
avic_vcpu_load.

To be on the safe side I think it makes sense to call
avic_vcpu_put/avic_vcpu_load when inhibiting/uninhibiting the AVIC.
This ensures that the work these functions do is always matched.



* Patch 8 is the patch from Vitaly that allows AVIC with SynIC as long
as the guest doesn't use the AutoEOI feature. I only slightly changed
it to drop the SRCU lock around the call to kvm_request_apicv_update,
and to expose the AutoEOI cpuid bit regardless of AVIC enablement.

Despite being the last patch in this series, this patch doesn't depend
on the other fixes.

Best regards,
Maxim Levitsky

Maxim Levitsky (7):
  KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to
    be deactivated
  KVM: SVM: tweak warning about enabled AVIC on nested entry
  KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl
  KVM: x86: APICv: drop immediate APICv disablement on current vCPU
  KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
  KVM: SVM: add warning for mismatch between AVIC state and AVIC access
    page state
  KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling
    AVIC

Vitaly Kuznetsov (1):
  KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in
    use

 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/hyperv.c           | 34 ++++++++++++++++----
 arch/x86/kvm/svm/avic.c         | 45 ++++++++++++++------------
 arch/x86/kvm/svm/nested.c       |  2 +-
 arch/x86/kvm/svm/svm.c          | 18 ++++++++---
 arch/x86/kvm/x86.c              | 57 ++++++++++++++++++---------------
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  1 +
 8 files changed, 103 insertions(+), 58 deletions(-)

--
2.26.3





2021-07-13 14:22:02

by Maxim Levitsky

Subject: [PATCH v2 4/8] KVM: x86: APICv: drop immediate APICv disablement on current vCPU

The special case of disabling APICv on the current vCPU right away in
kvm_request_apicv_update doesn't bring much benefit over raising
KVM_REQ_APICV_UPDATE on it instead, since the request will be processed
on the next entry to the guest.
(The comment about avoiding another #VMEXIT is wrong.)

It also hides various assumptions that the APICv enable state matches
the APICv inhibit state, as this special case only makes those states
match on the current vCPU.

Previous patches fixed a few such assumptions, so now it should be safe
to drop this special case.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 76dae88cf524..29b92f6cbad4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9204,7 +9204,6 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);
*/
void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
{
- struct kvm_vcpu *except;
unsigned long old, new, expected;

if (!kvm_x86_ops.check_apicv_inhibit_reasons ||
@@ -9230,16 +9229,7 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
if (kvm_x86_ops.pre_update_apicv_exec_ctrl)
static_call(kvm_x86_pre_update_apicv_exec_ctrl)(kvm, activate);

- /*
- * Sending request to update APICV for all other vcpus,
- * while update the calling vcpu immediately instead of
- * waiting for another #VMEXIT to handle the request.
- */
- except = kvm_get_running_vcpu();
- kvm_make_all_cpus_request_except(kvm, KVM_REQ_APICV_UPDATE,
- except);
- if (except)
- kvm_vcpu_update_apicv(except);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
}
EXPORT_SYMBOL_GPL(kvm_request_apicv_update);

--
2.26.3

2021-07-13 14:22:23

by Maxim Levitsky

Subject: [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM

Currently on SVM, kvm_request_apicv_update calls
'pre_update_apicv_exec_ctrl' without doing any synchronization,
and that function toggles the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

If there is a mismatch between that memslot state and the AVIC state
on one of the vCPUs, an APIC mmio write can be lost:

For example:

VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: write to an APIC mmio register.

Since AVIC is still disabled on VCPU1, the access will not be
intercepted by it, nor will it cause an MMIO fault; it will just update
the dummy page mapped into the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

Fix that by adding a lock guarding the AVIC state changes, and by
carefully ordering the operations of kvm_request_apicv_update to avoid
this race:

1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the apic inhibit reason
4. Release the lock

This ensures that at (2) all vCPUs are kicked out of guest mode but
don't yet see the new AVIC state. Only after (4) can all other vCPUs
update their AVIC state and resume.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 39 ++++++++++++++++++++++++---------------
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 1 +
3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29b92f6cbad4..a91e35b92447 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9180,6 +9180,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
if (!lapic_in_kernel(vcpu))
return;

+ mutex_lock(&vcpu->kvm->apicv_update_lock);
+
vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
kvm_apic_update_apicv(vcpu);
static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
@@ -9192,6 +9194,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
*/
if (!vcpu->arch.apicv_active)
kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+ mutex_unlock(&vcpu->kvm->apicv_update_lock);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);

@@ -9204,32 +9208,34 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);
*/
void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
{
- unsigned long old, new, expected;
+ unsigned long old, new;

if (!kvm_x86_ops.check_apicv_inhibit_reasons ||
!static_call(kvm_x86_check_apicv_inhibit_reasons)(bit))
return;

- old = READ_ONCE(kvm->arch.apicv_inhibit_reasons);
- do {
- expected = new = old;
- if (activate)
- __clear_bit(bit, &new);
- else
- __set_bit(bit, &new);
- if (new == old)
- break;
- old = cmpxchg(&kvm->arch.apicv_inhibit_reasons, expected, new);
- } while (old != expected);
+ mutex_lock(&kvm->apicv_update_lock);
+
+ old = new = kvm->arch.apicv_inhibit_reasons;
+ if (activate)
+ __clear_bit(bit, &new);
+ else
+ __set_bit(bit, &new);
+
+ kvm->arch.apicv_inhibit_reasons = new;

if (!!old == !!new)
- return;
+ goto out;

trace_kvm_apicv_update_request(activate, bit);
+
+ kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
+
if (kvm_x86_ops.pre_update_apicv_exec_ctrl)
static_call(kvm_x86_pre_update_apicv_exec_ctrl)(kvm, activate);

- kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
+out:
+ mutex_unlock(&kvm->apicv_update_lock);
}
EXPORT_SYMBOL_GPL(kvm_request_apicv_update);

@@ -9436,8 +9442,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
kvm_hv_process_stimers(vcpu);
- if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
+ if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) {
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
kvm_vcpu_update_apicv(vcpu);
+ vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ }
if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
kvm_check_async_pf_completion(vcpu);
if (kvm_check_request(KVM_REQ_MSR_FILTER_CHANGED, vcpu))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 37cbb56ccd09..0364d35d43dc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -524,6 +524,7 @@ struct kvm {
#endif /* KVM_HAVE_MMU_RWLOCK */

struct mutex slots_lock;
+ struct mutex apicv_update_lock;

/*
* Protects the arch-specific fields of struct kvm_memory_slots in
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed4d1581d502..ba5d5d9ebc64 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -943,6 +943,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
mutex_init(&kvm->irq_lock);
mutex_init(&kvm->slots_lock);
mutex_init(&kvm->slots_arch_lock);
+ mutex_init(&kvm->apicv_update_lock);
INIT_LIST_HEAD(&kvm->devices);

BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
--
2.26.3

2021-07-13 14:22:48

by Maxim Levitsky

Subject: [PATCH v2 7/8] KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC

Currently it is possible to have the following scenario:

1. AVIC is disabled by svm_refresh_apicv_exec_ctrl
2. svm_vcpu_blocking calls avic_vcpu_put, which does nothing
3. svm_vcpu_unblocking enables the AVIC (due to KVM_REQ_APICV_UPDATE)
   and then calls avic_vcpu_load
4. a warning is triggered in avic_vcpu_load since
   AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK was never cleared

While it is possible to just remove the warning, it seems more robust
to fully disable/enable AVIC in svm_refresh_apicv_exec_ctrl by calling
avic_vcpu_load/avic_vcpu_put.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/svm/avic.c | 43 ++++++++++++++++++++++-------------------
arch/x86/kvm/svm/svm.c | 8 ++++++--
arch/x86/kvm/x86.c | 10 ++++++++--
3 files changed, 37 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 64d1e1b6a305..0e4370fe0868 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -80,6 +80,28 @@ enum avic_ipi_failure_cause {
AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
};

+
+static void __avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
+{
+ if (is_run)
+ avic_vcpu_load(vcpu, vcpu->cpu);
+ else
+ avic_vcpu_put(vcpu);
+}
+
+/*
+ * This function is called during VCPU halt/unhalt.
+ */
+static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->avic_is_running = is_run;
+
+ if (kvm_vcpu_apicv_active(vcpu))
+ __avic_set_running(vcpu, is_run);
+}
+
/* Note:
* This function is called from IOMMU driver to notify
* SVM to schedule in a particular vCPU of a particular VM.
@@ -667,6 +689,7 @@ void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
}
vmcb_mark_dirty(vmcb, VMCB_AVIC);

+ __avic_set_running(vcpu, activated);
svm_set_pi_irte_mode(vcpu, activated);
}

@@ -960,9 +983,6 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);

- if (!kvm_vcpu_apicv_active(vcpu))
- return;
-
/*
* Since the host physical APIC id is 8 bits,
* we can support host APIC ID upto 255.
@@ -990,9 +1010,6 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
u64 entry;
struct vcpu_svm *svm = to_svm(vcpu);

- if (!kvm_vcpu_apicv_active(vcpu))
- return;
-
entry = READ_ONCE(*(svm->avic_physical_id_cache));
if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
@@ -1001,20 +1018,6 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
}

-/*
- * This function is called during VCPU halt/unhalt.
- */
-static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->avic_is_running = is_run;
- if (is_run)
- avic_vcpu_load(vcpu, vcpu->cpu);
- else
- avic_vcpu_put(vcpu);
-}
-
void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
{
avic_set_running(vcpu, false);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d14be8aba2d7..2c291dfa0e04 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1505,12 +1505,16 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
sd->current_vmcb = svm->vmcb;
indirect_branch_prediction_barrier();
}
- avic_vcpu_load(vcpu, cpu);
+
+ if (kvm_vcpu_apicv_active(vcpu))
+ avic_vcpu_load(vcpu, cpu);
}

static void svm_vcpu_put(struct kvm_vcpu *vcpu)
{
- avic_vcpu_put(vcpu);
+ if (kvm_vcpu_apicv_active(vcpu))
+ avic_vcpu_put(vcpu);
+
svm_prepare_host_switch(vcpu);

++vcpu->stat.host_state_reload;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a91e35b92447..213d332664e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9177,12 +9177,18 @@ void kvm_make_scan_ioapic_request(struct kvm *kvm)

void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
{
+ bool activate;
+
if (!lapic_in_kernel(vcpu))
return;

mutex_lock(&vcpu->kvm->apicv_update_lock);

- vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
+ activate = kvm_apicv_activated(vcpu->kvm);
+ if (vcpu->arch.apicv_active == activate)
+ goto out;
+
+ vcpu->arch.apicv_active = activate;
kvm_apic_update_apicv(vcpu);
static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);

@@ -9194,7 +9200,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
*/
if (!vcpu->arch.apicv_active)
kvm_make_request(KVM_REQ_EVENT, vcpu);
-
+out:
mutex_unlock(&vcpu->kvm->apicv_update_lock);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);
--
2.26.3

2021-07-13 14:22:52

by Maxim Levitsky

Subject: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

From: Vitaly Kuznetsov <[email protected]>

APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
however, possible to track whether the feature was actually used by the
guest and only inhibit APICv/AVIC when needed.

TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
Windows know that AutoEOI feature should be avoided. While it's up to
KVM userspace to set the flag, KVM can help a bit by exposing global
APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
is still preferred.

Maxim:
- added SRCU lock drop around call to kvm_request_apicv_update
- always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
since this feature can be used regardless of AVIC

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/hyperv.c | 34 +++++++++++++++++++++++++++------
2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e11d64aa0bcd..f900dca58af8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -956,6 +956,9 @@ struct kvm_hv {
/* How many vCPUs have VP index != vCPU index */
atomic_t num_mismatched_vp_indexes;

+ /* How many SynICs use 'AutoEOI' feature */
+ atomic_t synic_auto_eoi_used;
+
struct hv_partition_assist_pg *hv_pa_pg;
struct kvm_hv_syndbg hv_syndbg;
};
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index b07592ca92f0..6bf47a583d0e 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
return false;
}

+
+static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
+{
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
+ kvm_request_apicv_update(vcpu->kvm, activate,
+ APICV_INHIBIT_REASON_HYPERV);
+ vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+}
+
static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
int vector)
{
+ struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+ int auto_eoi_old, auto_eoi_new;
+
if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
return;

@@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
else
__clear_bit(vector, synic->vec_bitmap);

+ auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
+
if (synic_has_vector_auto_eoi(synic, vector))
__set_bit(vector, synic->auto_eoi_bitmap);
else
__clear_bit(vector, synic->auto_eoi_bitmap);
+
+ auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
+
+ /* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
+ if (!auto_eoi_old && auto_eoi_new) {
+ if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
+ synic_toggle_avic(vcpu, false);
+ } else if (!auto_eoi_new && auto_eoi_old) {
+ if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
+ synic_toggle_avic(vcpu, true);
+ }
}

static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
@@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)

synic = to_hv_synic(vcpu);

- /*
- * Hyper-V SynIC auto EOI SINT's are
- * not compatible with APICV, so request
- * to deactivate APICV permanently.
- */
- kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
synic->active = true;
synic->dont_zero_synic_pages = dont_zero_synic_pages;
synic->control = HV_SYNIC_CONTROL_ENABLE;
@@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
if (!cpu_smt_possible())
ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
+
+ ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
/*
* Default number of spinlock retry attempts, matches
* HyperV 2016.
--
2.26.3

2021-07-13 14:23:50

by Maxim Levitsky

Subject: [PATCH v2 6/8] KVM: SVM: add warning for mismatch between AVIC state and AVIC access page state

It is never a good idea to enter a guest when the AVIC state doesn't
match the state of the AVIC MMIO page.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/svm/svm.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d4d14e318ab5..d14be8aba2d7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3783,6 +3783,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)

pre_svm_run(vcpu);

+ WARN_ON_ONCE(vcpu->kvm->arch.apic_access_memslot_enabled !=
+ kvm_vcpu_apicv_active(vcpu));
+
sync_lapic_to_cr8(vcpu);

if (unlikely(svm->asid != svm->vmcb->control.asid)) {
--
2.26.3

2021-07-18 12:15:24

by Maxim Levitsky

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
> From: Vitaly Kuznetsov <[email protected]>
>
> APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
> SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
> however, possible to track whether the feature was actually used by the
> guest and only inhibit APICv/AVIC when needed.
>
> TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
> Windows know that AutoEOI feature should be avoided. While it's up to
> KVM userspace to set the flag, KVM can help a bit by exposing global
> APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
> is still preferred.

>
> Maxim:
> - added SRCU lock drop around call to kvm_request_apicv_update
> - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
> since this feature can be used regardless of AVIC
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> Signed-off-by: Maxim Levitsky <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 3 +++
> arch/x86/kvm/hyperv.c | 34 +++++++++++++++++++++++++++------
> 2 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e11d64aa0bcd..f900dca58af8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -956,6 +956,9 @@ struct kvm_hv {
> /* How many vCPUs have VP index != vCPU index */
> atomic_t num_mismatched_vp_indexes;
>
> + /* How many SynICs use 'AutoEOI' feature */
> + atomic_t synic_auto_eoi_used;
> +
> struct hv_partition_assist_pg *hv_pa_pg;
> struct kvm_hv_syndbg hv_syndbg;
> };
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index b07592ca92f0..6bf47a583d0e 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
> return false;
> }
>
> +
> +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
> +{
> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + kvm_request_apicv_update(vcpu->kvm, activate,
> + APICV_INHIBIT_REASON_HYPERV);
> + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> +}

Well, it turns out that this patch still doesn't work (this weekend I
found out that all my AVIC-enabled VMs hang on reboot).

I finally found out what prompted me back then to make the srcu lock
drop in synic_update_vector conditional on whether the write was done
by the host.

Turns out that while KVM_SET_MSRS does take the kvm->srcu lock, it
stores the returned srcu index in a local variable rather than in
vcpu->srcu_idx, so the lock drop in synic_toggle_avic doesn't work.

So it is likely that I saw it not working and wrongly blamed
KVM_SET_MSRS for not taking the srcu lock.

I am more inclined to fix this by manually tracking, on each vCPU,
whether we hold the srcu lock, just as we already track the srcu index;
kvm_request_apicv_update can then use this to drop the srcu lock when
needed.


> +
> static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
> int vector)
> {
> + struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
> + struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
> + int auto_eoi_old, auto_eoi_new;
> +
> if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> return;
>
> @@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
> else
> __clear_bit(vector, synic->vec_bitmap);
>
> + auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
> +
> if (synic_has_vector_auto_eoi(synic, vector))
> __set_bit(vector, synic->auto_eoi_bitmap);
> else
> __clear_bit(vector, synic->auto_eoi_bitmap);
> +
> + auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
> +
> + /* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
> + if (!auto_eoi_old && auto_eoi_new) {
> + if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
> + synic_toggle_avic(vcpu, false);
> + } else if (!auto_eoi_new && auto_eoi_old) {
> + if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
> + synic_toggle_avic(vcpu, true);
> + }
> }
>
> static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
> @@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
>
> synic = to_hv_synic(vcpu);
>
> - /*
> - * Hyper-V SynIC auto EOI SINT's are
> - * not compatible with APICV, so request
> - * to deactivate APICV permanently.
> - */
> - kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
> synic->active = true;
> synic->dont_zero_synic_pages = dont_zero_synic_pages;
> synic->control = HV_SYNIC_CONTROL_ENABLE;
> @@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
> ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> if (!cpu_smt_possible())
> ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
> +
> + ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;

Vitaly, I would like to hear your opinion on this change I also made to
your code. I think we should always expose HV_DEPRECATING_AEOI_RECOMMENDED
as a supported HV cpuid bit regardless of AVIC, so that qemu could set it
regardless of AVIC support in the kernel, even if this is not optimal.


Best regards,
Maxim Levitsky

> /*
> * Default number of spinlock retry attempts, matches
> * HyperV 2016.


2021-07-19 07:49:46

by Vitaly Kuznetsov

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

Maxim Levitsky <[email protected]> writes:

> On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
>> From: Vitaly Kuznetsov <[email protected]>
>>
>> APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
>> SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
>> however, possible to track whether the feature was actually used by the
>> guest and only inhibit APICv/AVIC when needed.
>>
>> TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
>> Windows know that AutoEOI feature should be avoided. While it's up to
>> KVM userspace to set the flag, KVM can help a bit by exposing global
>> APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
>> is still preferred.
>
>>
>> Maxim:
>> - added SRCU lock drop around call to kvm_request_apicv_update
>> - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
>> since this feature can be used regardless of AVIC
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> Signed-off-by: Maxim Levitsky <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 3 +++
>> arch/x86/kvm/hyperv.c | 34 +++++++++++++++++++++++++++------
>> 2 files changed, 31 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index e11d64aa0bcd..f900dca58af8 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -956,6 +956,9 @@ struct kvm_hv {
>> /* How many vCPUs have VP index != vCPU index */
>> atomic_t num_mismatched_vp_indexes;
>>
>> + /* How many SynICs use 'AutoEOI' feature */
>> + atomic_t synic_auto_eoi_used;
>> +
>> struct hv_partition_assist_pg *hv_pa_pg;
>> struct kvm_hv_syndbg hv_syndbg;
>> };
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index b07592ca92f0..6bf47a583d0e 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
>> return false;
>> }
>>
>> +
>> +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
>> +{
>> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>> + kvm_request_apicv_update(vcpu->kvm, activate,
>> + APICV_INHIBIT_REASON_HYPERV);
>> + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +}
>
> Well turns out that this patch still doesn't work (on this
> weekend I found out that all my AVIC enabled VMs hang on reboot).
>
> I finally found out what prompted me back then to make srcu lock drop
> in synic_update_vector conditional on whether the write was done
> by the host.
>
> Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
> it stores the returned srcu index in a local variable and not
> in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
> doesn't work.
>
> So it is likely that I have seen it not work, and blamed
> KVM_SET_MSRS for not taking the srcu lock which was a wrong assumption.
>
> I am more inclined to fix this by just tracking if we hold the srcu
> lock on each VCPU manually, just as we track the srcu index anyway,
> and then kvm_request_apicv_update can use this to drop the srcu
> lock when needed.
>

Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
introduce a new 'srcu_is_locked' flag?

>
>> +
>> static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>> int vector)
>> {
>> + struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
>> + struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
>> + int auto_eoi_old, auto_eoi_new;
>> +
>> if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
>> return;
>>
>> @@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>> else
>> __clear_bit(vector, synic->vec_bitmap);
>>
>> + auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
>> +
>> if (synic_has_vector_auto_eoi(synic, vector))
>> __set_bit(vector, synic->auto_eoi_bitmap);
>> else
>> __clear_bit(vector, synic->auto_eoi_bitmap);
>> +
>> + auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
>> +
>> + /* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
>> + if (!auto_eoi_old && auto_eoi_new) {
>> + if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
>> + synic_toggle_avic(vcpu, false);
>> + } else if (!auto_eoi_new && auto_eoi_old) {
>> + if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
>> + synic_toggle_avic(vcpu, true);
>> + }
>> }
>>
>> static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
>> @@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
>>
>> synic = to_hv_synic(vcpu);
>>
>> - /*
>> - * Hyper-V SynIC auto EOI SINT's are
>> - * not compatible with APICV, so request
>> - * to deactivate APICV permanently.
>> - */
>> - kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
>> synic->active = true;
>> synic->dont_zero_synic_pages = dont_zero_synic_pages;
>> synic->control = HV_SYNIC_CONTROL_ENABLE;
>> @@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>> ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>> if (!cpu_smt_possible())
>> ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
>> +
>> + ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
>
> Vitally, I would like to hear your opinion on this change I also made to your code.
> I think that we should always expose HV_DEPRECATING_AEOI_RECOMMENDED as a supported
> HV cpuid bit regardless of AVIC, so that qemu could set it regardless of AVIC
> in the kernel, even if this is not optimal.

Generally, I'm OK with the change. The meaning of the bit changes
slightly depending on whether we expose it conditionally or not.

If HV_DEPRECATING_AEOI_RECOMMENDED is always exposed it just means that
the patch in question is in, nothing else. VMM has to figure out if
APICv/AVIC is supported by some other means.

If HV_DEPRECATING_AEOI_RECOMMENDED is exposed conditionally, then VMM
can be stupid and solely rely on this information by just copying the
bit to user visible CPUIDs (e.g. QEMU's 'hv-passthrough' mode).

--
Vitaly

2021-07-19 09:01:28

by Maxim Levitsky

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Mon, 2021-07-19 at 09:47 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <[email protected]> writes:
>
> > On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
> > > From: Vitaly Kuznetsov <[email protected]>
> > >
> > > APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
> > > SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
> > > however, possible to track whether the feature was actually used by the
> > > guest and only inhibit APICv/AVIC when needed.
> > >
> > > TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
> > > Windows know that AutoEOI feature should be avoided. While it's up to
> > > KVM userspace to set the flag, KVM can help a bit by exposing global
> > > APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
> > > is still preferred.
> > > Maxim:
> > > - added SRCU lock drop around call to kvm_request_apicv_update
> > > - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
> > > since this feature can be used regardless of AVIC
> > >
> > > Signed-off-by: Vitaly Kuznetsov <[email protected]>
> > > Signed-off-by: Maxim Levitsky <[email protected]>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 3 +++
> > > arch/x86/kvm/hyperv.c | 34 +++++++++++++++++++++++++++------
> > > 2 files changed, 31 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index e11d64aa0bcd..f900dca58af8 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -956,6 +956,9 @@ struct kvm_hv {
> > > /* How many vCPUs have VP index != vCPU index */
> > > atomic_t num_mismatched_vp_indexes;
> > >
> > > + /* How many SynICs use 'AutoEOI' feature */
> > > + atomic_t synic_auto_eoi_used;
> > > +
> > > struct hv_partition_assist_pg *hv_pa_pg;
> > > struct kvm_hv_syndbg hv_syndbg;
> > > };
> > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > > index b07592ca92f0..6bf47a583d0e 100644
> > > --- a/arch/x86/kvm/hyperv.c
> > > +++ b/arch/x86/kvm/hyperv.c
> > > @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
> > > return false;
> > > }
> > >
> > > +
> > > +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
> > > +{
> > > + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> > > + kvm_request_apicv_update(vcpu->kvm, activate,
> > > + APICV_INHIBIT_REASON_HYPERV);
> > > + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > +}
> >
> > Well turns out that this patch still doesn't work (on this
> > weekend I found out that all my AVIC enabled VMs hang on reboot).
> >
> > I finally found out what prompted me back then to make srcu lock drop
> > in synic_update_vector conditional on whether the write was done
> > by the host.
> >
> > Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
> > it stores the returned srcu index in a local variable and not
> > in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
> > doesn't work.
> >
> > So it is likely that I have seen it not work, and blamed
> > KVM_SET_MSRS for not taking the srcu lock which was a wrong assumption.
> >
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
> >
>
> Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
> introduce a new 'srcu_is_locked' flag?

Well, currently the returned index value from srcu_read_lock is opaque
(and we have two SRCU implementations and both I think return small positive numbers,
but I haven't studied them in depth).

We can ask the people that maintain SRCU to reserve a number (like -1)
or so.
I'll probably first add the 'srcu_is_locked' flag though, and then
remove it in a follow-up patch if they agree.

>
> > > +
> > > static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
> > > int vector)
> > > {
> > > + struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
> > > + struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
> > > + int auto_eoi_old, auto_eoi_new;
> > > +
> > > if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> > > return;
> > >
> > > @@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
> > > else
> > > __clear_bit(vector, synic->vec_bitmap);
> > >
> > > + auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
> > > +
> > > if (synic_has_vector_auto_eoi(synic, vector))
> > > __set_bit(vector, synic->auto_eoi_bitmap);
> > > else
> > > __clear_bit(vector, synic->auto_eoi_bitmap);
> > > +
> > > + auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
> > > +
> > > + /* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
> > > + if (!auto_eoi_old && auto_eoi_new) {
> > > + if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
> > > + synic_toggle_avic(vcpu, false);
> > > + } else if (!auto_eoi_new && auto_eoi_old) {
> > > + if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
> > > + synic_toggle_avic(vcpu, true);
> > > + }
> > > }
> > >
> > > static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
> > > @@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
> > >
> > > synic = to_hv_synic(vcpu);
> > >
> > > - /*
> > > - * Hyper-V SynIC auto EOI SINT's are
> > > - * not compatible with APICV, so request
> > > - * to deactivate APICV permanently.
> > > - */
> > > - kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
> > > synic->active = true;
> > > synic->dont_zero_synic_pages = dont_zero_synic_pages;
> > > synic->control = HV_SYNIC_CONTROL_ENABLE;
> > > @@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
> > > ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> > > if (!cpu_smt_possible())
> > > ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
> > > +
> > > + ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
> >
> > Vitaly, I would like to hear your opinion on this change I also made to your code.
> > I think that we should always expose HV_DEPRECATING_AEOI_RECOMMENDED as a supported
> > HV CPUID bit regardless of AVIC, so that QEMU can set it whether or not AVIC
> > is enabled in the kernel, even if this is not optimal.
>
> Generally, I'm OK with the change. The meaning of the bit changes
> slightly depending on whether we expose it conditionally or not.
>
> If HV_DEPRECATING_AEOI_RECOMMENDED is always exposed it just means that
> the patch in question is in, nothing else. VMM has to figure out if
> APICv/AVIC is supported by some other means.
>
> If HV_DEPRECATING_AEOI_RECOMMENDED is exposed conditionally, then VMM
> can be stupid and solely rely on this information by just copying the
> bit to user visible CPUIDs (e.g. QEMU's 'hv-passthrough' mode).

I understand what you mean.
In regard to AVIC, though, 'hv-passthrough' doesn't work at all
anyway, since it also makes QEMU enable the hv-vapic bit, which
makes the guest not use the APIC....

So it seems that not QEMU but a management layer above it should
figure out that it wants AVIC, and then enable all the bits that 'should'
make it work, though AVIC might still not work for some reason.

Currently these are the flags needed to make AVIC work on Windows:

-cpu -svm,-x2apic,-hv-vapic,hv-aeoi-deprecation (I added this locally)
-global kvm-pit.lost_tick_policy=discard

Thanks,
Best regards,
Maxim Levitsky


>


2021-07-19 09:25:31

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

Maxim Levitsky <[email protected]> writes:

> On Mon, 2021-07-19 at 09:47 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <[email protected]> writes:
>>
>> > On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
>> > > From: Vitaly Kuznetsov <[email protected]>
>> > >
>> > > APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
>> > > SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
>> > > however, possible to track whether the feature was actually used by the
>> > > guest and only inhibit APICv/AVIC when needed.
>> > >
>> > > TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
>> > > Windows know that AutoEOI feature should be avoided. While it's up to
>> > > KVM userspace to set the flag, KVM can help a bit by exposing global
>> > > APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
>> > > is still preferred.
>> > > Maxim:
>> > > - added SRCU lock drop around call to kvm_request_apicv_update
>> > > - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
>> > > since this feature can be used regardless of AVIC
>> > >
>> > > Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> > > Signed-off-by: Maxim Levitsky <[email protected]>
>> > > ---
>> > > arch/x86/include/asm/kvm_host.h | 3 +++
>> > > arch/x86/kvm/hyperv.c | 34 +++++++++++++++++++++++++++------
>> > > 2 files changed, 31 insertions(+), 6 deletions(-)
>> > >
>> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> > > index e11d64aa0bcd..f900dca58af8 100644
>> > > --- a/arch/x86/include/asm/kvm_host.h
>> > > +++ b/arch/x86/include/asm/kvm_host.h
>> > > @@ -956,6 +956,9 @@ struct kvm_hv {
>> > > /* How many vCPUs have VP index != vCPU index */
>> > > atomic_t num_mismatched_vp_indexes;
>> > >
>> > > + /* How many SynICs use 'AutoEOI' feature */
>> > > + atomic_t synic_auto_eoi_used;
>> > > +
>> > > struct hv_partition_assist_pg *hv_pa_pg;
>> > > struct kvm_hv_syndbg hv_syndbg;
>> > > };
>> > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> > > index b07592ca92f0..6bf47a583d0e 100644
>> > > --- a/arch/x86/kvm/hyperv.c
>> > > +++ b/arch/x86/kvm/hyperv.c
>> > > @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
>> > > return false;
>> > > }
>> > >
>> > > +
>> > > +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
>> > > +{
>> > > + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>> > > + kvm_request_apicv_update(vcpu->kvm, activate,
>> > > + APICV_INHIBIT_REASON_HYPERV);
>> > > + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>> > > +}
>> >
>> > Well turns out that this patch still doesn't work (on this
>> > weekend I found out that all my AVIC enabled VMs hang on reboot).
>> >
>> > I finally found out what prompted me back then to make srcu lock drop
>> > in synic_update_vector conditional on whether the write was done
>> > by the host.
>> >
>> > Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
>> > it stores the returned srcu index in a local variable and not
>> > in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
>> > doesn't work.
>> >
>> > So it is likely that I have seen it not work, and blamed
>> > KVM_SET_MSRS for not taking the srcu lock which was a wrong assumption.
>> >
>> > I am more inclined to fix this by just tracking if we hold the srcu
>> > lock on each VCPU manually, just as we track the srcu index anyway,
>> > and then kvm_request_apicv_update can use this to drop the srcu
>> > lock when needed.
>> >
>>
>> Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
>> introduce a new 'srcu_is_locked' flag?
>
> Well, currently the returned index value from srcu_read_lock is opaque
> (and we have two SRCU implementations and both I think return small positive numbers,
> but I haven't studied them in depth).
>
> We can ask the people that maintain SRCU to reserve a number (like -1)
> or so.
> I'll probably first add the 'srcu_is_locked' flag though, and then
> remove it in a follow-up patch if they agree.
>

Ah, OK. BTW, I've just discovered srcu_read_lock_held() which sounds
like the function we need but unfortunately it is not.

--
Vitaly

2021-07-19 10:02:38

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Mon, 2021-07-19 at 11:23 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <[email protected]> writes:
>
> > On Mon, 2021-07-19 at 09:47 +0200, Vitaly Kuznetsov wrote:
> > > Maxim Levitsky <[email protected]> writes:
> > >
> > > > On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
> > > > > From: Vitaly Kuznetsov <[email protected]>
> > > > >
> > > > > APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
> > > > > SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
> > > > > however, possible to track whether the feature was actually used by the
> > > > > guest and only inhibit APICv/AVIC when needed.
> > > > >
> > > > > TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
> > > > > Windows know that AutoEOI feature should be avoided. While it's up to
> > > > > KVM userspace to set the flag, KVM can help a bit by exposing global
> > > > > APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
> > > > > is still preferred.
> > > > > Maxim:
> > > > > - added SRCU lock drop around call to kvm_request_apicv_update
> > > > > - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
> > > > > since this feature can be used regardless of AVIC
> > > > >
> > > > > Signed-off-by: Vitaly Kuznetsov <[email protected]>
> > > > > Signed-off-by: Maxim Levitsky <[email protected]>
> > > > > ---
> > > > > arch/x86/include/asm/kvm_host.h | 3 +++
> > > > > arch/x86/kvm/hyperv.c | 34 +++++++++++++++++++++++++++------
> > > > > 2 files changed, 31 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index e11d64aa0bcd..f900dca58af8 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -956,6 +956,9 @@ struct kvm_hv {
> > > > > /* How many vCPUs have VP index != vCPU index */
> > > > > atomic_t num_mismatched_vp_indexes;
> > > > >
> > > > > + /* How many SynICs use 'AutoEOI' feature */
> > > > > + atomic_t synic_auto_eoi_used;
> > > > > +
> > > > > struct hv_partition_assist_pg *hv_pa_pg;
> > > > > struct kvm_hv_syndbg hv_syndbg;
> > > > > };
> > > > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > > > > index b07592ca92f0..6bf47a583d0e 100644
> > > > > --- a/arch/x86/kvm/hyperv.c
> > > > > +++ b/arch/x86/kvm/hyperv.c
> > > > > @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
> > > > > return false;
> > > > > }
> > > > >
> > > > > +
> > > > > +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
> > > > > +{
> > > > > + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> > > > > + kvm_request_apicv_update(vcpu->kvm, activate,
> > > > > + APICV_INHIBIT_REASON_HYPERV);
> > > > > + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > > +}
> > > >
> > > > Well turns out that this patch still doesn't work (on this
> > > > weekend I found out that all my AVIC enabled VMs hang on reboot).
> > > >
> > > > I finally found out what prompted me back then to make srcu lock drop
> > > > in synic_update_vector conditional on whether the write was done
> > > > by the host.
> > > >
> > > > Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
> > > > it stores the returned srcu index in a local variable and not
> > > > in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
> > > > doesn't work.
> > > >
> > > > So it is likely that I have seen it not work, and blamed
> > > > KVM_SET_MSRS for not taking the srcu lock which was a wrong assumption.
> > > >
> > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > lock when needed.
> > > >
> > >
> > > Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
> > > introduce a new 'srcu_is_locked' flag?
> >
> > Well, currently the returned index value from srcu_read_lock is opaque
> > (and we have two SRCU implementations and both I think return small positive numbers,
> > but I haven't studied them in depth).
> >
> > We can ask the people that maintain SRCU to reserve a number (like -1)
> > or so.
> > I'll probably first add the 'srcu_is_locked' flag though, and then
> > remove it in a follow-up patch if they agree.
> >
>
> Ah, OK. BTW, I've just discovered srcu_read_lock_held() which sounds
> like the function we need but unfortunately it is not.

Yea, exactly this. :-(

Best regards,
Maxim Levitsky

>


2021-07-19 20:08:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> I am more inclined to fix this by just tracking if we hold the srcu
> lock on each VCPU manually, just as we track the srcu index anyway,
> and then kvm_request_apicv_update can use this to drop the srcu
> lock when needed.

The entire approach of dynamically adding/removing the memslot seems doomed to
failure, and is likely responsible for the performance issues with AVIC, e.g. a
single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
and again on re-enable.

Rather than pile on more gunk, what about special casing the APIC access page
memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
fault path skip directly to MMIO emulation without caching the MMIO info. It'd
also give us a good excuse to rename try_async_pf() :-)

If lack of MMIO caching is a performance problem, an alternative solution would
be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
clear their cache.

It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
less awful than the current memslot+SRCU mess.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f4d35289f59e..ea434d76cfb0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3767,9 +3767,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
}

-static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
- gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
- bool write, bool *writable)
+static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
+ gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
+ bool write, bool *writable, int *r)
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
bool async;
@@ -3780,13 +3780,26 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
* be zapped before KVM inserts a new MMIO SPTE for the gfn.
*/
if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
- return true;
+ goto out_retry;

- /* Don't expose private memslots to L2. */
- if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
- *pfn = KVM_PFN_NOSLOT;
- *writable = false;
- return false;
+ if (!kvm_is_visible_memslot(slot)) {
+ /* Don't expose private memslots to L2. */
+ if (is_guest_mode(vcpu)) {
+ *pfn = KVM_PFN_NOSLOT;
+ *writable = false;
+ return false;
+ }
+ /*
+ * If the APIC access page exists but is disabled, go directly
+ * to emulation without caching the MMIO access or creating a
+ * MMIO SPTE. That way the cache doesn't need to be purged
+ * when the AVIC is re-enabled.
+ */
+ if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
+ !vcpu->kvm->arch.apic_access_memslot_enabled) {
+ *r = RET_PF_EMULATE;
+ return true;
+ }
}

async = false;
@@ -3800,14 +3813,19 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
if (kvm_find_async_pf_gfn(vcpu, gfn)) {
trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
- return true;
- } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
- return true;
+ goto out_retry;
+ } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn)) {
+ goto out_retry;
+ }
}

*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
write, writable, hva);
return false;
+
+out_retry:
+ *r = RET_PF_RETRY;
+ return true;
}

static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
@@ -3839,9 +3857,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
- write, &map_writable))
- return RET_PF_RETRY;
+ if (kvm_faultin_pfn(vcpu, prefault, gfn, gpa, &pfn, &hva, write,
+ &map_writable, &r))
+ return r;

if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
return r;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 490a028ddabe..9747124b877d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -881,9 +881,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
- write_fault, &map_writable))
- return RET_PF_RETRY;
+ if (kvm_faultin_pfn(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
+ write_fault, &map_writable, &r))
+ return r;

if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
return r;

2021-07-20 09:42:45

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
>
> The entire approach of dynamically adding/removing the memslot seems doomed to
> failure, and is likely responsible for the performance issues with AVIC, e.g. a
> single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> and again on re-enable.

100% agree, especially that this approach is doomed to fail.

There are two reasons why I didn't explore the direction you propose:

(when I talked about this on IRC with Paolo, I did suggest trying to explore
this direction once)

One is that practically speaking AVIC inhibition is not something that happens often,
and in all of my tests it didn't happen after the VM had finished booting.

There are the following cases when AVIC is inhibited:

1. Configurations that are not possible to support

(PIT reinject mode, x2APIC mode, AutoEOI mode of Hyper-V SynIC)

With the theoretical exception of the SynIC AutoEOI mode, this happens
only once, when QEMU configures the VM.

In all of my tests the SynIC AutoEOI mode was also enabled once on each vCPU and
stayed that way.


2. Not-yet-supported configurations (nesting)
This will go away eventually, and it also fires only once,
at least currently.

3. Dynamically, when the guest dares to use the local APIC's virtual wire mode to
send the PIC's connected interrupts via the LAPIC. This can't be done using AVIC,
because in this mode the interrupt must not affect local APIC registers (like IRR, ISR, etc.),
so a normal EVENTINJ injection has to be done.
This sometimes requires detecting the start of an interrupt window, which is not
possible while AVIC is enabled, because AVIC makes the CPU ignore the so-called
virtual interrupts, which we inject and intercept in order to detect it.

This case only happens on Windows and in OVMF (both are silly enough to use this
mode), and thankfully Windows only uses this mode during boot.

Thus even a gross performance issue isn't a real issue yet, but it would
indeed be nice to eliminate it in case we ever need to deal with a guest that
somehow ends up enabling and disabling AVIC many times per second.

>
> Rather than pile on more gunk, what about special casing the APIC access page
> memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> also give us a good excuse to rename try_async_pf() :-)

I understand what you mean about renaming try_async_pf :-)

As for the other reason, the reason is simple: Fear ;-)
While I know KVM's MMU an order of magnitude better than I did a year ago,
I still don't have the confidence needed to add hacks to it.

I do agree that a special AVIC hack in the MMU as you propose is miles better than
the dynamic-disable hack of the AVIC memslot.

>
> If lack of MMIO caching is a performance problem, an alternative solution would
> be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> clear their cache.

In theory this can be an issue: the LAPIC MMIO ICR register is split in two,
so lack of caching could hurt performance.

With that said, a management layer (e.g. libvirt) these days always enables x2APIC,
and KVM exposes it as supported in CPUID regardless of host support for it,
as it still has better performance than unaccelerated xAPIC.

This means that a management layer can be expected to disable x2APIC only
when it enables AVIC and sets everything up so that AVIC is actually used,
and therefore no performance impact will be felt.

(The time during which AVIC could be dynamically inhibited is negligible, as I explained above.)

>
> It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> less awful than the current memslot+SRCU mess.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f4d35289f59e..ea434d76cfb0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3767,9 +3767,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> }
>
> -static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> - gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> - bool write, bool *writable)
> +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> + gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> + bool write, bool *writable, int *r)
> {
> struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> bool async;
> @@ -3780,13 +3780,26 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> */
> if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> - return true;
> + goto out_retry;
>
> - /* Don't expose private memslots to L2. */
> - if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
> - *pfn = KVM_PFN_NOSLOT;
> - *writable = false;
> - return false;
> + if (!kvm_is_visible_memslot(slot)) {
> + /* Don't expose private memslots to L2. */
> + if (is_guest_mode(vcpu)) {
> + *pfn = KVM_PFN_NOSLOT;
> + *writable = false;
> + return false;
> + }
> + /*
> + * If the APIC access page exists but is disabled, go directly
> + * to emulation without caching the MMIO access or creating a
> + * MMIO SPTE. That way the cache doesn't need to be purged
> + * when the AVIC is re-enabled.
> + */
> + if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> + !vcpu->kvm->arch.apic_access_memslot_enabled) {
> + *r = RET_PF_EMULATE;
> + return true;
> + }
> }
>
> async = false;
> @@ -3800,14 +3813,19 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> if (kvm_find_async_pf_gfn(vcpu, gfn)) {
> trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> - return true;
> - } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
> - return true;
> + goto out_retry;
> + } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn)) {
> + goto out_retry;
> + }
> }
>
> *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
> write, writable, hva);
> return false;
> +
> +out_retry:
> + *r = RET_PF_RETRY;
> + return true;
> }
>
> static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> @@ -3839,9 +3857,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> mmu_seq = vcpu->kvm->mmu_notifier_seq;
> smp_rmb();
>
> - if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
> - write, &map_writable))
> - return RET_PF_RETRY;
> + if (kvm_faultin_pfn(vcpu, prefault, gfn, gpa, &pfn, &hva, write,
> + &map_writable, &r))
> + return r;
>
> if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
> return r;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 490a028ddabe..9747124b877d 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -881,9 +881,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
> mmu_seq = vcpu->kvm->mmu_notifier_seq;
> smp_rmb();
>
> - if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> - write_fault, &map_writable))
> - return RET_PF_RETRY;
> + if (kvm_faultin_pfn(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> + write_fault, &map_writable, &r))
> + return r;
>
> if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
> return r;
>


The above approach looks very good.
Paolo, what do you think?

Best regards,
Maxim Levitsky


2021-07-22 09:17:33

by Maxim Levitsky

[permalink] [raw]
Subject: KVM's support for non default APIC base

On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
>
> The entire approach of dynamically adding/removing the memslot seems doomed to
> failure, and is likely responsible for the performance issues with AVIC, e.g. a
> single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> and again on re-enable.
>
> Rather than pile on more gunk, what about special casing the APIC access page
> memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> also give us a good excuse to rename try_async_pf() :-)
>
> If lack of MMIO caching is a performance problem, an alternative solution would
> be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> clear their cache.
>
> It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> less awful than the current memslot+SRCU mess

Hi Sean, Paolo and everyone else:

I am exploring the approach that you proposed, and I noticed that we have very inconsistent
code handling APIC base relocation for the in-kernel local APIC.
I do know that APIC base relocation is not supported, and I don't have anything against
that as long as VMs don't use the feature, but I do want the code to be consistent.

I did a study of the code that is involved in this mess and I would like to hear your opinion:

There are basically 3 modes of operation of in kernel local apic:

Regular unaccelerated local APIC:

-> The APIC MMIO base address is stored in 'apic->base_address' and updated in
kvm_lapic_set_base, which is called from the MSR write of 'MSR_IA32_APICBASE'
(by both SVM and VMX).
The value is even migrated.

-> APIC MMIO reads/writes are done from the MMU when we access an MMIO page:
vcpu_mmio_write always calls apic_mmio_write, which checks if the write is in the
apic->base_address page and, if so, forwards the write to the local APIC with the
offset within this page.

-> The APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called, and
thus must contain no guest memslots.
If the guest relocates the APIC base somewhere where we have a memslot,
the memslot will take priority, while on real hardware the LAPIC is likely to
take priority.

APICv:

-> The default APIC MMIO base (0xfee00000) is covered by a dummy page which is
allocated from QEMU's process using __x86_set_memory_region.

This is done in alloc_apic_access_page, which is called on vCPU creation
(but takes effect only once, while the memslot is not yet enabled).

-> to avoid pinning this page into qemu's memory, reference to it
is dropped in alloc_apic_access_page.
(see commit c24ae0dcd3e8695efa43e71704d1fc4bc7e29e9b)

-> kvm_arch_mmu_notifier_invalidate_range -> checks if we invalidate GPA 0xfee00000
and if so, raises KVM_REQ_APIC_PAGE_RELOAD request.

-> kvm_vcpu_reload_apic_access_page handles the KVM_REQ_APIC_PAGE_RELOAD request by calling
kvm_x86_ops.set_apic_access_page_addr which is only implemented on VMX
(vmx_set_apic_access_page_addr)

-> vmx_set_apic_access_page_addr does gfn_to_page on 0xfee00000 GPA,
and if the result is valid, writes the physical address of this page to APIC_ACCESS_ADDR vmcs field.

(This is a major difference from the AVIC - AVIC's avic_vapic_bar is a *GPA*, while APIC_ACCESS_ADDR
is a host physical address which the hypervisor is supposed to map at the APIC MMIO GPA using EPT.)

Note that if we hit an error here, we might end up with an invalid APIC_ACCESS_ADDR field.

-> Writes to the HPA of that special page (which has a GPA of 0xfee00000 and is mapped via EPT) go to
APICv or cause special VM exits: (EXIT_REASON_APIC_ACCESS, EXIT_REASON_APIC_WRITE)

* EXIT_REASON_APIC_ACCESS (which is used for the older, limited 'flexpriority' mode, which in practice only emulates TPR accesses)
actually emulates the instruction to learn the written value,
but we have a special case in vcpu_is_mmio_gpa which makes the emulation treat the access to the default
APIC base as MMIO.

* EXIT_REASON_APIC_WRITE is a trap VMexit which comes with full APICv; since it also carries an offset
qualification and the value is already in the APIC page, this info is just passed to kvm_lapic_reg_write.


-> If the APIC base is relocated, APICv isn't aware of it, and writes to the new APIC base
(assuming that we have no memslots covering it) will go through the standard APIC MMIO emulation
and *might* create a mess.

AVIC:

-> The default APIC MMIO base (0xfee00000)
is also covered by a dummy page which is allocated from qemu's process using __x86_set_memory_region
in avic_update_access_page, which is called on vCPU creation (also only once)
and from SVM's dynamic AVIC inhibition.

-> The reference to this page is not dropped, thus there is no KVM_REQ_APIC_PAGE_RELOAD handler.
I think we should do the same thing that APICv does here?

-> avic_vapic_bar is a GPA and thus contains 0xfee00000, but writes to MSR_IA32_APICBASE do update it
(via avic_update_vapic_bar, which is called from the MSR_IA32_APICBASE write in SVM code),

thus if the guest relocates the APIC base to a writable memory page, AVIC would actually happen to work
(the opposite of the stock xAPIC handling, which only works when the APIC base is in an MMIO area).

-> Writes to the GPA in avic_vapic_bar are first checked in the NPT (but the HPA written there is ignored),
and then either go to the AVIC or cause SVM_EXIT_AVIC_UNACCELERATED_ACCESS, which carries the offset of the write
in exit_info_1
(there is also SVM_EXIT_AVIC_INCOMPLETE_IPI, which is raised on some ICR writes).


As far as I know, the only good reason to relocate the APIC base is to access it from real mode,
which is not something that modern BIOSes do these days.

I vote to make it read-only (#GP on an MSR_IA32_APICBASE write when a non-default base is set and the APIC is enabled)
and remove all remnants of the support for a variable APIC base
(we already have a warning when the APIC base is set to a non-default value).


Best regards,
Maxim Levitsky


2021-07-22 17:38:04

by Maxim Levitsky

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
>
> The entire approach of dynamically adding/removing the memslot seems doomed to
> failure, and is likely responsible for the performance issues with AVIC, e.g. a
> single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> and again on re-enable.
>
> Rather than pile on more gunk, what about special casing the APIC access page
> memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> also give us a good excuse to rename try_async_pf() :-)
>
> If lack of MMIO caching is a performance problem, an alternative solution would
> be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> clear their cache.
>
> It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> less awful than the current memslot+SRCU mess.

Hi!

I am testing your approach and it actually works very well! I can't seem to break it.

Could you explain why I need to do something with kvm_{inc,dec}_notifier_count()?

I just use your patch as is, plus the changes that are needed in kvm_request_apicv_update.

On AVIC, unlike APICv, the page in this memslot is still pinned in physical memory
(AVIC lacks the code that APICv has in this regard).
That should be fixed in the future though.

Best regards,
Maxim Levitsky

>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f4d35289f59e..ea434d76cfb0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3767,9 +3767,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> }
>
> -static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> - gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> - bool write, bool *writable)
> +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> + gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> + bool write, bool *writable, int *r)
> {
> struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> bool async;
> @@ -3780,13 +3780,26 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> */
> if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> - return true;
> + goto out_retry;
>
> - /* Don't expose private memslots to L2. */
> - if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
> - *pfn = KVM_PFN_NOSLOT;
> - *writable = false;
> - return false;
> + if (!kvm_is_visible_memslot(slot)) {
> + /* Don't expose private memslots to L2. */
> + if (is_guest_mode(vcpu)) {
> + *pfn = KVM_PFN_NOSLOT;
> + *writable = false;
> + return false;
> + }
> + /*
> + * If the APIC access page exists but is disabled, go directly
> + * to emulation without caching the MMIO access or creating a
> + * MMIO SPTE. That way the cache doesn't need to be purged
> + * when the AVIC is re-enabled.
> + */
> + if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> + !vcpu->kvm->arch.apic_access_memslot_enabled) {
> + *r = RET_PF_EMULATE;
> + return true;
> + }
> }
>
> async = false;
> @@ -3800,14 +3813,19 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> if (kvm_find_async_pf_gfn(vcpu, gfn)) {
> trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> - return true;
> - } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
> - return true;
> + goto out_retry;
> + } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn)) {
> + goto out_retry;
> + }
> }
>
> *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
> write, writable, hva);
> return false;
> +
> +out_retry:
> + *r = RET_PF_RETRY;
> + return true;
> }
>
> static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> @@ -3839,9 +3857,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> mmu_seq = vcpu->kvm->mmu_notifier_seq;
> smp_rmb();
>
> - if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
> - write, &map_writable))
> - return RET_PF_RETRY;
> + if (kvm_faultin_pfn(vcpu, prefault, gfn, gpa, &pfn, &hva, write,
> + &map_writable, &r))
> + return r;
>
> if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
> return r;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 490a028ddabe..9747124b877d 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -881,9 +881,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
> mmu_seq = vcpu->kvm->mmu_notifier_seq;
> smp_rmb();
>
> - if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> - write_fault, &map_writable))
> - return RET_PF_RETRY;
> + if (kvm_faultin_pfn(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> + write_fault, &map_writable, &r))
> + return r;
>
> if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
> return r;
>


2021-07-22 19:09:13

by Sean Christopherson

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

+Ben

On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > I am more inclined to fix this by just tracking if we hold the srcu
> > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > and then kvm_request_apicv_update can use this to drop the srcu
> > > lock when needed.
> >
> > The entire approach of dynamically adding/removing the memslot seems doomed to
> > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > and again on re-enable.
> >
> > Rather than pile on more gunk, what about special casing the APIC access page
> > memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > also give us a good excuse to rename try_async_pf() :-)
> >
> > If lack of MMIO caching is a performance problem, an alternative solution would
> > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > clear their cache.
> >
> > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > less awful than the current memslot+SRCU mess.
>
> Hi!
>
> I am testing your approach and it actually works very well! I can't seem to break it.
>
> Could you explain why do I need to do something with kvm_{inc,dec}_notifier_count()) ?

Glad you asked, there's one more change needed. kvm_zap_gfn_range() currently
takes mmu_lock for read, but it needs to take mmu_lock for write for this case
(more on why below).

The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy. The
whole thing is a grey area because KVM is trying to ensure it honors the guest's
UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
same transition. So in practice there's likely no observable bug, but it also
means that taking mmu_lock for read is likely pointless, because for things to
work the guest has to serialize all running vCPUs.

Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()? It would
effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
operate under the mmu read lock"); see attached patch. And we could even bump
the notifier count in that helper, e.g. on top of the attached:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b607e8763aa2..7174058e982b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

write_lock(&kvm->mmu_lock);

+ kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
+
if (kvm_memslots_have_rmaps(kvm)) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
@@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
if (flush)
kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);

+ kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
+
write_unlock(&kvm->mmu_lock);
}




Back to Maxim's original question...

Elevating mmu_notifier_count and bumping mmu_notifier_seq will handle the case
where APICv is being disabled while a different vCPU is concurrently faulting in a
new mapping for the APIC page. E.g. it handles this race:

vCPU0                                   vCPU1
apic_access_memslot_enabled = true;
                                        #NPF on APIC
                                        apic_access_memslot_enabled==true, proceed with #NPF
apic_access_memslot_enabled = false
kvm_zap_gfn_range(APIC);
                                        __direct_map(APIC)

                                        mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate



The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
to be called, or has just finished) and/or a modified mmu_notifier_seq (after the
count was decremented).

This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
to run in parallel with the page fault handler, there's no guarantee that the
correct apic_access_memslot_enabled will be observed.

if (is_tdp_mmu_fault)
read_lock(&vcpu->kvm->mmu_lock);
else
write_lock(&vcpu->kvm->mmu_lock);

if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
goto out_unlock;

if (is_tdp_mmu_fault)
r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
pfn, prefault);
else
r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
prefault, is_tdp);


Attachments:
0001-Revert-KVM-x86-mmu-Allow-zap-gfn-range-to-operate-un.patch (4.69 kB)

2021-07-26 17:29:53

by Paolo Bonzini

Subject: Re: [PATCH v2 0/8] My AVIC patch queue

On 13/07/21 16:20, Maxim Levitsky wrote:
> Hi!
>
> This is a series of bugfixes to the AVIC dynamic inhibition, which was
> made while trying to fix bugs as much as possible, in this area and trying
> to make the AVIC+SYNIC conditional enablement work.
>
> * Patches 1-4 address an issue of possible
> mismatch between the AVIC inhibit state and AVIC enable state on all vCPUs.
>
> Since AVICs state is changed via a request there is a window during which
> the states differ which can lead to various warnings and errors.
>
> There was an earlier attempt to fix this by changing the AVIC enable state
> on the current vCPU immediately when the AVIC inhibit request is created,
> however while this fixes the common case, it actually hides the issue deeper,
> because on all other vCPUs but current one, the two states can still
> mismatch till the KVM_REQ_APICV_UPDATE is processed on each of them.
>
> My take on this is to fix the places where the mismatch causes the
> issues instead and then drop the special case of toggling the AVIC right
> away in kvm_request_apicv_update.
>
> V2: I rewrote the commit description for the patch that touches
> avic inhibition in nested case.
>
> * Patches 5-6 in this series fix a race condition which can cause
> a lost write from a guest to APIC when the APIC write races
> the AVIC un-inhibition, and add a warning to catch this problem
> if it re-emerges again.
>
> V2: I re-implemented this with a mutex in V2.
>
> * Patch 7 is an fix yet another issue I found in AVIC inhibit code:
> Currently avic_vcpu_load/avic_vcpu_put are called on userspace entry/exit
> from KVM (aka kvm_vcpu_get/kvm_vcpu_put), and these functions update the
> "is running" bit in the AVIC physical ID remap table and update the
> target vCPU in iommu code.
>
> However both of these functions don't do anything when AVIC is inhibited
> thus the "is running" bit will be kept enabled during exit to userspace.
> This shouldn't be a big issue as the caller
> doesn't use the AVIC when inhibited but still inconsistent and can trigger
> a warning about this in avic_vcpu_load.
>
> To be on the safe side I think it makes sense to call
> avic_vcpu_put/avic_vcpu_load when inhibiting/uninhibiting the AVIC.
> This will ensure that the work these functions do is matched.
>
> * Patch 8 is the patch from Vitaly about allowing AVIC with SYNC
> as long as the guest doesn’t use the AutoEOI feature. I only slightly
> changed it to drop the SRCU lock around call to kvm_request_apicv_update
> and also expose the AutoEOI cpuid bit regardless of AVIC enablement.
>
> Despite the fact that this is the last patch in this series, this patch
> doesn't depend on the other fixes.
>
> Best regards,
> Maxim Levitsky
>
> Maxim Levitsky (7):
> KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to
> be deactivated
> KVM: SVM: tweak warning about enabled AVIC on nested entry
> KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl
> KVM: x86: APICv: drop immediate APICv disablement on current vCPU
> KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
> KVM: SVM: add warning for mistmatch between AVIC state and AVIC access
> page state
> KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling
> AVIC
>
> Vitaly Kuznetsov (1):
> KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in
> use
>
> arch/x86/include/asm/kvm_host.h | 3 ++
> arch/x86/kvm/hyperv.c | 34 ++++++++++++++++----
> arch/x86/kvm/svm/avic.c | 45 ++++++++++++++------------
> arch/x86/kvm/svm/nested.c | 2 +-
> arch/x86/kvm/svm/svm.c | 18 ++++++++---
> arch/x86/kvm/x86.c | 57 ++++++++++++++++++---------------
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 1 +
> 8 files changed, 103 insertions(+), 58 deletions(-)
>

Queued patches 1-4, thanks.

Paolo

2021-07-26 22:37:07

by Paolo Bonzini

Subject: Re: [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM

On 13/07/21 16:20, Maxim Levitsky wrote:
> + mutex_lock(&vcpu->kvm->apicv_update_lock);
> +
> vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
> kvm_apic_update_apicv(vcpu);
> static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
> @@ -9246,6 +9248,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> */
> if (!vcpu->arch.apicv_active)
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
> + mutex_unlock(&vcpu->kvm->apicv_update_lock);

Does this whole piece of code need the lock/unlock? Does it work and/or
make sense to do the unlock immediately after mutex_lock()? That would make
it clearer that the mutex is being used to synchronize against the requestor.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ed4d1581d502..ba5d5d9ebc64 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -943,6 +943,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> mutex_init(&kvm->irq_lock);
> mutex_init(&kvm->slots_lock);
> mutex_init(&kvm->slots_arch_lock);
> + mutex_init(&kvm->apicv_update_lock);
> INIT_LIST_HEAD(&kvm->devices);
>
> BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>

Please add comments above fields that are protected by this lock
(anything but apicv_inhibit_reasons?), and especially move it to kvm->arch.

Paolo

2021-07-27 13:10:22

by Maxim Levitsky

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> +Ben
>
> On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > lock when needed.
> > >
> > > The entire approach of dynamically adding/removing the memslot seems doomed to
> > > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > > and again on re-enable.
> > >
> > > Rather than pile on more gunk, what about special casing the APIC access page
> > > memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> > > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > > also give us a good excuse to rename try_async_pf() :-)
> > >
> > > If lack of MMIO caching is a performance problem, an alternative solution would
> > > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > > clear their cache.
> > >
> > > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > > less awful than the current memslot+SRCU mess.
> >
> > Hi!
> >
> > I am testing your approach and it actually works very well! I can't seem to break it.
> >
> > Could you explain why do I need to do something with kvm_{inc,dec}_notifier_count()) ?
>
> Glad you asked, there's one more change needed. kvm_zap_gfn_range() currently
> takes mmu_lock for read, but it needs to take mmu_lock for write for this case
> (more way below).
>
> The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy. The
> whole thing is a grey area because KVM is trying to ensure it honors the guest's
> UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
> i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
> same transition. So in practice there's likely no observable bug, but it also
> means that taking mmu_lock for read is likely pointless, because for things to
> work the guest has to serialize all running vCPUs.
>
> Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()? It would
> effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
> operate under the mmu read lock"); see attached patch. And we could even bump
> the notifier count in that helper, e.g. on top of the attached:
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b607e8763aa2..7174058e982b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
> write_lock(&kvm->mmu_lock);
>
> + kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
> +
> if (kvm_memslots_have_rmaps(kvm)) {
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> slots = __kvm_memslots(kvm, i);
> @@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> if (flush)
> kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
>
> + kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
> +
> write_unlock(&kvm->mmu_lock);
> }
>

I understand what you mean now. I thought that I needed to change the code of
kvm_inc_notifier_count/kvm_dec_notifier_count.




>
>
>
> Back to Maxim's original question...
>
> Elevating mmu_notifier_count and bumping mmu_notifier_seq will will handle the case
> where APICv is being disabled while a different vCPU is concurrently faulting in a
> new mapping for the APIC page. E.g. it handles this race:
>
> vCPU0                                   vCPU1
> apic_access_memslot_enabled = true;
>                                         #NPF on APIC
>                                         apic_access_memslot_enabled==true, proceed with #NPF
> apic_access_memslot_enabled = false
> kvm_zap_gfn_range(APIC);
>                                         __direct_map(APIC)
>
>                                         mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate

I understand this now. I guess this can't happen with the original memslot disable,
which I guess has the needed locking and flushing to avoid this
(I didn't study the code in depth though).

>
>
>
> The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
> vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> to be called, or just finised) and/or a modified mmu_notifier_seq (after the
> count was decremented).
>
> This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
> to run in parallel with the page fault handler, there's no guarantee that the
> correct apic_access_memslot_enabled will be observed.

I understand now.

So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
Do you agree to revert the usage of the read lock?

I will post a new series using this approach very soon, since I already have
most of the code done.

Best regards,
Maxim Levitsky

>
> if (is_tdp_mmu_fault)
> read_lock(&vcpu->kvm->mmu_lock);
> else
> write_lock(&vcpu->kvm->mmu_lock);
>
> if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
> goto out_unlock;
>
> if (is_tdp_mmu_fault)
> r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
> pfn, prefault);
> else
> r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
> prefault, is_tdp);



2021-07-27 13:29:32

by Maxim Levitsky

Subject: Re: [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM

On Tue, 2021-07-27 at 00:34 +0200, Paolo Bonzini wrote:
> On 13/07/21 16:20, Maxim Levitsky wrote:
> > + mutex_lock(&vcpu->kvm->apicv_update_lock);
> > +
> > vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
> > kvm_apic_update_apicv(vcpu);
> > static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
> > @@ -9246,6 +9248,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> > */
> > if (!vcpu->arch.apicv_active)
> > kvm_make_request(KVM_REQ_EVENT, vcpu);
> > +
> > + mutex_unlock(&vcpu->kvm->apicv_update_lock);
>
> Does this whole piece of code need the lock/unlock? Does it work and/or
> make sense to do the unlock immediately after mutex_lock()? This makes
> it clearer that the mutex is being to synchronize against the requestor.

Yes, I do need to hold the mutex for the whole duration.

The requester does the following:

1. Take the mutex

2. Kick all the vCPUs out of guest mode with KVM_REQ_EVENT.
At that point all these vCPUs will be (or soon will be) stuck on the mutex
and guaranteed to be outside of guest mode,
which is exactly what I need to prevent them from entering guest
mode as long as the AVIC's memslot state is not up to date.

3. Update kvm->arch.apicv_inhibit_reasons. I removed the cmpxchg loop
since it is now protected under the lock anyway.
This step doesn't have to be done at this point, but should be done while the mutex is held
so that there is no need for cmpxchg and such.

This itself isn't the justification for the mutex.

4. Update the memslot

5. Release the mutex.
Only now are all other vCPUs permitted to enter guest mode again
(since only now is the memslot up to date),
and they will also update their per-vCPU AVIC enablement prior to entering it.


I think it might be possible to avoid the mutex, but I am not sure if this is worth it:

First of all, the above sync sequence is only really needed when we enable AVIC.

(Because from the moment we enable the memslot to the moment the vCPU enables the AVIC,
it must not be in guest mode, as otherwise it will access the dummy page in the memslot
without a VMexit, instead of going through the AVIC vmexit/acceleration.)

The other way around is OK. If we disable the memslot while a vCPU still has an enabled AVIC,
it will just get a page fault, which will be correctly emulated as an APIC read/write by
the MMIO page fault path.

If I had a guarantee that enabling the memslot (either as it is done today, or
using kvm_zap_gfn_range, which I strongly prefer) would always raise a TLB
flush request on all vCPUs, then I could (ab)use that request to update the local
AVIC state.

Or I can just always check whether the local AVIC state matches the memslot, and update
it if it doesn't, prior to guest mode entry.

I still think I would prefer a mutex to be 100% sure that there are no races,
since the whole AVIC disablement isn't something that is done often anyway.

Best regards,
Maxim Levitsky

>
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ed4d1581d502..ba5d5d9ebc64 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -943,6 +943,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > mutex_init(&kvm->irq_lock);
> > mutex_init(&kvm->slots_lock);
> > mutex_init(&kvm->slots_arch_lock);
> > + mutex_init(&kvm->apicv_update_lock);
> > INIT_LIST_HEAD(&kvm->devices);
> >
> > BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> >
>
> Please add comments above fields that are protected by this lock
> (anything but apicv_inhibit_reasons?), and especially move it to kvm->arch.
I agree, I will do this.


>
> Paolo



2021-07-27 17:49:58

by Ben Gardon

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Tue, Jul 27, 2021 at 6:06 AM Maxim Levitsky <[email protected]> wrote:
>
> On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> > +Ben
> >
> > On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > > lock when needed.
> > > >
> > > > The entire approach of dynamically adding/removing the memslot seems doomed to
> > > > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > > > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > > > and again on re-enable.
> > > >
> > > > Rather than pile on more gunk, what about special casing the APIC access page
> > > > memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> > > > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > > > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > > > also give us a good excuse to rename try_async_pf() :-)
> > > >
> > > > If lack of MMIO caching is a performance problem, an alternative solution would
> > > > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > > > clear their cache.
> > > >
> > > > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > > > less awful than the current memslot+SRCU mess.
> > >
> > > Hi!
> > >
> > > I am testing your approach and it actually works very well! I can't seem to break it.
> > >
> > > Could you explain why do I need to do something with kvm_{inc,dec}_notifier_count()) ?
> >
> > Glad you asked, there's one more change needed. kvm_zap_gfn_range() currently
> > takes mmu_lock for read, but it needs to take mmu_lock for write for this case
> > (more way below).
> >
> > The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy. The
> > whole thing is a grey area because KVM is trying to ensure it honors the guest's
> > UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
> > i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
> > same transition. So in practice there's likely no observable bug, but it also
> > means that taking mmu_lock for read is likely pointless, because for things to
> > work the guest has to serialize all running vCPUs.
> >
> > Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()? It would
> > effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
> > operate under the mmu read lock"); see attached patch. And we could even bump
> > the notifier count in that helper, e.g. on top of the attached:
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b607e8763aa2..7174058e982b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> > write_lock(&kvm->mmu_lock);
> >
> > + kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
> > +
> > if (kvm_memslots_have_rmaps(kvm)) {
> > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > slots = __kvm_memslots(kvm, i);
> > @@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > if (flush)
> > kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
> >
> > + kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
> > +
> > write_unlock(&kvm->mmu_lock);
> > }
> >
>
> I understand what you mean now. I thought that I needed to change the code of
> kvm_inc_notifier_count/kvm_dec_notifier_count.
>
>
>
>
> >
> >
> >
> > Back to Maxim's original question...
> >
> > Elevating mmu_notifier_count and bumping mmu_notifier_seq will handle the case
> > where APICv is being disabled while a different vCPU is concurrently faulting in a
> > new mapping for the APIC page. E.g. it handles this race:
> >
> > vCPU0 vCPU1
> > apic_access_memslot_enabled = true;
> > #NPF on APIC
> > apic_access_memslot_enabled==true, proceed with #NPF
> > apic_access_memslot_enabled = false
> > kvm_zap_gfn_range(APIC);
> > __direct_map(APIC)
> >
> > mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate
>
> I understand this now. I guess this can't happen with the original memslot-disable
> approach, which has the needed locking and flushing to avoid this.
> (I didn't study the code in depth, though.)
>
> >
> >
> >
> > The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> > to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
> > vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> > to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> > count was decremented).
> >
> > This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
> > to run in parallel with the page fault handler, there's no guarantee that the
> > correct apic_access_memslot_enabled will be observed.
>
> I understand now.
>
> So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
> Do you agree to revert the usage of the read lock?
>
> I will post a new series using this approach very soon, since I already have
> most of the code done.
>
> Best regards,
> Maxim Levitsky

From reading through this thread, it seems like switching from read
lock to write lock is only necessary for a small range of GFNs (i.e.
the APIC access page), is that correct?
My initial reaction was that switching kvm_zap_gfn_range back to the
write lock would be terrible for performance, but given that it has only
two callers, I think it would actually be fine.
If you do that though, you should pass shared=false to
kvm_tdp_mmu_zap_gfn_range in that function, so that it knows it's
operating with exclusive access to the MMU lock.

>
> >
> > if (is_tdp_mmu_fault)
> > read_lock(&vcpu->kvm->mmu_lock);
> > else
> > write_lock(&vcpu->kvm->mmu_lock);
> >
> > if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
> > goto out_unlock;
> >
> > if (is_tdp_mmu_fault)
> > r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
> > pfn, prefault);
> > else
> > r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
> > prefault, is_tdp);
>
>

2021-07-27 18:19:47

by Sean Christopherson

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Tue, Jul 27, 2021, Ben Gardon wrote:
> On Tue, Jul 27, 2021 at 6:06 AM Maxim Levitsky <[email protected]> wrote:
> >
> > On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> > > The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> > > to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
> > > vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> > > to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> > > count was decremented).
> > >
> > > This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
> > > to run in parallel with the page fault handler, there's no guarantee that the
> > > correct apic_access_memslot_enabled will be observed.
> >
> > I understand now.
> >
> > So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
> > Do you agree to revert the usage of the read lock?
> >
> > I will post a new series using this approach very soon, since I already have
> > most of the code done.
> >
> > Best regards,
> > Maxim Levitsky
>
> From reading through this thread, it seems like switching from read
> lock to write lock is only necessary for a small range of GFNs, (i.e.
> the APIC access page) is that correct?

For the APICv case, yes, literally a single GFN (the default APIC base).

> My initial reaction was that switching kvm_zap_gfn_range back to the
> write lock would be terrible for performance, but given its only two
> callers, I think it would actually be fine.

And more importantly, the two callers are gated by kvm_arch_has_noncoherent_dma()
and are very rare flows for the guest (updating MTRRs, toggling CR0.CD).

> If you do that though, you should pass shared=false to
> kvm_tdp_mmu_zap_gfn_range in that function, so that it knows it's
> operating with exclusive access to the MMU lock.

Ya, my suggested revert was to drop @shared entirely since kvm_zap_gfn_range() is
the only caller that passes @shared=true.

2021-07-29 14:13:43

by Maxim Levitsky

Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

On Tue, 2021-07-27 at 18:17 +0000, Sean Christopherson wrote:
> On Tue, Jul 27, 2021, Ben Gardon wrote:
> > On Tue, Jul 27, 2021 at 6:06 AM Maxim Levitsky <[email protected]> wrote:
> > > On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> > > > The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> > > > to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
> > > > vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> > > > to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> > > > count was decremented).
> > > >
> > > > This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
> > > > to run in parallel with the page fault handler, there's no guarantee that the
> > > > correct apic_access_memslot_enabled will be observed.
> > >
> > > I understand now.
> > >
> > > So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
> > > Do you agree to revert the usage of the read lock?
> > >
> > > I will post a new series using this approach very soon, since I already have
> > > most of the code done.
> > >
> > > Best regards,
> > > Maxim Levitsky
> >
> > From reading through this thread, it seems like switching from read
> > lock to write lock is only necessary for a small range of GFNs, (i.e.
> > the APIC access page) is that correct?
>
> For the APICv case, yes, literally a single GFN (the default APIC base).
>
> > My initial reaction was that switching kvm_zap_gfn_range back to the
> > write lock would be terrible for performance, but given its only two
> > callers, I think it would actually be fine.
>
> And more importantly, the two callers are gated by kvm_arch_has_noncoherent_dma()
> and are very rare flows for the guest (updating MTRRs, toggling CR0.CD).
>
> > If you do that though, you should pass shared=false to
> > kvm_tdp_mmu_zap_gfn_range in that function, so that it knows it's
> > operating with exclusive access to the MMU lock.
>
> Ya, my suggested revert was to drop @shared entirely since kvm_zap_gfn_range() is
> the only caller that passes @shared=true.
>


Just one question:
Should I submit the patches for MMU changes that you described,
and on top of them my AVIC patches?

Should I worry about the new TDP mmu?

Best regards,
Maxim Levitsky


2021-08-02 09:23:17

by Maxim Levitsky

Subject: Re: KVM's support for non default APIC base

On Thu, 2021-07-22 at 12:12 +0300, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > I am more inclined to fix this by just tracking if we hold the srcu
> > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > and then kvm_request_apicv_update can use this to drop the srcu
> > > lock when needed.
> >
> > The entire approach of dynamically adding/removing the memslot seems doomed to
> > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > and again on re-enable.
> >
> > Rather than pile on more gunk, what about special casing the APIC access page
> > memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > also give us a good excuse to rename try_async_pf() :-)
> >
> > If lack of MMIO caching is a performance problem, an alternative solution would
> > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > clear their cache.
> >
> > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > less awful than the current memslot+SRCU mess
>
> Hi Sean, Paolo and everyone else:
>
> I am exploring the approach that you proposed and I noticed that we have very inconsistent
> code that handles APIC base relocation for the in-kernel local APIC.
> I do know that APIC base relocation is not supported, and I don't have anything against
> this as long as VMs don't use that feature, but I do want this to be consistent.
>
> I did a study of the code that is involved in this mess and I would like to hear your opinion:
>
> There are basically 3 modes of operation of in kernel local apic:
>
> Regular unaccelerated local apic:
>
> -> APIC MMIO base address is stored at 'apic->base_address', and updated in
> kvm_lapic_set_base which is called from msr write of 'MSR_IA32_APICBASE'
> (both by SVM and VMX).
> The value is even migrated.
>
> -> APIC mmio read/write is done from MMU, when we access MMIO page:
> vcpu_mmio_write always calls apic_mmio_write which checks if the write is in
> apic->base_address page and if so forwards the write local apic with offset
> in this page.
>
> -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> thus must contain no guest memslots.
> If the guest relocates the APIC base somewhere where we have a memslot,
> memslot will take priority, while on real hardware, LAPIC is likely to
> take priority.
>
> APICv:
>
> -> The default apic MMIO base (0xfee00000) is covered by a dummy page which is
> allocated from qemu's process using __x86_set_memory_region.
>
> This is done once in alloc_apic_access_page which is called on vcpu creation,
> (but only once when the memslot is not yet enabled)
>
> -> to avoid pinning this page into qemu's memory, reference to it
> is dropped in alloc_apic_access_page.
> (see commit c24ae0dcd3e8695efa43e71704d1fc4bc7e29e9b)
>
> -> kvm_arch_mmu_notifier_invalidate_range -> checks if we invalidate GPA 0xfee00000
> and if so, raises KVM_REQ_APIC_PAGE_RELOAD request.
>
> -> kvm_vcpu_reload_apic_access_page handles the KVM_REQ_APIC_PAGE_RELOAD request by calling
> kvm_x86_ops.set_apic_access_page_addr which is only implemented on VMX
> (vmx_set_apic_access_page_addr)
>
> -> vmx_set_apic_access_page_addr does gfn_to_page on 0xfee00000 GPA,
> and if the result is valid, writes the physical address of this page to APIC_ACCESS_ADDR vmcs field.
>
> (This is a major difference from the AVIC - AVIC's avic_vapic_bar is *GPA*, while APIC_ACCESS_ADDR
> is host physical address which the hypervisor is supposed to map at APIC MMIO GPA using EPT)
>
> Note that if we have an error here, we might end with invalid APIC_ACCESS_ADDR field.
>
> -> writes to the HPA of that special page (which has GPA of 0xfee00000, and mapped via EPT) go to
> APICv or cause special VM exits: (EXIT_REASON_APIC_ACCESS, EXIT_REASON_APIC_WRITE)
>
> * EXIT_REASON_APIC_ACCESS (which is used for the older, limited 'flexpriority' mode which practically only emulates TPR)
> actually emulates the instruction to know the written value,
> but we have a special case in vcpu_is_mmio_gpa which makes the emulation treat the access to the default
> apic base as MMIO.
>
> * EXIT_REASON_APIC_WRITE is a trap VMexit which comes with full APICv, and since it also has offset
> qualification and the value is already in the apic page, this info is just passed to kvm_lapic_reg_write
>
>
> -> If APIC base is relocated, the APICv isn't aware of it, and the writes to new APIC base,
> (assuming that we have no memslots covering it) will go through standard APIC MMIO emulation,
> and *might* create a mess.
>
> AVIC:
>
> -> The default apic MMIO base (0xfee00000)
> is also covered by a dummy page which is allocated from qemu's process using __x86_set_memory_region
> in avic_update_access_page which is called also on vcpu creation (also only once),
> and from SVM's dynamic AVIC inhibition.
>
> -> The reference to this page is not dropped thus there is no KVM_REQ_APIC_PAGE_RELOAD handler.
> I think we should do the same we do for APICv here?
>
> -> avic_vapic_bar is GPA and thus contains 0xfee00000 but writes to MSR_IA32_APICBASE do update it
> (avic_update_vapic_bar which is called from MSR_IA32_APICBASE write in SVM code)
>
> thus if the guest relocates the APIC base to a writable memory page, AVIC would actually happen to work
> (opposite from the stock xAPIC handling, which only works when the apic base is in an MMIO area).
>
> -> writes to the GPA in avic_vapic_bar are first checked in NPT (but the HPA written there is ignored),
> and then either go to AVIC or cause SVM_EXIT_AVIC_UNACCELERATED_ACCESS which has offset of the write
> in the exit_info_1
> (there is also SVM_EXIT_AVIC_INCOMPLETE_IPI which is called on some ICR writes)
>
>
> As far as I know the only good reason to relocate the APIC base is to access it from real mode,
> which is not something that is done these days by modern BIOSes.
>
> I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default base is set and apic enabled)
> and remove all remains of the support for variable APIC base.
> (we already have a warning when APIC base is set to non default value)
>
>
> Best regards,
> Maxim Levitsky
>
Ping.

Best regards,
Maxim Levitsky


2021-08-07 00:20:16

by Sean Christopherson

Subject: Re: KVM's support for non default APIC base

On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> thus must contain no guest memslots.
> If the guest relocates the APIC base somewhere where we have a memslot,
> memslot will take priority, while on real hardware, LAPIC is likely to
> take priority.

Yep. The thing that really bites us is that other vCPUs should still be able to
access the memory defined by the memslot, e.g. to make it work we'd have to run
the vCPU with a completely different MMU root.

> As far as I know the only good reason to relocate APIC base is to access it
> from the real mode which is not something that is done these days by modern
> BIOSes.
>
> I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> base is set and apic enabled) and remove all remains of the support for
> variable APIC base.

Making up our own behavior is almost never the right approach. E.g. _best_ case
scenario for an unexpected #GP is the guest immediately terminates. Worst case
scenario is the guest eats the #GP and continues on, which is basically the status
quo, except it's guaranteed to now work, whereas todays behavior can at least let
the guest function, for some definitions of "function".

I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
Whether or not we can do that without breaking userspace is probably the big
question. Fully emulating APIC base relocation would be a tremendous amount of
effort and complexity for practically zero benefit.

> (we already have a warning when APIC base is set to non default value)

FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
misbehaving guest unless it's the first guest to misbehave on a particular
instantiation of KVM. _ratelimited() would improve the situation, but not
completely eliminate the possibility of a misbehaving guest going unnoticed.
Anything else isn't an option because it's obviously guest triggerable.

2021-08-09 09:41:52

by Maxim Levitsky

Subject: Re: KVM's support for non default APIC base

On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> > thus must contain no guest memslots.
> > If the guest relocates the APIC base somewhere where we have a memslot,
> > memslot will take priority, while on real hardware, LAPIC is likely to
> > take priority.
>
> Yep. The thing that really bites us is that other vCPUs should still be able to
> access the memory defined by the memslot, e.g. to make it work we'd have to run
> the vCPU with a completely different MMU root.
That is something I hadn't taken into account.
The complexity of supporting this indeed isn't worth it.

>
> > As far as I know the only good reason to relocate APIC base is to access it
> > from the real mode which is not something that is done these days by modern
> > BIOSes.
> >
> > I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> > base is set and apic enabled) and remove all remains of the support for
> > variable APIC base.
>
> Making up our own behavior is almost never the right approach. E.g. _best_ case
> scenario for an unexpected #GP is the guest immediately terminates. Worst case
> scenario is the guest eats the #GP and continues on, which is basically the status
> quo, except it's guaranteed to now work, whereas todays behavior can at least let
> the guest function, for some definitions of "function".

Well, at least Intel's PRM does state that APIC base relocation is not guaranteed
to work on all CPUs, so giving the guest a #GP is like telling it that the current CPU doesn't
support it. In theory, a very well behaved guest can catch the exception and
fall back to the default base.

I don't understand what you mean by 'guaranteed to now work'. If the guest
ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
A well-behaved guest should never assume that an MSR write that failed with a #GP
worked.


>
> I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> Whether or not we can do that without breaking userspace is probably the big
> question. Fully emulating APIC base relocation would be a tremendous amount of
> effort and complexity for practically zero benefit.

I have nothing against this as well, although I kind of like the #GP approach a bit more;
knowing that there are barely any reasons
to relocate the APIC base, and that it doesn't work well, there is a good chance
that no one does it anyway (except our kvm unit tests, but that isn't an issue).

>
> > (we already have a warning when APIC base is set to non default value)
>
> FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> misbehaving guest unless it's the first guest to misbehave on a particular
> instantiation of KVM. _ratelimited() would improve the situation, but not
> completely eliminate the possibility of a misbehaving guest going unnoticed.
> Anything else isn't an option becuase it's obviously guest triggerable.

100% agree.

I'd say I would first make it _ratelimited() for a few KVM versions, and then,
if nobody complains, make it a KVM internal error / #GP and remove all the leftovers
from the code that pretend that it can work.

And add a comment explaining *why*, as you explained, supporting APIC base relocation
isn't worth it.

Best regards,
Maxim Levitsky

>


2021-08-09 16:00:05

by Sean Christopherson

Subject: Re: KVM's support for non default APIC base

On Mon, Aug 09, 2021, Maxim Levitsky wrote:
> On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> > Making up our own behavior is almost never the right approach. E.g. _best_ case
> > scenario for an unexpected #GP is the guest immediately terminates. Worst case
> > scenario is the guest eats the #GP and continues on, which is basically the status
> > quo, except it's guaranteed to now work, whereas todays behavior can at least let
> > the guest function, for some definitions of "function".
>
> Well, at least the Intel's PRM does state that APIC base relocation is not guaranteed
> to work on all CPUs, so giving the guest a #GP is like telling it that current CPU doesn't
> support it. In theory, a very well behaving guest can catch the exception and
> fail back to the default base.
>
> I don't understand what do you mean by 'guaranteed to now work'. If the guest

Doh, typo, it should be "not", i.e. "guaranteed to not work". As in, allowing the
unsupported WRMSR could work depending on what features KVM is using and what the
guest is doing, whereas injecting #GP is guaranteed to break the guest.

> ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
> A well behaving guest should never assume that a msr write that failed with #GP
> worked.
>
> >
> > I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> > Whether or not we can do that without breaking userspace is probably the big
> > question. Fully emulating APIC base relocation would be a tremendous amount of
> > effort and complexity for practically zero benefit.
>
> I have nothing against this as well although I kind of like the #GP approach
> a bit more, and knowing that there are barely any reasons to relocate the
> APIC base, and that it doesn't work well, there is a good chance that no one
> does it anyway (except our kvm unit tests, but that isn't an issue).

Injecting an exception that architecturally should not happen is simply not
acceptable. Silently (and partially) ignoring the WRMSR isn't acceptable either,
but we can't travel back in time to fix that so we're stuck with it unless we can
change the behavior without anyone complaining.

> > > (we already have a warning when APIC base is set to non default value)
> >
> > FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> > misbehaving guest unless it's the first guest to misbehave on a particular
> > instantiation of KVM. _ratelimited() would improve the situation, but not
> > completely eliminate the possibility of a misbehaving guest going unnoticed.
> > > Anything else isn't an option because it's obviously guest triggerable.
>
> 100% agree.
>
> I'll say I would first make it _ratelimited() for few KVM versions, and then
> if nobody complains, make it a KVM internal error / #GP, and remove all the
> leftovers from the code that pretend that it can work.

I don't see any point in temporarily making it _ratelimited(), (1) the odds of someone
running a guest that relies on APIC base relocation are very low, (2) the odds of
that someone noticing a _ratelimited() and not a _once() are even lower, and (3) the
odds of that prompting a bug report are even lower still.

> And add a comment explaining *why* as you explained, supporting APIC base relocation
> isn't worth it.
>
> Best regards,
> Maxim Levitsky
>
> >
>
>

2021-08-09 16:49:25

by Jim Mattson

Subject: Re: KVM's support for non default APIC base

On Mon, Aug 9, 2021 at 2:40 AM Maxim Levitsky <[email protected]> wrote:
>
> On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> > On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> > > thus must contain no guest memslots.
> > > If the guest relocates the APIC base somewhere where we have a memslot,
> > > memslot will take priority, while on real hardware, LAPIC is likely to
> > > take priority.
> >
> > Yep. The thing that really bites us is that other vCPUs should still be able to
> > access the memory defined by the memslot, e.g. to make it work we'd have to run
> > the vCPU with a completely different MMU root.
> That is something I haven't took in the account.
> Complexity of supporting this indeed isn't worth it.
>
> >
> > > As far as I know the only good reason to relocate APIC base is to access it
> > > from the real mode which is not something that is done these days by modern
> > > BIOSes.
> > >
> > > I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> > > base is set and apic enabled) and remove all remains of the support for
> > > variable APIC base.
> >
> > Making up our own behavior is almost never the right approach. E.g. _best_ case
> > scenario for an unexpected #GP is the guest immediately terminates. Worst case
> > scenario is the guest eats the #GP and continues on, which is basically the status
> > quo, except it's guaranteed to now work, whereas todays behavior can at least let
> > the guest function, for some definitions of "function".
>
> Well, at least the Intel's PRM does state that APIC base relocation is not guaranteed
> to work on all CPUs, so giving the guest a #GP is like telling it that current CPU doesn't
> support it. In theory, a very well behaving guest can catch the exception and
> fail back to the default base.
>
> I don't understand what do you mean by 'guaranteed to now work'. If the guest
> ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
> A well behaving guest should never assume that a msr write that failed with #GP
> worked.
>
>
> >
> > I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> > Whether or not we can do that without breaking userspace is probably the big
> > question. Fully emulating APIC base relocation would be a tremendous amount of
> > effort and complexity for practically zero benefit.
>
> I have nothing against this as well although I kind of like the #GP approach a bit more,
> and knowing that there are barely any reasons
> to relocate the APIC base, and that it doesn't work well, there is a good chance
> that no one does it anyway (except our kvm unit tests, but that isn't an issue).
>
> >
> > > (we already have a warning when APIC base is set to non default value)
> >
> > FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> > misbehaving guest unless it's the first guest to misbehave on a particular
> > instantiation of KVM. _ratelimited() would improve the situation, but not
> > completely eliminate the possibility of a misbehaving guest going unnoticed.
> > > Anything else isn't an option because it's obviously guest triggerable.
>
> 100% agree.
>
> I'll say I would first make it _ratelimited() for few KVM versions, and then
> if nobody complains, make it a KVM internal error / #GP, and remove all the leftovers
> from the code that pretend that it can work.

Printing things to syslog is not very helpful. Any time that kvm
violates the architectural specification, it should provide
information about the emulation error to userspace.

2021-08-10 20:46:49

by Maxim Levitsky

Subject: Re: KVM's support for non default APIC base

On Mon, 2021-08-09 at 09:47 -0700, Jim Mattson wrote:
> On Mon, Aug 9, 2021 at 2:40 AM Maxim Levitsky <[email protected]> wrote:
> > On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> > > On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > > > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> > > > thus must contain no guest memslots.
> > > > If the guest relocates the APIC base somewhere where we have a memslot,
> > > > memslot will take priority, while on real hardware, LAPIC is likely to
> > > > take priority.
> > >
> > > Yep. The thing that really bites us is that other vCPUs should still be able to
> > > access the memory defined by the memslot, e.g. to make it work we'd have to run
> > > the vCPU with a completely different MMU root.
> > That is something I haven't took in the account.
> > Complexity of supporting this indeed isn't worth it.
> >
> > > > As far as I know the only good reason to relocate APIC base is to access it
> > > > from the real mode which is not something that is done these days by modern
> > > > BIOSes.
> > > >
> > > > I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> > > > base is set and apic enabled) and remove all remains of the support for
> > > > variable APIC base.
> > >
> > > Making up our own behavior is almost never the right approach. E.g. _best_ case
> > > scenario for an unexpected #GP is the guest immediately terminates. Worst case
> > > scenario is the guest eats the #GP and continues on, which is basically the status
> > > quo, except it's guaranteed to now work, whereas todays behavior can at least let
> > > the guest function, for some definitions of "function".
> >
> > Well, at least the Intel's PRM does state that APIC base relocation is not guaranteed
> > to work on all CPUs, so giving the guest a #GP is like telling it that current CPU doesn't
> > support it. In theory, a very well behaving guest can catch the exception and
> > fail back to the default base.
> >
> > I don't understand what do you mean by 'guaranteed to now work'. If the guest
> > ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
> > A well behaving guest should never assume that a msr write that failed with #GP
> > worked.
> >
> >
> > > I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> > > Whether or not we can do that without breaking userspace is probably the big
> > > question. Fully emulating APIC base relocation would be a tremendous amount of
> > > effort and complexity for practically zero benefit.
> >
> > I have nothing against this as well although I kind of like the #GP approach a bit more,
> > and knowing that there are barely any reasons
> > to relocate the APIC base, and that it doesn't work well, there is a good chance
> > that no one does it anyway (except our kvm unit tests, but that isn't an issue).
> >
> > > > (we already have a warning when APIC base is set to non default value)
> > >
> > > FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> > > misbehaving guest unless it's the first guest to misbehave on a particular
> > > instantiation of KVM. _ratelimited() would improve the situation, but not
> > > completely eliminate the possibility of a misbehaving guest going unnoticed.
> > > > Anything else isn't an option because it's obviously guest triggerable.
> >
> > 100% agree.
> >
> > I'll say I would first make it _ratelimited() for few KVM versions, and then
> > if nobody complains, make it a KVM internal error / #GP, and remove all the leftovers
> > from the code that pretend that it can work.
>
> Printing things to syslog is not very helpful. Any time that kvm
> violates the architectural specification, it should provide
> information about the emulation error to userspace.
>
Paolo, what do you think?

My personal opinion is that we should indeed raise a KVM internal error
on all attempts to change the APIC base.

If someone complains, then we can look at their use-case.

My view is that any half-working feature is bound to bitrot
and cause harm and confusion.

Best regards,
Maxim Levitsky