This is basically two series smushed into one. The first "half" aims
to differentiate between "halt" and a more generic "block", where "halt"
aligns with x86's HLT instruction, the halt-polling mechanisms, and
associated stats, and "block" means any guest action that causes the vCPU
to block/wait.
The second "half" overhauls x86's APIC virtualization code (Posted
Interrupts on Intel VMX, AVIC on AMD SVM) to do their updates in response
to vCPU (un)blocking in the vcpu_load/put() paths, keying off of the
vCPU's rcuwait status to determine when a blocking vCPU is being put and
reloaded. This idea comes from arm64's kvm_timer_vcpu_put(), which I
stumbled across when diving into the history of arm64's (un)blocking hooks.
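As a rough illustration of the borrowed pattern (abridged sketch of arm64's
kvm_timer_vcpu_put(); only the rcuwait check is shown, all other details
elided):

  void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
  {
          ...
          /*
           * The vCPU task is being scheduled out while the vCPU is blocked
           * on its rcuwait, i.e. this is a "put" of a blocking vCPU.
           */
          if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)))
                  kvm_timer_blocking(vcpu);
          ...
  }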
The x86 APICv overhaul allows for killing off several sets of hooks in
common KVM and in x86 KVM (to the vendor code). Moving everything to
vcpu_put/load() also realizes nice cleanups, especially for the Posted
Interrupt code, which required some impressive mental gymnastics to
understand how vCPU task migration interacted with vCPU blocking.
Non-x86 folks, sorry for the noise. I'm hoping the common parts can get
applied without much fuss so that future versions can be x86-only.
v2:
- Collect reviews. [Christian, David]
- Add patch to move arm64 WFI functionality out of hooks. [Marc]
- Add RISC-V to the fun.
- Add all the APICv fun.
v1: https://lkml.kernel.org/r/[email protected]
Jing Zhang (1):
KVM: stats: Add stat to detect if vcpu is currently blocking
Sean Christopherson (42):
KVM: VMX: Don't unblock vCPU w/ Posted IRQ if IRQs are disabled in
guest
KVM: SVM: Ensure target pCPU is read once when signalling AVIC
doorbell
KVM: s390: Ensure kvm_arch_no_poll() is read once when blocking vCPU
KVM: Force PPC to define its own rcuwait object
KVM: Update halt-polling stats if and only if halt-polling was
attempted
KVM: Refactor and document halt-polling stats update helper
KVM: Reconcile discrepancies in halt-polling stats
KVM: s390: Clear valid_wakeup in kvm_s390_handle_wait(), not in arch
hook
KVM: Drop obsolete kvm_arch_vcpu_block_finish()
KVM: arm64: Move vGIC v4 handling for WFI out of arch callback hook
KVM: Don't block+unblock when halt-polling is successful
KVM: x86: Tweak halt emulation helper names to free up kvm_vcpu_halt()
KVM: Rename kvm_vcpu_block() => kvm_vcpu_halt()
KVM: Split out a kvm_vcpu_block() helper from kvm_vcpu_halt()
KVM: Don't redo ktime_get() when calculating halt-polling
stop/deadline
KVM: x86: Directly block (instead of "halting") UNINITIALIZED vCPUs
KVM: x86: Invoke kvm_vcpu_block() directly for non-HALTED wait states
KVM: Add helpers to wake/query blocking vCPU
KVM: VMX: Skip Posted Interrupt updates if APICv is hard disabled
KVM: VMX: Clean up PI pre/post-block WARNs
KVM: VMX: Drop unnecessary PI logic to handle impossible conditions
KVM: VMX: Use boolean returns for Posted Interrupt "test" helpers
KVM: VMX: Drop pointless PI.NDST update when blocking
KVM: VMX: Save/restore IRQs (instead of CLI/STI) during PI pre/post
block
KVM: VMX: Read Posted Interrupt "control" exactly once per loop
iteration
KVM: VMX: Move Posted Interrupt ndst computation out of write loop
KVM: VMX: Remove vCPU from PI wakeup list before updating PID.NV
KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
KVM: Drop unused kvm_vcpu.pre_pcpu field
KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
KVM: SVM: Don't bother checking for "running" AVIC when kicking for
IPIs
KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with
APICv)
KVM: Drop defunct kvm_arch_vcpu_(un)blocking() hooks
KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
vCPU
KVM: VMX: Pass desired vector instead of bool for triggering posted
IRQ
KVM: VMX: Fold fallback path into triggering posted IRQ helper
KVM: VMX: Don't do full kick when handling posted interrupt wakeup
arch/arm64/include/asm/kvm_emulate.h | 2 +
arch/arm64/include/asm/kvm_host.h | 1 -
arch/arm64/kvm/arch_timer.c | 5 +-
arch/arm64/kvm/arm.c | 60 +++---
arch/arm64/kvm/handle_exit.c | 5 +-
arch/arm64/kvm/psci.c | 2 +-
arch/mips/include/asm/kvm_host.h | 3 -
arch/mips/kvm/emulate.c | 2 +-
arch/powerpc/include/asm/kvm_host.h | 4 +-
arch/powerpc/kvm/book3s_pr.c | 2 +-
arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
arch/powerpc/kvm/booke.c | 2 +-
arch/powerpc/kvm/powerpc.c | 5 +-
arch/riscv/include/asm/kvm_host.h | 1 -
arch/riscv/kvm/vcpu_exit.c | 2 +-
arch/s390/include/asm/kvm_host.h | 4 -
arch/s390/kvm/interrupt.c | 3 +-
arch/s390/kvm/kvm-s390.c | 7 +-
arch/x86/include/asm/kvm-x86-ops.h | 4 -
arch/x86/include/asm/kvm_host.h | 29 +--
arch/x86/kvm/lapic.c | 4 +-
arch/x86/kvm/svm/avic.c | 95 ++++-----
arch/x86/kvm/svm/svm.c | 8 -
arch/x86/kvm/svm/svm.h | 14 --
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/vmx/posted_intr.c | 279 ++++++++++++---------------
arch/x86/kvm/vmx/posted_intr.h | 14 +-
arch/x86/kvm/vmx/vmx.c | 63 +++---
arch/x86/kvm/vmx/vmx.h | 3 +
arch/x86/kvm/x86.c | 55 ++++--
include/linux/kvm_host.h | 27 ++-
include/linux/kvm_types.h | 1 +
virt/kvm/async_pf.c | 2 +-
virt/kvm/kvm_main.c | 138 +++++++------
34 files changed, 413 insertions(+), 437 deletions(-)
--
2.33.0.882.g93a45727a2-goog
Ensure vcpu->cpu is read once when signalling the AVIC doorbell. If the
compiler rereads the field and the vCPU is migrated between the check and
writing the doorbell, KVM would signal the wrong physical CPU.
Functionally, signalling the wrong CPU in this case is not an issue as
task migration means the vCPU has exited and will pick up any pending
interrupts on the next VMRUN. Add the READ_ONCE() purely to clean up the
code.
Opportunistically add a comment explaining the task migration behavior,
and rename cpuid=>cpu to avoid conflating the CPU number with KVM's more
common usage of CPUID.
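For illustration only (hypothetical codegen, not actual compiler output),
without READ_ONCE() the compiler is free to load vcpu->cpu twice, once for
the comparison and once for the doorbell write:

  if (vcpu->cpu != get_cpu())                     /* first load of vcpu->cpu */
          wrmsrl(SVM_AVIC_DOORBELL,
                 kvm_cpu_get_apicid(vcpu->cpu));  /* second load, may observe a new pCPU */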
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/avic.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 8052d92069e0..208c5c71e827 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -675,10 +675,17 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
smp_mb__after_atomic();
if (avic_vcpu_is_running(vcpu)) {
- int cpuid = vcpu->cpu;
+ int cpu = READ_ONCE(vcpu->cpu);
- if (cpuid != get_cpu())
- wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpuid));
+ /*
+ * Note, the vCPU could get migrated to a different pCPU at any
+ * point, which could result in signalling the wrong/previous
+ * pCPU. But if that happens the vCPU is guaranteed to do a
+ * VMRUN (after being migrated) and thus will process pending
+ * interrupts, i.e. a doorbell is not needed (and a spurious one is harmless).
+ */
+ if (cpu != get_cpu())
+ wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
put_cpu();
} else
kvm_vcpu_wake_up(vcpu);
--
2.33.0.882.g93a45727a2-goog
Wrap s390's halt_poll_max_steal with READ_ONCE and snapshot the result of
kvm_arch_no_poll() in kvm_vcpu_block() to avoid a mostly-theoretical,
largely benign bug on s390 where the result of kvm_arch_no_poll() could
change due to userspace modifying halt_poll_max_steal while the vCPU is
blocking. The bug is largely benign as it will either cause KVM to skip
updating halt-polling times (no_poll toggles false=>true) or to update
halt-polling times with a slightly flawed block_ns.
Note, READ_ONCE is unnecessary in the current code, add it in case the
arch hook is ever inlined, and to provide a hint that userspace can
change the param at will.
Fixes: 8b905d28ee17 ("KVM: s390: provide kvm_arch_no_poll function")
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 2 +-
virt/kvm/kvm_main.c | 5 +++--
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 6a6dd5e1daf6..7cabe6778b1b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3446,7 +3446,7 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
{
/* do not poll with more than halt_poll_max_steal percent of steal time */
if (S390_lowcore.avg_steal_timer * 100 / (TICK_USEC << 12) >=
- halt_poll_max_steal) {
+ READ_ONCE(halt_poll_max_steal)) {
vcpu->stat.halt_no_poll_steal++;
return true;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3f6d450355f0..7bc38549487e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3213,6 +3213,7 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
*/
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
+ bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
ktime_t start, cur, poll_end;
bool waited = false;
u64 block_ns;
@@ -3220,7 +3221,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_blocking(vcpu);
start = cur = poll_end = ktime_get();
- if (vcpu->halt_poll_ns && !kvm_arch_no_poll(vcpu)) {
+ if (vcpu->halt_poll_ns && halt_poll_allowed) {
ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
++vcpu->stat.generic.halt_attempted_poll;
@@ -3275,7 +3276,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
update_halt_poll_stats(
vcpu, ktime_to_ns(ktime_sub(poll_end, start)), waited);
- if (!kvm_arch_no_poll(vcpu)) {
+ if (halt_poll_allowed) {
if (!vcpu_valid_wakeup(vcpu)) {
shrink_halt_poll_ns(vcpu);
} else if (vcpu->kvm->max_halt_poll_ns) {
--
2.33.0.882.g93a45727a2-goog
Do not define/reference kvm_vcpu.wait if __KVM_HAVE_ARCH_WQP is true, and
instead force the architecture (PPC) to define its own rcuwait object.
Allowing common KVM to directly access vcpu->wait without a guard makes
it all too easy to introduce potential bugs, e.g. kvm_vcpu_block(),
kvm_vcpu_on_spin(), and async_pf_execute() all operate on vcpu->wait, not
the result of kvm_arch_vcpu_get_wait(), and so may do the wrong thing for
PPC.
Due to PPC's shenanigans with respect to callbacks and waits (it switches
to the virtual core's wait object at KVM_RUN!?!?), it's not clear whether
or not this fixes any bugs.
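For reference (shown here for context, not part of this diff), the accessor
that respects the arch override looks roughly like:

  static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
  {
  #ifdef __KVM_HAVE_ARCH_WQP
          return vcpu->arch.waitp;
  #else
          return &vcpu->wait;
  #endif
  }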
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/powerpc/include/asm/kvm_host.h | 1 +
arch/powerpc/kvm/powerpc.c | 3 ++-
include/linux/kvm_host.h | 2 ++
virt/kvm/async_pf.c | 2 +-
virt/kvm/kvm_main.c | 9 ++++++---
5 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 59cb38b04ede..876c10803cda 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -749,6 +749,7 @@ struct kvm_vcpu_arch {
u8 irq_pending; /* Used by XIVE to signal pending guest irqs */
u32 last_inst;
+ struct rcuwait wait;
struct rcuwait *waitp;
struct kvmppc_vcore *vcore;
int ret;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8ab90ce8738f..be22da157569 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -762,7 +762,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
if (err)
goto out_vcpu_uninit;
- vcpu->arch.waitp = &vcpu->wait;
+ rcuwait_init(&vcpu->arch.wait);
+ vcpu->arch.waitp = &vcpu->arch.wait;
kvmppc_create_vcpu_debugfs(vcpu, vcpu->vcpu_id);
return 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 60a35d9fe259..1ced2914d9ca 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -310,7 +310,9 @@ struct kvm_vcpu {
struct mutex mutex;
struct kvm_run *run;
+#ifndef __KVM_HAVE_ARCH_WQP
struct rcuwait wait;
+#endif
struct pid __rcu *pid;
int sigset_active;
sigset_t sigset;
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index dd777688d14a..ccb35c22785e 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -85,7 +85,7 @@ static void async_pf_execute(struct work_struct *work)
trace_kvm_async_pf_completed(addr, cr2_or_gpa);
- rcuwait_wake_up(&vcpu->wait);
+ rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
mmput(mm);
kvm_put_kvm(vcpu->kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7bc38549487e..5d4a90032277 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -421,7 +421,9 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
vcpu->pid = NULL;
+#ifndef __KVM_HAVE_ARCH_WQP
rcuwait_init(&vcpu->wait);
+#endif
kvm_async_pf_vcpu_init(vcpu);
vcpu->pre_pcpu = -1;
@@ -3213,6 +3215,7 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
*/
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
+ struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
ktime_t start, cur, poll_end;
bool waited = false;
@@ -3251,7 +3254,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
}
- prepare_to_rcuwait(&vcpu->wait);
+ prepare_to_rcuwait(wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
@@ -3261,7 +3264,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
waited = true;
schedule();
}
- finish_rcuwait(&vcpu->wait);
+ finish_rcuwait(wait);
cur = ktime_get();
if (waited) {
vcpu->stat.generic.halt_wait_ns +=
@@ -3460,7 +3463,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (vcpu == me)
continue;
- if (rcuwait_active(&vcpu->wait) &&
+ if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) &&
!vcpu_dy_runnable(vcpu))
continue;
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
--
2.33.0.882.g93a45727a2-goog
Don't configure the wakeup handler when a vCPU is blocking with IRQs
disabled, in which case any IRQ, posted or otherwise, should not be
recognized and thus should not wake the vCPU.
Fixes: bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..3263056784f5 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -142,8 +142,9 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
- !irq_remapping_cap(IRQ_POSTING_CAP) ||
- !kvm_vcpu_apicv_active(vcpu))
+ !irq_remapping_cap(IRQ_POSTING_CAP) ||
+ !kvm_vcpu_apicv_active(vcpu) ||
+ vmx_interrupt_blocked(vcpu))
return 0;
WARN_ON(irqs_disabled());
--
2.33.0.882.g93a45727a2-goog
Don't update halt-polling stats if halt-polling wasn't attempted. This
is a nop as @poll_ns is guaranteed to be '0' (poll_end == start), but it
will allow a future patch to move the histogram stats into the helper to
resolve a discrepancy in what is considered a "successful" halt-poll.
No functional change intended.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5d4a90032277..6156719bcbbc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3217,6 +3217,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
+ bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
ktime_t start, cur, poll_end;
bool waited = false;
u64 block_ns;
@@ -3224,7 +3225,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_blocking(vcpu);
start = cur = poll_end = ktime_get();
- if (vcpu->halt_poll_ns && halt_poll_allowed) {
+ if (do_halt_poll) {
ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
++vcpu->stat.generic.halt_attempted_poll;
@@ -3276,8 +3277,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_unblocking(vcpu);
block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
- update_halt_poll_stats(
- vcpu, ktime_to_ns(ktime_sub(poll_end, start)), waited);
+ if (do_halt_poll)
+ update_halt_poll_stats(
+ vcpu, ktime_to_ns(ktime_sub(poll_end, start)), waited);
if (halt_poll_allowed) {
if (!vcpu_valid_wakeup(vcpu)) {
--
2.33.0.882.g93a45727a2-goog
Add a comment to document that halt-polling is considered successful even
if the polling loop itself didn't detect a wake event, i.e. if a wake
event was detected in the final kvm_vcpu_check_block(). Invert the param
to the update helper so that the helper is a dumb function that is "told"
whether or not polling was successful, as opposed to determining success
based on blocking behavior.
Opportunistically tweak the params to the update helper to reduce the
line length for the call site so that it fits on a single line, and so
that the prototype conforms to the more traditional kernel style.
No functional change intended.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6156719bcbbc..4dfcd736b274 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3201,13 +3201,15 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
return ret;
}
-static inline void
-update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
+static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
+ ktime_t end, bool success)
{
- if (waited)
- vcpu->stat.generic.halt_poll_fail_ns += poll_ns;
- else
+ u64 poll_ns = ktime_to_ns(ktime_sub(end, start));
+
+ if (success)
vcpu->stat.generic.halt_poll_success_ns += poll_ns;
+ else
+ vcpu->stat.generic.halt_poll_fail_ns += poll_ns;
}
/*
@@ -3277,9 +3279,13 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_unblocking(vcpu);
block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
+ /*
+ * Note, halt-polling is considered successful so long as the vCPU was
+ * never actually scheduled out, i.e. even if the wake event arrived
+ * after the halt-polling loop itself, but before the full wait.
+ */
if (do_halt_poll)
- update_halt_poll_stats(
- vcpu, ktime_to_ns(ktime_sub(poll_end, start)), waited);
+ update_halt_poll_stats(vcpu, start, poll_end, !waited);
if (halt_poll_allowed) {
if (!vcpu_valid_wakeup(vcpu)) {
--
2.33.0.882.g93a45727a2-goog
Move the halt-polling "success" and histogram stats update into the
dedicated helper to fix a discrepancy where the success/fail "time" stats
consider polling successful so long as the wait is avoided, but the main
"success" and histogram stats consider polling successful if and only if
a wake event was detected by the halt-polling loop.
Move halt_attempted_poll to the helper as well so that all the stats are
updated in a single location. While it's a bit odd to update the stat
well after the fact, practically speaking there's no meaningful advantage
to updating before polling.
Note, there is a functional change in addition to the success vs. fail
change. The histogram updates previously called ktime_get() instead of
using "cur". But that change is desirable as it means all the stats are
now updated with the same polling time, and avoids the extra ktime_get(),
which isn't expensive but isn't free either.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 35 ++++++++++++++++-------------------
1 file changed, 16 insertions(+), 19 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4dfcd736b274..1292c7876d3f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3204,12 +3204,23 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
ktime_t end, bool success)
{
+ struct kvm_vcpu_stat_generic *stats = &vcpu->stat.generic;
u64 poll_ns = ktime_to_ns(ktime_sub(end, start));
- if (success)
- vcpu->stat.generic.halt_poll_success_ns += poll_ns;
- else
- vcpu->stat.generic.halt_poll_fail_ns += poll_ns;
+ ++vcpu->stat.generic.halt_attempted_poll;
+
+ if (success) {
+ ++vcpu->stat.generic.halt_successful_poll;
+
+ if (!vcpu_valid_wakeup(vcpu))
+ ++vcpu->stat.generic.halt_poll_invalid;
+
+ stats->halt_poll_success_ns += poll_ns;
+ KVM_STATS_LOG_HIST_UPDATE(stats->halt_poll_success_hist, poll_ns);
+ } else {
+ stats->halt_poll_fail_ns += poll_ns;
+ KVM_STATS_LOG_HIST_UPDATE(stats->halt_poll_fail_hist, poll_ns);
+ }
}
/*
@@ -3230,30 +3241,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (do_halt_poll) {
ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
- ++vcpu->stat.generic.halt_attempted_poll;
do {
/*
* This sets KVM_REQ_UNHALT if an interrupt
* arrives.
*/
- if (kvm_vcpu_check_block(vcpu) < 0) {
- ++vcpu->stat.generic.halt_successful_poll;
- if (!vcpu_valid_wakeup(vcpu))
- ++vcpu->stat.generic.halt_poll_invalid;
-
- KVM_STATS_LOG_HIST_UPDATE(
- vcpu->stat.generic.halt_poll_success_hist,
- ktime_to_ns(ktime_get()) -
- ktime_to_ns(start));
+ if (kvm_vcpu_check_block(vcpu) < 0)
goto out;
- }
cpu_relax();
poll_end = cur = ktime_get();
} while (kvm_vcpu_can_poll(cur, stop));
-
- KVM_STATS_LOG_HIST_UPDATE(
- vcpu->stat.generic.halt_poll_fail_hist,
- ktime_to_ns(ktime_get()) - ktime_to_ns(start));
}
--
2.33.0.882.g93a45727a2-goog
Move the put and reload of the vGIC out of the block/unblock callbacks
and into a dedicated WFI helper. Functionally, this is nearly a nop as
the block hook is called at the very beginning of kvm_vcpu_block(), and
the only code in kvm_vcpu_block() after the unblock hook is to update the
halt-polling controls, i.e. can only affect the next WFI.
Back when the arch (un)blocking hooks were added by commits 3217f7c25bca
("KVM: Add kvm_arch_vcpu_{un}blocking callbacks) and d35268da6687
("arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_block"),
the hooks were invoked only when KVM was about to "block", i.e. schedule
out the vCPU. The use case at the time was to schedule a timer in the
host based on the earliest timer in the guest in order to wake the
blocking vCPU when the emulated guest timer fired. Commit accb99bcd0ca
("KVM: arm/arm64: Simplify bg_timer programming") reworked the timer
logic to be even more precise, by waiting until the vCPU was actually
scheduled out, and so moved the timer logic from the (un)blocking hooks to
vcpu_load/put.
In the meantime, the hooks gained usage for enabling vGIC v4 doorbells in
commit df9ba95993b9 ("KVM: arm/arm64: GICv4: Use the doorbell interrupt
as an unblocking source"), and added related logic for the VMCR in commit
5eeaf10eec39 ("KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block").
Finally, commit 07ab0f8d9a12 ("KVM: Call kvm_arch_vcpu_blocking early
into the blocking sequence") hoisted the (un)blocking hooks so that they
wrapped KVM's halt-polling logic in addition to the core "block" logic.
In other words, the original need for arch hooks to take action _only_
in the block path is long since gone.
Cc: Oliver Upton <[email protected]>
Cc: Marc Zyngier <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/arm64/include/asm/kvm_emulate.h | 2 ++
arch/arm64/kvm/arm.c | 52 +++++++++++++++++++---------
arch/arm64/kvm/handle_exit.c | 3 +-
3 files changed, 38 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index fd418955e31e..de8b4f5922b7 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -41,6 +41,8 @@ void kvm_inject_vabt(struct kvm_vcpu *vcpu);
void kvm_inject_dabt(struct kvm_vcpu *vcpu, unsigned long addr);
void kvm_inject_pabt(struct kvm_vcpu *vcpu, unsigned long addr);
+void kvm_vcpu_wfi(struct kvm_vcpu *vcpu);
+
static __always_inline bool vcpu_el1_is_32bit(struct kvm_vcpu *vcpu)
{
return !(vcpu->arch.hcr_el2 & HCR_RW);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 7838e9fb693e..1346f81b34df 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -359,27 +359,12 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
{
- /*
- * If we're about to block (most likely because we've just hit a
- * WFI), we need to sync back the state of the GIC CPU interface
- * so that we have the latest PMR and group enables. This ensures
- * that kvm_arch_vcpu_runnable has up-to-date data to decide
- * whether we have pending interrupts.
- *
- * For the same reason, we want to tell GICv4 that we need
- * doorbells to be signalled, should an interrupt become pending.
- */
- preempt_disable();
- kvm_vgic_vmcr_sync(vcpu);
- vgic_v4_put(vcpu, true);
- preempt_enable();
+
}
void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
{
- preempt_disable();
- vgic_v4_load(vcpu);
- preempt_enable();
+
}
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
@@ -662,6 +647,39 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
smp_rmb();
}
+/**
+ * kvm_vcpu_wfi - emulate Wait-For-Interrupt behavior
+ * @vcpu: The VCPU pointer
+ *
+ * Suspend execution of a vCPU until a valid wake event is detected, i.e. until
+ * the vCPU is runnable. The vCPU may or may not be scheduled out, depending
+ * on when a wake event arrives, e.g. there may already be a pending wake event.
+ */
+void kvm_vcpu_wfi(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Sync back the state of the GIC CPU interface so that we have
+ * the latest PMR and group enables. This ensures that
+ * kvm_arch_vcpu_runnable has up-to-date data to decide whether
+ * we have pending interrupts, e.g. when determining if the
+ * vCPU should block.
+ *
+ * For the same reason, we want to tell GICv4 that we need
+ * doorbells to be signalled, should an interrupt become pending.
+ */
+ preempt_disable();
+ kvm_vgic_vmcr_sync(vcpu);
+ vgic_v4_put(vcpu, true);
+ preempt_enable();
+
+ kvm_vcpu_block(vcpu);
+ kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+
+ preempt_disable();
+ vgic_v4_load(vcpu);
+ preempt_enable();
+}
+
static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
{
return vcpu->arch.target >= 0;
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 275a27368a04..4794563a506b 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -95,8 +95,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
} else {
trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
vcpu->stat.wfi_exit_stat++;
- kvm_vcpu_block(vcpu);
- kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+ kvm_vcpu_wfi(vcpu);
}
kvm_incr_pc(vcpu);
--
2.33.0.882.g93a45727a2-goog
Invoke the arch hooks for block+unblock if and only if KVM actually
attempts to block the vCPU. The only non-nop implementation is on x86,
specifically SVM's AVIC, and there is no need to put the AVIC prior to
halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR
to find pending IRQs regardless of whether the AVIC is loaded/"running".
The primary motivation is to allow future cleanup to split out "block"
from "halt", but this is also likely a small performance boost on x86 SVM
when halt-polling is successful.
Adjust the post-block path to update "cur" after unblocking, i.e. include
AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior
is consistent. Moving just the pre-block arch hook would result in only
the AVIC put latency being included in the halt_wait stats. There is no
obvious evidence that one way or the other is correct, so just ensure KVM
is consistent.
Note, x86 has two separate paths for handling APICv with respect to vCPU
blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
hooks in kvm_vcpu_block(). Prior to this patch, the two paths were more
or less functionally identical. That is very much not the case after
this patch, as the hooks used by VMX _must_ fire before halt-polling.
x86's entire mess will be cleaned up in future patches.
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f90b3ed05628..227f6bbe0716 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
bool waited = false;
u64 block_ns;
- kvm_arch_vcpu_blocking(vcpu);
-
start = cur = poll_end = ktime_get();
if (do_halt_poll) {
ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
@@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
} while (kvm_vcpu_can_poll(cur, stop));
}
+ kvm_arch_vcpu_blocking(vcpu);
prepare_to_rcuwait(wait);
for (;;) {
@@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
schedule();
}
finish_rcuwait(wait);
+
+ kvm_arch_vcpu_unblocking(vcpu);
+
cur = ktime_get();
if (waited) {
vcpu->stat.generic.halt_wait_ns +=
@@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
ktime_to_ns(cur) - ktime_to_ns(poll_end));
}
out:
- kvm_arch_vcpu_unblocking(vcpu);
block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
/*
--
2.33.0.882.g93a45727a2-goog
Drop kvm_arch_vcpu_block_finish() now that all arch implementations are
nops.
No functional change intended.
Acked-by: Christian Borntraeger <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 1 -
arch/mips/include/asm/kvm_host.h | 1 -
arch/powerpc/include/asm/kvm_host.h | 1 -
arch/riscv/include/asm/kvm_host.h | 1 -
arch/s390/include/asm/kvm_host.h | 2 --
arch/s390/kvm/kvm-s390.c | 5 -----
arch/x86/include/asm/kvm_host.h | 2 --
virt/kvm/kvm_main.c | 1 -
8 files changed, 14 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 369c30e28301..fe4dec96d1c3 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -716,7 +716,6 @@ void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);
static inline void kvm_arch_hardware_unsetup(void) {}
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
-static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
void kvm_arm_init_debug(void);
void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 696f6b009377..72b90d45a46e 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -897,7 +897,6 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB
int kvm_arch_flush_remote_tlb(struct kvm *kvm);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 876c10803cda..4a195c161592 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -865,6 +865,5 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_exit(void) {}
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
#endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index d7e1696cd2ec..b3f0c3773603 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -209,7 +209,6 @@ struct kvm_vcpu_arch {
static inline void kvm_arch_hardware_unsetup(void) {}
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
-static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
#define KVM_ARCH_WANT_MMU_NOTIFIER
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a604d51acfc8..a22c9266ea05 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1010,6 +1010,4 @@ static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
-void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu);
-
#endif
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 08ed68639a21..17fabb260c35 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -5080,11 +5080,6 @@ static inline unsigned long nonhyp_mask(int i)
return 0x0000ffffffffffffUL >> (nonhyp_fai << 4);
}
-void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu)
-{
-
-}
-
static int __init kvm_s390_init(void)
{
int i;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88f0326c184a..7aafc27ce7a9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1926,8 +1926,6 @@ static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
static_call_cond(kvm_x86_vcpu_unblocking)(vcpu);
}
-static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
-
static inline int kvm_cpu_get_apicid(int mps_cpu)
{
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1292c7876d3f..f90b3ed05628 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3304,7 +3304,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
}
trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu));
- kvm_arch_vcpu_block_finish(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_block);
--
2.33.0.882.g93a45727a2-goog
Rename a variety of HLT-related helpers to free up the function name
"kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
between "block" and "halt".
No functional change intended.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 4 ++--
arch/x86/kvm/x86.c | 13 +++++++------
4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7aafc27ce7a9..328103a520d3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1689,7 +1689,7 @@ int kvm_emulate_monitor(struct kvm_vcpu *vcpu);
int kvm_fast_pio(struct kvm_vcpu *vcpu, int size, unsigned short port, int in);
int kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
int kvm_emulate_halt(struct kvm_vcpu *vcpu);
-int kvm_vcpu_halt(struct kvm_vcpu *vcpu);
+int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu);
int kvm_emulate_ap_reset_hold(struct kvm_vcpu *vcpu);
int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index af1bbb73430a..d0237a441feb 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3619,7 +3619,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
!(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) &&
(vmcs12->guest_rflags & X86_EFLAGS_IF))) {
vmx->nested.nested_run_pending = 0;
- return kvm_vcpu_halt(vcpu);
+ return kvm_emulate_halt_noskip(vcpu);
}
break;
case GUEST_ACTIVITY_WAIT_SIPI:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1c8b2b6e7ed9..5517893f12fc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4741,7 +4741,7 @@ static int handle_rmode_exception(struct kvm_vcpu *vcpu,
if (kvm_emulate_instruction(vcpu, 0)) {
if (vcpu->arch.halt_request) {
vcpu->arch.halt_request = 0;
- return kvm_vcpu_halt(vcpu);
+ return kvm_emulate_halt_noskip(vcpu);
}
return 1;
}
@@ -5415,7 +5415,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
if (vcpu->arch.halt_request) {
vcpu->arch.halt_request = 0;
- return kvm_vcpu_halt(vcpu);
+ return kvm_emulate_halt_noskip(vcpu);
}
/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4a52a08707de..9c23ae1d483d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8649,7 +8649,7 @@ void kvm_arch_exit(void)
#endif
}
-static int __kvm_vcpu_halt(struct kvm_vcpu *vcpu, int state, int reason)
+static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
{
++vcpu->stat.halt_exits;
if (lapic_in_kernel(vcpu)) {
@@ -8661,11 +8661,11 @@ static int __kvm_vcpu_halt(struct kvm_vcpu *vcpu, int state, int reason)
}
}
-int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
+int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu)
{
- return __kvm_vcpu_halt(vcpu, KVM_MP_STATE_HALTED, KVM_EXIT_HLT);
+ return __kvm_emulate_halt(vcpu, KVM_MP_STATE_HALTED, KVM_EXIT_HLT);
}
-EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
+EXPORT_SYMBOL_GPL(kvm_emulate_halt_noskip);
int kvm_emulate_halt(struct kvm_vcpu *vcpu)
{
@@ -8674,7 +8674,7 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu)
* TODO: we might be squashing a GUESTDBG_SINGLESTEP-triggered
* KVM_EXIT_DEBUG here.
*/
- return kvm_vcpu_halt(vcpu) && ret;
+ return kvm_emulate_halt_noskip(vcpu) && ret;
}
EXPORT_SYMBOL_GPL(kvm_emulate_halt);
@@ -8682,7 +8682,8 @@ int kvm_emulate_ap_reset_hold(struct kvm_vcpu *vcpu)
{
int ret = kvm_skip_emulated_instruction(vcpu);
- return __kvm_vcpu_halt(vcpu, KVM_MP_STATE_AP_RESET_HOLD, KVM_EXIT_AP_RESET_HOLD) && ret;
+ return __kvm_emulate_halt(vcpu, KVM_MP_STATE_AP_RESET_HOLD,
+ KVM_EXIT_AP_RESET_HOLD) && ret;
}
EXPORT_SYMBOL_GPL(kvm_emulate_ap_reset_hold);
--
2.33.0.882.g93a45727a2-goog
Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting
the actual "block" sequences into a separate helper (to be named
kvm_vcpu_block()). x86 will use the standalone block-only path to handle
non-halt cases where the vCPU is not runnable.
Rename block_ns to halt_ns to match the new function name.
No functional change intended.
Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/arm64/kvm/arch_timer.c | 2 +-
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/handle_exit.c | 2 +-
arch/arm64/kvm/psci.c | 2 +-
arch/mips/kvm/emulate.c | 2 +-
arch/powerpc/kvm/book3s_pr.c | 2 +-
arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
arch/powerpc/kvm/booke.c | 2 +-
arch/powerpc/kvm/powerpc.c | 2 +-
arch/riscv/kvm/vcpu_exit.c | 2 +-
arch/s390/kvm/interrupt.c | 2 +-
arch/x86/kvm/x86.c | 11 +++++++++--
include/linux/kvm_host.h | 2 +-
virt/kvm/kvm_main.c | 20 +++++++++-----------
14 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 3df67c127489..7e8396f74010 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -467,7 +467,7 @@ static void timer_save_state(struct arch_timer_context *ctx)
}
/*
- * Schedule the background timer before calling kvm_vcpu_block, so that this
+ * Schedule the background timer before calling kvm_vcpu_halt, so that this
* thread is removed from its waitqueue and made runnable when there's a timer
* interrupt to handle.
*/
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1346f81b34df..268b1e7bf700 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -672,7 +672,7 @@ void kvm_vcpu_wfi(struct kvm_vcpu *vcpu)
vgic_v4_put(vcpu, true);
preempt_enable();
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
preempt_disable();
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 4794563a506b..6d0baf71aa67 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -82,7 +82,7 @@ static int handle_no_fpsimd(struct kvm_vcpu *vcpu)
*
* WFE: Yield the CPU and come back to this vcpu when the scheduler
* decides to.
- * WFI: Simply call kvm_vcpu_block(), which will halt execution of
+ * WFI: Simply call kvm_vcpu_halt(), which will halt execution of
* world-switches and schedule other host processes until there is an
* incoming IRQ or FIQ to the VM.
*/
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index 74c47d420253..e275b2ca08b9 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -46,7 +46,7 @@ static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu)
* specification (ARM DEN 0022A). This means all suspend states
* for KVM will preserve the register state.
*/
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
return PSCI_RET_SUCCESS;
diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
index 22e745e49b0a..b494d8d39290 100644
--- a/arch/mips/kvm/emulate.c
+++ b/arch/mips/kvm/emulate.c
@@ -952,7 +952,7 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu)
if (!vcpu->arch.pending_exceptions) {
kvm_vz_lose_htimer(vcpu);
vcpu->arch.wait = 1;
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
/*
* We we are runnable, then definitely go off to user space to
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 6bc9425acb32..0ced1b16f0e5 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -492,7 +492,7 @@ static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr)
if (msr & MSR_POW) {
if (!vcpu->arch.pending_exceptions) {
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
vcpu->stat.generic.halt_wakeup++;
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index ac14239f3424..1f10e7dfcdd0 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -376,7 +376,7 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_stuff_tce(vcpu);
case H_CEDE:
kvmppc_set_msr_fast(vcpu, kvmppc_get_msr(vcpu) | MSR_EE);
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
vcpu->stat.generic.halt_wakeup++;
return EMULATE_DONE;
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 977801c83aff..12abffa40cd9 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -718,7 +718,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
if (vcpu->arch.shared->msr & MSR_WE) {
local_irq_enable();
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
hard_irq_disable();
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index be22da157569..6a94545b99fc 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -236,7 +236,7 @@ int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
break;
case EV_HCALL_TOKEN(EV_IDLE):
r = EV_SUCCESS;
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
break;
default:
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index 13bbc3f73713..949bb9828aa5 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -146,7 +146,7 @@ static int system_opcode_insn(struct kvm_vcpu *vcpu,
vcpu->stat.wfi_exit_stat++;
if (!kvm_arch_vcpu_runnable(vcpu)) {
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
}
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 520450a7956f..10bd648170b7 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -1335,7 +1335,7 @@ int kvm_s390_handle_wait(struct kvm_vcpu *vcpu)
VCPU_EVENT(vcpu, 4, "enabled wait: %llu ns", sltime);
no_timer:
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
vcpu->valid_wakeup = false;
__unset_cpu_idle(vcpu);
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9c23ae1d483d..e6c17bbed25c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8651,6 +8651,13 @@ void kvm_arch_exit(void)
static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
{
+ /*
+ * The vCPU has halted, e.g. executed HLT. Update the run state if the
+ * local APIC is in-kernel, the run loop will detect the non-runnable
+ * state and halt the vCPU. Exit to userspace if the local APIC is
+ * managed by userspace, in which case userspace is responsible for
+ * handling wake events.
+ */
++vcpu->stat.halt_exits;
if (lapic_in_kernel(vcpu)) {
vcpu->arch.mp_state = state;
@@ -9892,7 +9899,7 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
if (!kvm_arch_vcpu_runnable(vcpu) &&
(!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
if (kvm_x86_ops.post_block)
@@ -10126,7 +10133,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
r = -EINTR;
goto out;
}
- kvm_vcpu_block(vcpu);
+ kvm_vcpu_halt(vcpu);
if (kvm_apic_accept_events(vcpu) < 0) {
r = 0;
goto out;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1ced2914d9ca..c2ea4004553a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -967,7 +967,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
void kvm_sigset_activate(struct kvm_vcpu *vcpu);
void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
-void kvm_vcpu_block(struct kvm_vcpu *vcpu);
+void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 227f6bbe0716..c13bf3367fda 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3223,17 +3223,14 @@ static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
}
}
-/*
- * The vCPU has executed a HLT instruction with in-kernel mode enabled.
- */
-void kvm_vcpu_block(struct kvm_vcpu *vcpu)
+void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
{
struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
ktime_t start, cur, poll_end;
bool waited = false;
- u64 block_ns;
+ u64 halt_ns;
start = cur = poll_end = ktime_get();
if (do_halt_poll) {
@@ -3275,7 +3272,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
ktime_to_ns(cur) - ktime_to_ns(poll_end));
}
out:
- block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
+ /* The total time the vCPU was "halted", including polling time. */
+ halt_ns = ktime_to_ns(cur) - ktime_to_ns(start);
/*
* Note, halt-polling is considered successful so long as the vCPU was
@@ -3289,24 +3287,24 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (!vcpu_valid_wakeup(vcpu)) {
shrink_halt_poll_ns(vcpu);
} else if (vcpu->kvm->max_halt_poll_ns) {
- if (block_ns <= vcpu->halt_poll_ns)
+ if (halt_ns <= vcpu->halt_poll_ns)
;
/* we had a long block, shrink polling */
else if (vcpu->halt_poll_ns &&
- block_ns > vcpu->kvm->max_halt_poll_ns)
+ halt_ns > vcpu->kvm->max_halt_poll_ns)
shrink_halt_poll_ns(vcpu);
/* we had a short halt and our poll time is too small */
else if (vcpu->halt_poll_ns < vcpu->kvm->max_halt_poll_ns &&
- block_ns < vcpu->kvm->max_halt_poll_ns)
+ halt_ns < vcpu->kvm->max_halt_poll_ns)
grow_halt_poll_ns(vcpu);
} else {
vcpu->halt_poll_ns = 0;
}
}
- trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu));
+ trace_kvm_vcpu_wakeup(halt_ns, waited, vcpu_valid_wakeup(vcpu));
}
-EXPORT_SYMBOL_GPL(kvm_vcpu_block);
+EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
{
--
2.33.0.882.g93a45727a2-goog
Go directly to kvm_vcpu_block() when handling the case where userspace
attempts to run an UNINITIALIZED vCPU. The vCPU is not halted, nor is it
likely that halt-polling will be successful in this case.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e6c17bbed25c..cd51f100e906 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10133,7 +10133,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
r = -EINTR;
goto out;
}
- kvm_vcpu_halt(vcpu);
+ kvm_vcpu_block(vcpu);
if (kvm_apic_accept_events(vcpu) < 0) {
r = 0;
goto out;
--
2.33.0.882.g93a45727a2-goog
Calculate the halt-polling "stop" time using "cur" instead of redoing
ktime_get(). In the happy case where hardware correctly predicts
do_halt_poll, "cur" is only a few cycles old. And if the branch is
mispredicted, arguably that extra latency should count toward the
halt-polling time.
In all likelihood, the numbers involved are in the noise and either
approach is perfectly ok.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a36ccdc93a72..481e8178b43d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3272,7 +3272,7 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
start = cur = poll_end = ktime_get();
if (do_halt_poll) {
- ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
+ ktime_t stop = ktime_add_ns(cur, vcpu->halt_poll_ns);
do {
/*
--
2.33.0.882.g93a45727a2-goog
Call kvm_vcpu_block() directly for all wait states except HALTED so that
kvm_vcpu_halt() is no longer a misnomer on x86.
Functionally, this means KVM will never attempt halt-polling or adjust
vcpu->halt_poll_ns for INIT_RECEIVED (a.k.a. Wait-For-SIPI (WFS)) or
AP_RESET_HOLD; UNINITIALIZED is handled in kvm_arch_vcpu_ioctl_run(),
and x86 doesn't use any other "wait" states.
As mentioned above, the motivation of this is purely so that "halt" isn't
overloaded on x86, e.g. in KVM's stats. Skipping halt-polling for WFS
(and RESET_HOLD) has no meaningful effect on guest performance as there
are typically single-digit numbers of INIT-SIPI sequences per AP vCPU,
per boot, versus thousands of HLTs just to boot to console.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd51f100e906..e0219acfd9cf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9899,7 +9899,10 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
if (!kvm_arch_vcpu_runnable(vcpu) &&
(!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
- kvm_vcpu_halt(vcpu);
+ if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
+ kvm_vcpu_halt(vcpu);
+ else
+ kvm_vcpu_block(vcpu);
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
if (kvm_x86_ops.post_block)
--
2.33.0.882.g93a45727a2-goog
Factor out the "block" part of kvm_vcpu_halt() so that x86 can emulate
non-halt wait/sleep/block conditions that should not be subjected to
halt-polling.
No functional change intended.
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 52 +++++++++++++++++++++++++++-------------
2 files changed, 37 insertions(+), 16 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c2ea4004553a..2d837e06eeec 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -968,6 +968,7 @@ void kvm_sigset_activate(struct kvm_vcpu *vcpu);
void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
+bool kvm_vcpu_block(struct kvm_vcpu *vcpu);
void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c13bf3367fda..42894ff7c474 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3201,6 +3201,35 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
return ret;
}
+/*
+ * Block the vCPU until the vCPU is runnable, an event arrives, or a signal is
+ * pending. This is mostly used when halting a vCPU, but may also be used
+ * directly for other vCPU non-runnable states, e.g. x86's Wait-For-SIPI.
+ */
+bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
+{
+ struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
+ bool waited = false;
+
+ kvm_arch_vcpu_blocking(vcpu);
+
+ prepare_to_rcuwait(wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (kvm_vcpu_check_block(vcpu) < 0)
+ break;
+
+ waited = true;
+ schedule();
+ }
+ finish_rcuwait(wait);
+
+ kvm_arch_vcpu_unblocking(vcpu);
+
+ return waited;
+}
+
static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
ktime_t end, bool success)
{
@@ -3223,9 +3252,14 @@ static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
}
}
+/*
+ * Emulate a vCPU halt condition, e.g. HLT on x86, WFI on arm, etc... If halt
+ * polling is enabled, busy wait for a short time before blocking to avoid the
+ * expensive block+unblock sequence if a wake event arrives soon after the vCPU
+ * is halted.
+ */
void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
{
- struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
ktime_t start, cur, poll_end;
@@ -3248,21 +3282,7 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
} while (kvm_vcpu_can_poll(cur, stop));
}
- kvm_arch_vcpu_blocking(vcpu);
-
- prepare_to_rcuwait(wait);
- for (;;) {
- set_current_state(TASK_INTERRUPTIBLE);
-
- if (kvm_vcpu_check_block(vcpu) < 0)
- break;
-
- waited = true;
- schedule();
- }
- finish_rcuwait(wait);
-
- kvm_arch_vcpu_unblocking(vcpu);
+ waited = kvm_vcpu_block(vcpu);
cur = ktime_get();
if (waited) {
--
2.33.0.882.g93a45727a2-goog
Explicitly skip posted interrupt updates if APICv is disabled in all of
KVM, or if the guest doesn't have an in-kernel APIC. The PI descriptor
is kept up-to-date if APICv is inhibited, e.g. so that re-enabling APICv
doesn't require a bunch of updates, but neither the module param nor the
APIC type can be changed on-the-fly.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 3263056784f5..351666c41bbc 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -28,11 +28,14 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
unsigned int dest;
/*
- * In case of hot-plug or hot-unplug, we may have to undo
- * vmx_vcpu_pi_put even if there is no assigned device. And we
- * always keep PI.NDST up to date for simplicity: it makes the
- * code easier, and CPU migration is not a fast path.
+ * To simplify hot-plug and dynamic toggling of APICv, keep PI.NDST and
+ * PI.SN up-to-date even if there is no assigned device or if APICv is
+ * deactivated due to a dynamic inhibit bit, e.g. for Hyper-V's SynIC.
*/
+ if (!enable_apicv || !lapic_in_kernel(vcpu))
+ return;
+
+ /* Nothing to do if PI.SN==0 and the vCPU isn't being migrated. */
if (!pi_test_sn(pi_desc) && vcpu->cpu == cpu)
return;
--
2.33.0.882.g93a45727a2-goog
Drop sanity checks on the validity of the previous pCPU when handling
vCPU block/unblock for posted interrupts. Barring a code bug or memory
corruption, the sanity checks will never fire, and any code bug that does
trip the WARN is all but guaranteed to completely break posted interrupts,
i.e. should never get anywhere near production.
This is the first of several steps toward eliminating kvm_vcpu.pre_pcpu.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 67cbe6ab8f66..6c2110d91b06 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -118,12 +118,10 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
} while (cmpxchg64(&pi_desc->control, old.control,
new.control) != old.control);
- if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
- spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
- list_del(&vcpu->blocked_vcpu_list);
- spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
- vcpu->pre_pcpu = -1;
- }
+ spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
+ list_del(&vcpu->blocked_vcpu_list);
+ spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
+ vcpu->pre_pcpu = -1;
}
/*
@@ -153,14 +151,12 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
WARN_ON(irqs_disabled());
local_irq_disable();
- if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
- vcpu->pre_pcpu = vcpu->cpu;
- spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
- list_add_tail(&vcpu->blocked_vcpu_list,
- &per_cpu(blocked_vcpu_on_cpu,
- vcpu->pre_pcpu));
- spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
- }
+
+ vcpu->pre_pcpu = vcpu->cpu;
+ spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
+ list_add_tail(&vcpu->blocked_vcpu_list,
+ &per_cpu(blocked_vcpu_on_cpu, vcpu->pre_pcpu));
+ spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
WARN(pi_desc->sn == 1,
"Posted Interrupt Suppress Notification set before blocking");
--
2.33.0.882.g93a45727a2-goog
Return bools instead of ints for the posted interrupt "test" helpers.
The bit position of the flag being tested does not matter to the callers,
and is in fact lost by virtue of test_bit() itself returning a bool.
Returning ints is potentially dangerous, e.g. "pi_test_on(pi_desc) == 1"
is safe-ish because ON is bit 0 and thus any sane implementation of
pi_test_on() will work, but for SN (bit 1), checking "== 1" would rely on
pi_test_sn() to return 0 or 1, a.k.a. bools, as opposed to 0 or 2 (the
positive bit position).
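As a standalone illustration (not KVM code; the helper names and bit layout
below are invented for the example), a bit-1 "test" helper that returns the
raw masked value breaks "== 1" comparisons, whereas a bool return collapses
any non-zero result to 1:

  #include <stdbool.h>
  #include <stdio.h>

  #define SN_BIT  1       /* stand-in for POSTED_INTR_SN (bit 1) */

  /* A plausible int-returning helper: returns the masked value, 0 or 2. */
  static int test_sn_raw(unsigned long control)
  {
          return control & (1UL << SN_BIT);
  }

  /* The bool version collapses any non-zero result to 1. */
  static bool test_sn_bool(unsigned long control)
  {
          return control & (1UL << SN_BIT);
  }

  int main(void)
  {
          unsigned long control = 1UL << SN_BIT;  /* SN set, ON clear */

          printf("raw  == 1: %d\n", test_sn_raw(control) == 1);   /* prints 0 */
          printf("bool == 1: %d\n", test_sn_bool(control) == 1);  /* prints 1 */
          return 0;
  }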
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 4 ++--
arch/x86/kvm/vmx/posted_intr.h | 6 +++---
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 6c2110d91b06..1688f8dc535a 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -185,7 +185,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
new.control) != old.control);
/* We should not block the vCPU if an interrupt is posted for it. */
- if (pi_test_on(pi_desc) == 1)
+ if (pi_test_on(pi_desc))
__pi_post_block(vcpu);
local_irq_enable();
@@ -216,7 +216,7 @@ void pi_wakeup_handler(void)
blocked_vcpu_list) {
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- if (pi_test_on(pi_desc) == 1)
+ if (pi_test_on(pi_desc))
kvm_vcpu_kick(vcpu);
}
spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 7f7b2326caf5..36ae035f14aa 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -40,7 +40,7 @@ static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
}
-static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
+static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
{
return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
}
@@ -74,13 +74,13 @@ static inline void pi_clear_sn(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
}
-static inline int pi_test_on(struct pi_desc *pi_desc)
+static inline bool pi_test_on(struct pi_desc *pi_desc)
{
return test_bit(POSTED_INTR_ON,
(unsigned long *)&pi_desc->control);
}
-static inline int pi_test_sn(struct pi_desc *pi_desc)
+static inline bool pi_test_sn(struct pi_desc *pi_desc)
{
return test_bit(POSTED_INTR_SN,
(unsigned long *)&pi_desc->control);
--
2.33.0.882.g93a45727a2-goog
Move the clearing of valid_wakeup from kvm_arch_vcpu_block_finish() so
that a future patch can drop said arch hook. Unlike the other blocking-
related arch hooks, vcpu_blocking/unblocking(), vcpu_block_finish() needs
to be called even if KVM doesn't actually block the vCPU. This will
allow future patches to differentiate between truly blocking the vCPU and
emulating a halt condition without introducing a contradiction.
Alternatively, the hook could be renamed to kvm_arch_vcpu_halt_finish(),
but there's literally one call site in s390, and future cleanup can also
be done to handle valid_wakeup fully within kvm_s390_handle_wait() and
allow generic KVM to drop vcpu_valid_wakeup().
No functional change intended.
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/s390/kvm/interrupt.c | 1 +
arch/s390/kvm/kvm-s390.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 10722455fd02..520450a7956f 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -1336,6 +1336,7 @@ int kvm_s390_handle_wait(struct kvm_vcpu *vcpu)
no_timer:
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
kvm_vcpu_block(vcpu);
+ vcpu->valid_wakeup = false;
__unset_cpu_idle(vcpu);
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 7cabe6778b1b..08ed68639a21 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -5082,7 +5082,7 @@ static inline unsigned long nonhyp_mask(int i)
void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu)
{
- vcpu->valid_wakeup = false;
+
}
static int __init kvm_s390_init(void)
--
2.33.0.882.g93a45727a2-goog
Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the
pre-block path, as NDST is guaranteed to be up-to-date. The comment
about the vCPU being preempted during the update is simply wrong, as the
update path runs with IRQs disabled (from before snapshotting vcpu->cpu,
until after the update completes).
The vCPU can get preempted _before_ the update starts, but not during.
And if the vCPU is preempted before, vmx_vcpu_pi_load() is responsible
for updating NDST when the vCPU is scheduled back in. In that case, the
check against the wakeup vector in vmx_vcpu_pi_load() cannot be true as
that would require the notification vector to have been set to the wakeup
vector _before_ blocking.
Opportunistically switch to using vcpu->cpu for the list/lock lookups,
which presumably used pre_pcpu only for some phantom preemption logic.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 23 +++--------------------
1 file changed, 3 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 1688f8dc535a..239e0e72a0dd 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -130,7 +130,6 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
* - Store the vCPU to the wakeup list, so when interrupts happen
* we can find the right vCPU to wake up.
* - Change the Posted-interrupt descriptor as below:
- * 'NDST' <-- vcpu->pre_pcpu
* 'NV' <-- POSTED_INTR_WAKEUP_VECTOR
* - If 'ON' is set during this process, which means at least one
* interrupt is posted for this vCPU, we cannot block it, in
@@ -139,7 +138,6 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
*/
int pi_pre_block(struct kvm_vcpu *vcpu)
{
- unsigned int dest;
struct pi_desc old, new;
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
@@ -153,10 +151,10 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
local_irq_disable();
vcpu->pre_pcpu = vcpu->cpu;
- spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
+ spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
list_add_tail(&vcpu->blocked_vcpu_list,
- &per_cpu(blocked_vcpu_on_cpu, vcpu->pre_pcpu));
- spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
+ &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
+ spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
WARN(pi_desc->sn == 1,
"Posted Interrupt Suppress Notification set before blocking");
@@ -164,21 +162,6 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
do {
old.control = new.control = pi_desc->control;
- /*
- * Since vCPU can be preempted during this process,
- * vcpu->cpu could be different with pre_pcpu, we
- * need to set pre_pcpu as the destination of wakeup
- * notification event, then we can find the right vCPU
- * to wakeup in wakeup handler if interrupts happen
- * when the vCPU is in blocked state.
- */
- dest = cpu_physical_id(vcpu->pre_pcpu);
-
- if (x2apic_mode)
- new.ndst = dest;
- else
- new.ndst = (dest << 8) & 0xFF00;
-
/* set 'NV' to 'wakeup vector' */
new.nv = POSTED_INTR_WAKEUP_VECTOR;
} while (cmpxchg64(&pi_desc->control, old.control,
--
2.33.0.882.g93a45727a2-goog
Save/restore IRQs (instead of unconditionally disabling/enabling them) in the
posted interrupt pre/post block paths, in preparation for moving the code into
vcpu_put/load(), where it may be called with IRQs already disabled.
No functional change intended.
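As a rough, standalone model of why this matters (the local_irq_*() functions
below are toy stand-ins for the kernel primitives, not the real
implementations), save/restore composes when the caller has already disabled
IRQs, whereas an unconditional enable would re-enable them behind the caller's
back:

  #include <assert.h>
  #include <stdbool.h>

  static bool irqs_enabled = true;

  static void local_irq_disable(void) { irqs_enabled = false; }
  static void local_irq_enable(void)  { irqs_enabled = true; }

  static bool local_irq_save(void)
  {
          bool was_enabled = irqs_enabled;

          irqs_enabled = false;
          return was_enabled;
  }

  static void local_irq_restore(bool was_enabled)
  {
          irqs_enabled = was_enabled;
  }

  static void pi_path(void)
  {
          bool flags = local_irq_save();

          /* ... update the PI descriptor / wakeup list ... */
          local_irq_restore(flags);
  }

  int main(void)
  {
          /* Caller (think vcpu_put/load()) already runs with IRQs off. */
          local_irq_disable();
          pi_path();
          /* Had pi_path() used disable/enable, IRQs would now be on. */
          assert(!irqs_enabled);
          local_irq_enable();
          return 0;
  }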
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 239e0e72a0dd..414ea6972b5c 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -140,6 +140,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
{
struct pi_desc old, new;
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+ unsigned long flags;
if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
!irq_remapping_cap(IRQ_POSTING_CAP) ||
@@ -147,8 +148,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
vmx_interrupt_blocked(vcpu))
return 0;
- WARN_ON(irqs_disabled());
- local_irq_disable();
+ local_irq_save(flags);
vcpu->pre_pcpu = vcpu->cpu;
spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
@@ -171,19 +171,20 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
if (pi_test_on(pi_desc))
__pi_post_block(vcpu);
- local_irq_enable();
+ local_irq_restore(flags);
return (vcpu->pre_pcpu == -1);
}
void pi_post_block(struct kvm_vcpu *vcpu)
{
+ unsigned long flags;
+
if (vcpu->pre_pcpu == -1)
return;
- WARN_ON(irqs_disabled());
- local_irq_disable();
+ local_irq_save(flags);
__pi_post_block(vcpu);
- local_irq_enable();
+ local_irq_restore(flags);
}
/*
--
2.33.0.882.g93a45727a2-goog
Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 --
arch/x86/include/asm/kvm_host.h | 12 ------------
arch/x86/kvm/vmx/vmx.c | 13 -------------
arch/x86/kvm/x86.c | 6 +-----
4 files changed, 1 insertion(+), 32 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..c2b007171abd 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -96,8 +96,6 @@ KVM_X86_OP(handle_exit_irqoff)
KVM_X86_OP_NULL(request_immediate_exit)
KVM_X86_OP(sched_in)
KVM_X86_OP_NULL(update_cpu_dirty_logging)
-KVM_X86_OP_NULL(pre_block)
-KVM_X86_OP_NULL(post_block)
KVM_X86_OP_NULL(vcpu_blocking)
KVM_X86_OP_NULL(vcpu_unblocking)
KVM_X86_OP_NULL(update_pi_irte)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 328103a520d3..76a8dddc1a48 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1445,18 +1445,6 @@ struct kvm_x86_ops {
const struct kvm_pmu_ops *pmu_ops;
const struct kvm_x86_nested_ops *nested_ops;
- /*
- * Architecture specific hooks for vCPU blocking due to
- * HLT instruction.
- * Returns for .pre_block():
- * - 0 means continue to block the vCPU.
- * - 1 means we cannot block the vCPU since some event
- * happens during this period, such as, 'ON' bit in
- * posted-interrupts descriptor is set.
- */
- int (*pre_block)(struct kvm_vcpu *vcpu);
- void (*post_block)(struct kvm_vcpu *vcpu);
-
void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a24f19874716..13e732a818f3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7462,16 +7462,6 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
}
-static int vmx_pre_block(struct kvm_vcpu *vcpu)
-{
- return 0;
-}
-
-static void vmx_post_block(struct kvm_vcpu *vcpu)
-{
-
-}
-
static void vmx_setup_mce(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.mcg_cap & MCG_LMCE_P)
@@ -7665,9 +7655,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
- .pre_block = vmx_pre_block,
- .post_block = vmx_post_block,
-
.pmu_ops = &intel_pmu_ops,
.nested_ops = &vmx_nested_ops,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 909e932a7ae7..9643f23c28c7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9898,8 +9898,7 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
bool hv_timer;
- if (!kvm_arch_vcpu_runnable(vcpu) &&
- (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
+ if (!kvm_arch_vcpu_runnable(vcpu)) {
/*
* Switch to the software timer before halt-polling/blocking as
* the guest's timer may be a break event for the vCPU, and the
@@ -9921,9 +9920,6 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
if (hv_timer)
kvm_lapic_switch_to_hv_timer(vcpu);
- if (kvm_x86_ops.post_block)
- static_call(kvm_x86_post_block)(vcpu);
-
if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
return 1;
}
--
2.33.0.882.g93a45727a2-goog
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
rename the list and all associated variables to clarify that it tracks
the set of vCPU that need to be poked on a posted interrupt to the wakeup
vector. The list is not used to track _all_ vCPUs that are blocking, and
the term "blocked" can be misleading as it may refer to a blocking
condition in the host or the guest, where as the PI wakeup case is
specifically for the vCPUs that are actively blocking from within the
guest.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 39 +++++++++++++++++-----------------
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/vmx/vmx.h | 3 +++
include/linux/kvm_host.h | 2 --
virt/kvm/kvm_main.c | 2 --
5 files changed, 25 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index d2b3d75c57d1..f1bcf8c32b6d 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -18,7 +18,7 @@
* wake the target vCPUs. vCPUs are removed from the list and the notification
* vector is reset when the vCPU is scheduled in.
*/
-static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
/*
* Protect the per-CPU list with a per-CPU spinlock to handle task migration.
* When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
@@ -26,7 +26,7 @@ static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
* CPU. IRQs must be disabled when taking this lock, otherwise deadlock will
* occur if a wakeup IRQ arrives and attempts to acquire the lock.
*/
-static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+static DEFINE_PER_CPU(spinlock_t, wakeup_vcpus_on_cpu_lock);
static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
@@ -36,6 +36,7 @@ static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
{
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
struct pi_desc old, new;
unsigned long flags;
unsigned int dest;
@@ -71,9 +72,9 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
* current pCPU if the task was migrated.
*/
if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
- spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
- list_del(&vcpu->blocked_vcpu_list);
- spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
+ spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+ list_del(&vmx->pi_wakeup_list);
+ spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
}
dest = cpu_physical_id(cpu);
@@ -121,15 +122,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
struct pi_desc old, new;
unsigned long flags;
local_irq_save(flags);
- spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
- list_add_tail(&vcpu->blocked_vcpu_list,
- &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
- spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
+ spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+ list_add_tail(&vmx->pi_wakeup_list,
+ &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
+ spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
@@ -182,24 +184,23 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
*/
void pi_wakeup_handler(void)
{
- struct kvm_vcpu *vcpu;
int cpu = smp_processor_id();
+ struct vcpu_vmx *vmx;
- spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
- list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
- blocked_vcpu_list) {
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+ spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
+ list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_on_cpu, cpu),
+ pi_wakeup_list) {
- if (pi_test_on(pi_desc))
- kvm_vcpu_kick(vcpu);
+ if (pi_test_on(&vmx->pi_desc))
+ kvm_vcpu_kick(&vmx->vcpu);
}
- spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+ spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
}
void __init pi_init_cpu(int cpu)
{
- INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
- spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+ INIT_LIST_HEAD(&per_cpu(wakeup_vcpus_on_cpu, cpu));
+ spin_lock_init(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
}
bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 26ed8cd1a1f2..b3bb2031a7ac 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6848,6 +6848,8 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
vmx = to_vmx(vcpu);
+ INIT_LIST_HEAD(&vmx->pi_wakeup_list);
+
err = -ENOMEM;
vmx->vpid = allocate_vpid();
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 592217fd7d92..d1a720be9a64 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -298,6 +298,9 @@ struct vcpu_vmx {
/* Posted interrupt descriptor */
struct pi_desc pi_desc;
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+
/* Support for a guest hypervisor (nested VMX) */
struct nested_vmx nested;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 87996b22e681..c5961a361c73 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -304,8 +304,6 @@ struct kvm_vcpu {
u64 requests;
unsigned long guest_debug;
- struct list_head blocked_vcpu_list;
-
struct mutex mutex;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2bbf5c9d410f..c1850b60f38b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -426,8 +426,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
#endif
kvm_async_pf_vcpu_init(vcpu);
- INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
-
kvm_vcpu_set_in_spin_loop(vcpu, false);
kvm_vcpu_set_dy_eligible(vcpu, false);
vcpu->preempted = false;
--
2.33.0.882.g93a45727a2-goog
Handle the switch to/from the hypervisor/software timer when a vCPU is
blocking in common x86 instead of in VMX. Even though VMX is the only
user of a hypervisor timer, the logic and all functions involved are
generic x86 (unless future CPUs do something completely different and
implement a hypervisor timer that runs regardless of mode).
Handling the switch in common x86 will allow for the elimination of the
pre/post_block hooks, and also lets KVM switch back to the hypervisor
timer if and only if it was in use (without additional params). Add a
comment explaining why the switch cannot be deferred to kvm_sched_out()
or kvm_vcpu_block().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 6 +-----
arch/x86/kvm/x86.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3bb2031a7ac..a24f19874716 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7464,16 +7464,12 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
static int vmx_pre_block(struct kvm_vcpu *vcpu)
{
- if (kvm_lapic_hv_timer_in_use(vcpu))
- kvm_lapic_switch_to_sw_timer(vcpu);
-
return 0;
}
static void vmx_post_block(struct kvm_vcpu *vcpu)
{
- if (kvm_x86_ops.set_hv_timer)
- kvm_lapic_switch_to_hv_timer(vcpu);
+
}
static void vmx_setup_mce(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e0219acfd9cf..909e932a7ae7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9896,8 +9896,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
+ bool hv_timer;
+
if (!kvm_arch_vcpu_runnable(vcpu) &&
(!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
+ /*
+ * Switch to the software timer before halt-polling/blocking as
+ * the guest's timer may be a break event for the vCPU, and the
+ * hypervisor timer runs only when the CPU is in guest mode.
+ * Switch before halt-polling so that KVM recognizes an expired
+ * timer before blocking.
+ */
+ hv_timer = kvm_lapic_hv_timer_in_use(vcpu);
+ if (hv_timer)
+ kvm_lapic_switch_to_sw_timer(vcpu);
+
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
kvm_vcpu_halt(vcpu);
@@ -9905,6 +9918,9 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
kvm_vcpu_block(vcpu);
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
+ if (hv_timer)
+ kvm_lapic_switch_to_hv_timer(vcpu);
+
if (kvm_x86_ops.post_block)
static_call(kvm_x86_post_block)(vcpu);
@@ -10136,6 +10152,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
r = -EINTR;
goto out;
}
+ /*
+ * It should be impossible for the hypervisor timer to be in
+ * use before KVM has ever run the vCPU.
+ */
+ WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
kvm_vcpu_block(vcpu);
if (kvm_apic_accept_events(vcpu) < 0) {
r = 0;
--
2.33.0.882.g93a45727a2-goog
From: Jing Zhang <[email protected]>
Add a "blocking" stat that userspace can use to detect the case where a
vCPU is not being run because of a vCPU/guest action, e.g. HLT or WFS on
x86, WFI on arm64, etc... Current guest/host/halt stats don't show this
well, e.g. if a guest halts for a long period of time then the vCPU could
appear pathologically blocked due to a host condition, when in
reality the vCPU has been put into a not-runnable state by the guest.
Originally-by: Cannon Matthews <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>
[sean: renamed stat to "blocking", massaged changelog]
Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/kvm_host.h | 3 ++-
include/linux/kvm_types.h | 1 +
virt/kvm/kvm_main.c | 4 ++++
3 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2d837e06eeec..bdaa0e70b060 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1462,7 +1462,8 @@ struct _kvm_stats_desc {
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_hist, \
HALT_POLL_HIST_COUNT), \
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_wait_hist, \
- HALT_POLL_HIST_COUNT)
+ HALT_POLL_HIST_COUNT), \
+ STATS_DESC_ICOUNTER(VCPU_GENERIC, blocking)
extern struct dentry *kvm_debugfs_dir;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 2237abb93ccd..c4f9257bf32d 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_stat_generic {
u64 halt_poll_success_hist[HALT_POLL_HIST_COUNT];
u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
+ u64 blocking;
};
#define KVM_STATS_NAME_SIZE 48
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 42894ff7c474..a36ccdc93a72 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3211,6 +3211,8 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
bool waited = false;
+ vcpu->stat.generic.blocking = 1;
+
kvm_arch_vcpu_blocking(vcpu);
prepare_to_rcuwait(wait);
@@ -3227,6 +3229,8 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_unblocking(vcpu);
+ vcpu->stat.generic.blocking = 0;
+
return waited;
}
--
2.33.0.882.g93a45727a2-goog
Refactor the posted interrupt helper to take the desired notification
vector instead of a bool so that the callers are self-documenting.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 78c8bc7f1b3b..f505fee3cf5c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3928,11 +3928,9 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
}
static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
- bool nested)
+ int pi_vec)
{
#ifdef CONFIG_SMP
- int pi_vec = nested ? POSTED_INTR_NESTED_VECTOR : POSTED_INTR_VECTOR;
-
if (vcpu->mode == IN_GUEST_MODE) {
/*
* The vector of interrupt to be delivered to vcpu had
@@ -3986,7 +3984,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
*/
kvm_make_request(KVM_REQ_EVENT, vcpu);
/* the PIR and ON have been set by L1. */
- if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
+ if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR))
kvm_vcpu_wake_up(vcpu);
return 0;
}
@@ -4024,7 +4022,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
* guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
* posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
*/
- if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
+ if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR))
kvm_vcpu_wake_up(vcpu);
return 0;
--
2.33.0.882.g93a45727a2-goog
Add helpers to wake and query a blocking vCPU. In addition to providing
nice names, the helpers reduce the probability of KVM neglecting to use
kvm_arch_vcpu_get_wait().
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/arm64/kvm/arch_timer.c | 3 +--
arch/arm64/kvm/arm.c | 2 +-
arch/x86/kvm/lapic.c | 2 +-
include/linux/kvm_host.h | 14 ++++++++++++++
virt/kvm/async_pf.c | 2 +-
virt/kvm/kvm_main.c | 8 ++------
6 files changed, 20 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 7e8396f74010..addd53b6eba6 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -649,7 +649,6 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
{
struct arch_timer_cpu *timer = vcpu_timer(vcpu);
struct timer_map map;
- struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
if (unlikely(!timer->enabled))
return;
@@ -672,7 +671,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
if (map.emul_ptimer)
soft_timer_cancel(&map.emul_ptimer->hrtimer);
- if (rcuwait_active(wait))
+ if (kvm_vcpu_is_blocking(vcpu))
kvm_timer_blocking(vcpu);
/*
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 268b1e7bf700..9ff0e85a9f16 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -622,7 +622,7 @@ void kvm_arm_resume_guest(struct kvm *kvm)
kvm_for_each_vcpu(i, vcpu, kvm) {
vcpu->arch.pause = false;
- rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
+ __kvm_vcpu_wake_up(vcpu);
}
}
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 76fb00921203..0cd7ed21b205 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1931,7 +1931,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
/* If the preempt notifier has already run, it also called apic_timer_expired */
if (!apic->lapic_timer.hv_timer_in_use)
goto out;
- WARN_ON(rcuwait_active(&vcpu->wait));
+ WARN_ON(kvm_vcpu_is_blocking(vcpu));
apic_timer_expired(apic, false);
cancel_hv_timer(apic);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bdaa0e70b060..1fa38dc00b87 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1151,6 +1151,20 @@ static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
#endif
}
+/*
+ * Wake a vCPU if necessary, but don't do any stats/metadata updates. Returns
+ * true if the vCPU was blocking and was awakened, false otherwise.
+ */
+static inline bool __kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
+{
+ return !!rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
+}
+
+static inline bool kvm_vcpu_is_blocking(struct kvm_vcpu *vcpu)
+{
+ return rcuwait_active(kvm_arch_vcpu_get_wait(vcpu));
+}
+
#ifdef __KVM_HAVE_ARCH_INTC_INITIALIZED
/*
* returns true if the virtual interrupt controller is initialized and
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index ccb35c22785e..9bfe1d6f6529 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -85,7 +85,7 @@ static void async_pf_execute(struct work_struct *work)
trace_kvm_async_pf_completed(addr, cr2_or_gpa);
- rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
+ __kvm_vcpu_wake_up(vcpu);
mmput(mm);
kvm_put_kvm(vcpu->kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 481e8178b43d..c870cae7e776 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3332,10 +3332,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
{
- struct rcuwait *waitp;
-
- waitp = kvm_arch_vcpu_get_wait(vcpu);
- if (rcuwait_wake_up(waitp)) {
+ if (__kvm_vcpu_wake_up(vcpu)) {
WRITE_ONCE(vcpu->ready, true);
++vcpu->stat.generic.halt_wakeup;
return true;
@@ -3490,8 +3487,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (vcpu == me)
continue;
- if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) &&
- !vcpu_dy_runnable(vcpu))
+ if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
continue;
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
!kvm_arch_dy_has_pending_interrupt(vcpu) &&
--
2.33.0.882.g93a45727a2-goog
Drop the avic_vcpu_is_running() check when waking vCPUs in response to a
VM-Exit due to incomplete IPI delivery. The check isn't wrong per se, but
it's not 100% accurate in the sense that it doesn't guarantee that the vCPU
was one of the vCPUs that didn't receive the IPI.
The check isn't required for correctness as blocking == !running in this
context.
From a performance perspective, waking a live task is not expensive as the
only moderately costly operation is a locked operation to temporarily
disable preemption. And if that is indeed a performance issue,
kvm_vcpu_is_blocking() would be a better check than poking into the AVIC.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/avic.c | 15 +++++++++------
arch/x86/kvm/svm/svm.h | 11 -----------
2 files changed, 9 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cbf02e7e20d0..b43b05610ade 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -295,13 +295,16 @@ static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
struct kvm_vcpu *vcpu;
int i;
+ /*
+ * Wake any target vCPUs that are blocking, i.e. waiting for a wake
+ * event. There's no need to signal doorbells, as hardware has handled
+ * vCPUs that were in guest at the time of the IPI, and vCPUs that have
+ * since entered the guest will have processed pending IRQs at VMRUN.
+ */
kvm_for_each_vcpu(i, vcpu, kvm) {
- bool m = kvm_apic_match_dest(vcpu, source,
- icrl & APIC_SHORT_MASK,
- GET_APIC_DEST_FIELD(icrh),
- icrl & APIC_DEST_MASK);
-
- if (m && !avic_vcpu_is_running(vcpu))
+ if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
+ GET_APIC_DEST_FIELD(icrh),
+ icrl & APIC_DEST_MASK))
kvm_vcpu_wake_up(vcpu);
}
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0d7bbe548ac3..7f5b01bbee29 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -509,17 +509,6 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL
-static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- u64 *entry = svm->avic_physical_id_cache;
-
- if (!entry)
- return false;
-
- return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
-}
-
int avic_ga_log_notifier(u32 ga_tag);
void avic_vm_destroy(struct kvm *kvm);
int avic_vm_init(struct kvm *kvm);
--
2.33.0.882.g93a45727a2-goog
Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor
out of the descriptor write loop; preemption is disabled so the CPU won't
change, and if the APIC ID changes KVM has bigger problems.
No functional change intended.
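For reference, a minimal standalone sketch of the hoisted computation, based
only on the arithmetic visible in the diff (the helper name is invented for
the example): x2APIC mode uses the full APIC ID as the destination, while
xAPIC mode places the 8-bit ID in bits 15:8.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t pi_ndst_for_apic_id(uint32_t apic_id, bool x2apic_mode)
  {
          /* x2APIC: full 32-bit ID; xAPIC: 8-bit ID in bits 15:8. */
          return x2apic_mode ? apic_id : (apic_id << 8) & 0xFF00;
  }

  int main(void)
  {
          /* Computed once, outside any retry loop, since the CPU is fixed. */
          printf("x2apic: 0x%x\n", pi_ndst_for_apic_id(3, true));   /* 0x3   */
          printf("xapic:  0x%x\n", pi_ndst_for_apic_id(3, false));  /* 0x300 */
          return 0;
  }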
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 25 +++++++++++--------------
1 file changed, 11 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index fea343dcc011..2b2206339174 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -51,17 +51,15 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
goto after_clear_sn;
}
- /* The full case. */
+ /* The full case. Set the new destination and clear SN. */
+ dest = cpu_physical_id(cpu);
+ if (!x2apic_mode)
+ dest = (dest << 8) & 0xFF00;
+
do {
old.control = new.control = READ_ONCE(pi_desc->control);
- dest = cpu_physical_id(cpu);
-
- if (x2apic_mode)
- new.ndst = dest;
- else
- new.ndst = (dest << 8) & 0xFF00;
-
+ new.ndst = dest;
new.sn = 0;
} while (cmpxchg64(&pi_desc->control, old.control,
new.control) != old.control);
@@ -103,15 +101,14 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
"Wakeup handler not enabled while the vCPU was blocking");
+ dest = cpu_physical_id(vcpu->cpu);
+ if (!x2apic_mode)
+ dest = (dest << 8) & 0xFF00;
+
do {
old.control = new.control = READ_ONCE(pi_desc->control);
- dest = cpu_physical_id(vcpu->cpu);
-
- if (x2apic_mode)
- new.ndst = dest;
- else
- new.ndst = (dest << 8) & 0xFF00;
+ new.ndst = dest;
/* set 'NV' to 'notification vector' */
new.nv = POSTED_INTR_VECTOR;
--
2.33.0.882.g93a45727a2-goog
Move the WARN sanity checks out of the PI descriptor update loop so as
not to spam the kernel log if the condition is violated and the update
takes multiple attempts due to another writer. This also eliminates a
few extra uops from the retry path.
Technically not checking every attempt could mean KVM will now fail to
WARN in a scenario that would have triggered the WARN before, but any such
failure would be inherently racy as some other agent (CPU or device) would
have to concurrently modify the PI descriptor.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 351666c41bbc..67cbe6ab8f66 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -100,10 +100,11 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
struct pi_desc old, new;
unsigned int dest;
+ WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
+ "Wakeup handler not enabled while the vCPU was blocking");
+
do {
old.control = new.control = pi_desc->control;
- WARN(old.nv != POSTED_INTR_WAKEUP_VECTOR,
- "Wakeup handler not enabled while the VCPU is blocked\n");
dest = cpu_physical_id(vcpu->cpu);
@@ -161,13 +162,12 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
}
+ WARN(pi_desc->sn == 1,
+ "Posted Interrupt Suppress Notification set before blocking");
+
do {
old.control = new.control = pi_desc->control;
- WARN((pi_desc->sn == 1),
- "Warning: SN field of posted-interrupts "
- "is set before blocking\n");
-
/*
* Since vCPU can be preempted during this process,
* vcpu->cpu could be different with pre_pcpu, we
--
2.33.0.882.g93a45727a2-goog
Always mark the AVIC as "running" on vCPU load when the AVIC is enabled and
drop the vcpu_blocking/unblocking hooks that toggle "running". There is no
harm in keeping the flag set for a wee bit longer when a vCPU is blocking,
i.e. between the start of blocking and being scheduled out. At worst, an
agent in the host will unnecessarily signal the doorbell, but that's
already the status quo in KVM as the "running" flag is set the entire time
a vCPU is loaded, not just when it's actively running the guest.
In addition to simplifying the code, keeping the "running" flag set longer
can reduce the number of VM-Exits due to incomplete IPI delivery.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/avic.c | 53 +++++++++++++----------------------------
arch/x86/kvm/svm/svm.c | 8 -------
arch/x86/kvm/svm/svm.h | 3 ---
3 files changed, 17 insertions(+), 47 deletions(-)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index b43b05610ade..213f5223f63e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -967,6 +967,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
+ /* TODO: Document why the unblocking path checks for updates. */
+ if (kvm_vcpu_is_blocking(vcpu) &&
+ kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) {
+ kvm_vcpu_update_apicv(vcpu);
+
+ if (!kvm_vcpu_apicv_active(vcpu))
+ return;
+ }
+
/*
* Since the host physical APIC id is 8 bits,
* we can support host APIC ID upto 255.
@@ -974,19 +983,21 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
return;
+ /*
+ * Unconditionally mark the AVIC as "running", even if the vCPU is in
+ * kvm_vcpu_block(). kvm_vcpu_check_block() will detect pending IRQs
+ * and bail out of the block loop, and if not, avic_vcpu_put() will
+ * set the AVIC back to "not running" when the vCPU is scheduled out.
+ */
entry = READ_ONCE(*(svm->avic_physical_id_cache));
WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
-
- entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
- if (svm->avic_is_running)
- entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+ entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
- avic_update_iommu_vcpu_affinity(vcpu, h_physical_id,
- svm->avic_is_running);
+ avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
}
void avic_vcpu_put(struct kvm_vcpu *vcpu)
@@ -1001,33 +1012,3 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
}
-
-/*
- * This function is called during VCPU halt/unhalt.
- */
-static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->avic_is_running = is_run;
-
- if (!kvm_vcpu_apicv_active(vcpu))
- return;
-
- if (is_run)
- avic_vcpu_load(vcpu, vcpu->cpu);
- else
- avic_vcpu_put(vcpu);
-}
-
-void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
-{
- avic_set_running(vcpu, false);
-}
-
-void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
-{
- if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
- kvm_vcpu_update_apicv(vcpu);
- avic_set_running(vcpu, true);
-}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 89077160d463..a1ca5707f2c8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1433,12 +1433,6 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
if (err)
goto error_free_vmsa_page;
- /* We initialize this flag to true to make sure that the is_running
- * bit would be set the first time the vcpu is loaded.
- */
- if (irqchip_in_kernel(vcpu->kvm) && kvm_apicv_activated(vcpu->kvm))
- svm->avic_is_running = true;
-
svm->msrpm = svm_vcpu_alloc_msrpm();
if (!svm->msrpm) {
err = -ENOMEM;
@@ -4597,8 +4591,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.prepare_guest_switch = svm_prepare_guest_switch,
.vcpu_load = svm_vcpu_load,
.vcpu_put = svm_vcpu_put,
- .vcpu_blocking = svm_vcpu_blocking,
- .vcpu_unblocking = svm_vcpu_unblocking,
.update_exception_bitmap = svm_update_exception_bitmap,
.get_msr_feature = svm_get_msr_feature,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7f5b01bbee29..652d71acfb6c 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -169,7 +169,6 @@ struct vcpu_svm {
u32 dfr_reg;
struct page *avic_backing_page;
u64 *avic_physical_id_cache;
- bool avic_is_running;
/*
* Per-vcpu list of struct amd_svm_iommu_ir:
@@ -529,8 +528,6 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec);
bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu);
int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
uint32_t guest_irq, bool set);
-void svm_vcpu_blocking(struct kvm_vcpu *vcpu);
-void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
/* sev.c */
--
2.33.0.882.g93a45727a2-goog
Remove kvm_arch_vcpu_(un)blocking() now that all implementations are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/arm64/kvm/arm.c | 10 ----------
arch/mips/include/asm/kvm_host.h | 2 --
arch/powerpc/include/asm/kvm_host.h | 2 --
arch/s390/include/asm/kvm_host.h | 2 --
arch/x86/include/asm/kvm-x86-ops.h | 2 --
arch/x86/include/asm/kvm_host.h | 13 -------------
include/linux/kvm_host.h | 2 --
virt/kvm/kvm_main.c | 4 ----
8 files changed, 37 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9ff0e85a9f16..444d6f5a980a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -357,16 +357,6 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
return kvm_timer_is_pending(vcpu);
}
-void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
-{
-
-}
-
-void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
-{
-
-}
-
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct kvm_s2_mmu *mmu;
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 72b90d45a46e..28110f71089b 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -895,8 +895,6 @@ static inline void kvm_arch_free_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot) {}
static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
-static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB
int kvm_arch_flush_remote_tlb(struct kvm *kvm);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 4a195c161592..0dfee6866541 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -863,7 +863,5 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_exit(void) {}
-static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
#endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a22c9266ea05..25ed4ec66f4a 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1007,7 +1007,5 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot) {}
-static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
#endif
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index c2b007171abd..f2c38acdcad6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -96,8 +96,6 @@ KVM_X86_OP(handle_exit_irqoff)
KVM_X86_OP_NULL(request_immediate_exit)
KVM_X86_OP(sched_in)
KVM_X86_OP_NULL(update_cpu_dirty_logging)
-KVM_X86_OP_NULL(vcpu_blocking)
-KVM_X86_OP_NULL(vcpu_unblocking)
KVM_X86_OP_NULL(update_pi_irte)
KVM_X86_OP_NULL(start_assignment)
KVM_X86_OP_NULL(apicv_post_state_restore)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 76a8dddc1a48..bebd42926321 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1445,9 +1445,6 @@ struct kvm_x86_ops {
const struct kvm_pmu_ops *pmu_ops;
const struct kvm_x86_nested_ops *nested_ops;
- void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
- void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
-
int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
uint32_t guest_irq, bool set);
void (*start_assignment)(struct kvm *kvm);
@@ -1904,16 +1901,6 @@ static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq)
irq->delivery_mode == APIC_DM_LOWEST);
}
-static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
-{
- static_call_cond(kvm_x86_vcpu_blocking)(vcpu);
-}
-
-static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
-{
- static_call_cond(kvm_x86_vcpu_unblocking)(vcpu);
-}
-
static inline int kvm_cpu_get_apicid(int mps_cpu)
{
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c5961a361c73..6a84b020daa6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -966,8 +966,6 @@ void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
bool kvm_vcpu_block(struct kvm_vcpu *vcpu);
-void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
-void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
int kvm_vcpu_yield_to(struct kvm_vcpu *target);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c1850b60f38b..96de905e26e4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3210,8 +3210,6 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
vcpu->stat.generic.blocking = 1;
- kvm_arch_vcpu_blocking(vcpu);
-
prepare_to_rcuwait(wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
@@ -3224,8 +3222,6 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
}
finish_rcuwait(wait);
- kvm_arch_vcpu_unblocking(vcpu);
-
vcpu->stat.generic.blocking = 0;
return waited;
--
2.33.0.882.g93a45727a2-goog
Replace the full "kick" with just the "wake" in the fallback path when
triggering a virtual interrupt via a posted interrupt fails because the
vCPU is not IN_GUEST_MODE. If the vCPU transitions into guest mode
between the check and the kick, then it's guaranteed to see the pending
interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting
IN_GUEST_MODE. Kicking the guest in this case is nothing more than an
unnecessary VM-Exit (and host IRQ).
Opportunistically update comments to explain the various ordering rules
and barriers at play.
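As a hedged, user-space model of that pairing (C11 seq_cst atomics standing
in for KVM's barriers; all names are invented for the example, and this is a
sketch of the ordering argument, not KVM's implementation), the program below
checks that a posted event is always noticed by at least one side:

  #include <assert.h>
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

  static _Atomic int mode;
  static _Atomic int event_pending;
  static _Atomic int sender_woke_vcpu;
  static _Atomic int vcpu_saw_event;

  static void *sender(void *arg)
  {
          (void)arg;
          atomic_store(&event_pending, 1);          /* like setting PID.ON  */
          if (atomic_load(&mode) != IN_GUEST_MODE)  /* posted IRQ "failed"  */
                  atomic_store(&sender_woke_vcpu, 1);
          return NULL;
  }

  static void *vcpu(void *arg)
  {
          (void)arg;
          atomic_store(&mode, IN_GUEST_MODE);       /* vcpu_enter_guest()   */
          if (atomic_load(&event_pending))          /* sync PIR to IRR      */
                  atomic_store(&vcpu_saw_event, 1);
          return NULL;
  }

  int main(void)
  {
          for (int i = 0; i < 100000; i++) {
                  pthread_t a, b;

                  atomic_store(&mode, OUTSIDE_GUEST_MODE);
                  atomic_store(&event_pending, 0);
                  atomic_store(&sender_woke_vcpu, 0);
                  atomic_store(&vcpu_saw_event, 0);

                  pthread_create(&a, NULL, sender, NULL);
                  pthread_create(&b, NULL, vcpu, NULL);
                  pthread_join(a, NULL);
                  pthread_join(b, NULL);

                  /* With full barriers on both sides, the event is never lost. */
                  assert(atomic_load(&sender_woke_vcpu) ||
                         atomic_load(&vcpu_saw_event));
          }
          printf("no lost events\n");
          return 0;
  }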
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 ++++++++++++++--
arch/x86/kvm/x86.c | 5 +++--
2 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 13e732a818f3..44d760dde0f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3978,10 +3978,16 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
* we will accomplish it in the next vmentry.
*/
vmx->nested.pi_pending = true;
+ /*
+ * The smp_wmb() in kvm_make_request() pairs with the smp_mb_*()
+ * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU
+ * is guaranteed to see the event request if triggering a posted
+ * interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+ */
kvm_make_request(KVM_REQ_EVENT, vcpu);
/* the PIR and ON have been set by L1. */
if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_wake_up(vcpu);
return 0;
}
return -1;
@@ -4012,9 +4018,15 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
if (pi_test_and_set_on(&vmx->pi_desc))
return 0;
+ /*
+ * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
+ * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
+ * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
+ * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+ */
if (vcpu != kvm_get_running_vcpu() &&
!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_wake_up(vcpu);
return 0;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9643f23c28c7..274d295cabfb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9752,8 +9752,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
smp_mb__after_srcu_read_unlock();
/*
- * This handles the case where a posted interrupt was
- * notified with kvm_vcpu_kick.
+ * Process pending posted interrupts to handle the case where the
+ * notification IRQ arrived in the host, or was never sent (because the
+ * target vCPU wasn't running).
*/
if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
static_call(kvm_x86_sync_pir_to_irr)(vcpu);
--
2.33.0.882.g93a45727a2-goog
Use READ_ONCE() when loading the posted interrupt descriptor control
field to ensure "old" and "new" have the same base value. If the
compiler emits separate loads, and loads into "new" before "old", KVM
could theoretically drop the ON bit if it were set between the loads.
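A rough user-space model of the pattern (C11 atomics standing in for
READ_ONCE() and cmpxchg64(); the field layout and vector value are invented
for the example): the key point is that "old" and "new" start from a single
load of the control word, so a concurrent writer can only force a retry, never
be silently overwritten:

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define NV_MASK    0xffull
  #define WAKEUP_NV  0xf1ull   /* arbitrary vector number for the example */

  static _Atomic uint64_t pi_control;

  static void set_wakeup_vector(void)
  {
          uint64_t old, new;

          do {
                  /* One load feeds both "old" and "new" (READ_ONCE() in KVM). */
                  old = new = atomic_load(&pi_control);
                  new = (new & ~NV_MASK) | WAKEUP_NV;
                  /* cmpxchg64() equivalent: retry if another writer raced in. */
          } while (!atomic_compare_exchange_strong(&pi_control, &old, new));
  }

  int main(void)
  {
          atomic_store(&pi_control, 0x100000000ull);   /* pretend "ON"-style bit */
          set_wakeup_vector();
          printf("control = 0x%llx\n",
                 (unsigned long long)atomic_load(&pi_control));
          return 0;
  }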
Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 414ea6972b5c..fea343dcc011 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -53,7 +53,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
/* The full case. */
do {
- old.control = new.control = pi_desc->control;
+ old.control = new.control = READ_ONCE(pi_desc->control);
dest = cpu_physical_id(cpu);
@@ -104,7 +104,7 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
"Wakeup handler not enabled while the vCPU was blocking");
do {
- old.control = new.control = pi_desc->control;
+ old.control = new.control = READ_ONCE(pi_desc->control);
dest = cpu_physical_id(vcpu->cpu);
@@ -160,7 +160,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
"Posted Interrupt Suppress Notification set before blocking");
do {
- old.control = new.control = pi_desc->control;
+ old.control = new.control = READ_ONCE(pi_desc->control);
/* set 'NV' to 'wakeup vector' */
new.nv = POSTED_INTR_WAKEUP_VECTOR;
--
2.33.0.882.g93a45727a2-goog
When waking vCPUs in the posted interrupt wakeup handling, do exactly
that and no more. There is no need to kick the vCPU as the wakeup
handler just need to get the vCPU task running, and if it's in the guest
then it's definitely running.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index f1bcf8c32b6d..06eb9c950760 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -192,7 +192,7 @@ void pi_wakeup_handler(void)
pi_wakeup_list) {
if (pi_test_on(&vmx->pi_desc))
- kvm_vcpu_kick(&vmx->vcpu);
+ kvm_vcpu_wake_up(&vmx->vcpu);
}
spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
}
--
2.33.0.882.g93a45727a2-goog
Remove kvm_vcpu.pre_pcpu as it no longer has any users. No functional
change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/kvm_host.h | 1 -
virt/kvm/kvm_main.c | 1 -
2 files changed, 2 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1fa38dc00b87..87996b22e681 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -304,7 +304,6 @@ struct kvm_vcpu {
u64 requests;
unsigned long guest_debug;
- int pre_pcpu;
struct list_head blocked_vcpu_list;
struct mutex mutex;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c870cae7e776..2bbf5c9d410f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -426,7 +426,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
#endif
kvm_async_pf_vcpu_init(vcpu);
- vcpu->pre_pcpu = -1;
INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
kvm_vcpu_set_in_spin_loop(vcpu, false);
--
2.33.0.882.g93a45727a2-goog
Remove the vCPU from the wakeup list before updating the notification
vector in the posted interrupt post-block helper. There is no need to
wake the current vCPU as it is by definition not blocking. Practically
speaking this is a nop as it only shaves a few meager cycles in the
unlikely case that the vCPU was migrated and the previous pCPU gets a
wakeup IRQ right before PID.NV is updated. The real motivation is to
allow for more readable code in the future, when post-block is merged
with vmx_vcpu_pi_load(), at which point removal from the list will be
conditional on the old notification vector.
Opportunistically add comments to document why KVM has a per-CPU spinlock
that, at first glance, appears to be taken only on the owning CPU.
Explicitly call out that the spinlock must be taken with IRQs disabled, a
detail that was "lost" when KVM switched from spin_lock_irqsave() to
spin_lock(), with IRQs disabled for the entirety of the relevant path.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 49 +++++++++++++++++++++++-----------
1 file changed, 33 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 2b2206339174..901b7a5f7777 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -10,10 +10,22 @@
#include "vmx.h"
/*
- * We maintain a per-CPU linked-list of vCPU, so in wakeup_handler() we
- * can find which vCPU should be waken up.
+ * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
+ * when a WAKEUP_VECTOR interrupt is posted. vCPUs are added to the list when
+ * the vCPU is scheduled out and is blocking (e.g. in HLT) with IRQs enabled.
+ * The vCPU's posted interrupt descriptor is updated at the same time to set its
+ * notification vector to WAKEUP_VECTOR, so that posted interrupts from devices
+ * wake the target vCPUs. vCPUs are removed from the list and the notification
+ * vector is reset when the vCPU is scheduled in.
*/
static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+/*
+ * Protect the per-CPU list with a per-CPU spinlock to handle task migration.
+ * When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
+ * ->sched_in() path will need to take the vCPU off the list of the _previous_
+ * CPU. IRQs must be disabled when taking this lock, otherwise deadlock will
+ * occur if a wakeup IRQ arrives and attempts to acquire the lock.
+ */
static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
@@ -101,23 +113,28 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
"Wakeup handler not enabled while the vCPU was blocking");
- dest = cpu_physical_id(vcpu->cpu);
- if (!x2apic_mode)
- dest = (dest << 8) & 0xFF00;
-
- do {
- old.control = new.control = READ_ONCE(pi_desc->control);
-
- new.ndst = dest;
-
- /* set 'NV' to 'notification vector' */
- new.nv = POSTED_INTR_VECTOR;
- } while (cmpxchg64(&pi_desc->control, old.control,
- new.control) != old.control);
-
+ /*
+ * Remove the vCPU from the wakeup list of the _previous_ pCPU, which
+ * will not be the same as the current pCPU if the task was migrated.
+ */
spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
list_del(&vcpu->blocked_vcpu_list);
spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
+
+ dest = cpu_physical_id(vcpu->cpu);
+ if (!x2apic_mode)
+ dest = (dest << 8) & 0xFF00;
+
+ do {
+ old.control = new.control = READ_ONCE(pi_desc->control);
+
+ new.ndst = dest;
+
+ /* set 'NV' to 'notification vector' */
+ new.nv = POSTED_INTR_VECTOR;
+ } while (cmpxchg64(&pi_desc->control, old.control,
+ new.control) != old.control);
+
vcpu->pre_pcpu = -1;
}
--
2.33.0.882.g93a45727a2-goog
Move the posted interrupt pre/post_block logic into vcpu_put/load
respectively, using kvm_vcpu_is_blocking() to determine whether or not the
wakeup handler needs to be set (and unset). This avoids updating
the PI descriptor if halt-polling is successful, reduces the number of
touchpoints for updating the descriptor, and eliminates the confusing
behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
is scheduled back in after preemption.
The downside is that KVM will do the PID update twice if the vCPU is
preempted after prepare_to_rcuwait() but before schedule(), but that's a
rare case (and non-existent on !PREEMPT kernels).
The notable wart is the need to send a self-IPI on the wakeup vector if
an outstanding notification is pending after configuring the wakeup
vector. Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
but the scheduler doesn't support waking a task from its preemption
notifier callback, i.e. while the task is smack dab in the middle of
being scheduled out.
Note, setting the wakeup vector before halt-polling is not necessary as
the pending IRQ will be recorded in the PIR and detected as a blocking-
breaking condition by kvm_vcpu_has_events() -> vmx_sync_pir_to_irr().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 162 ++++++++++++++-------------------
arch/x86/kvm/vmx/posted_intr.h | 8 +-
arch/x86/kvm/vmx/vmx.c | 5 -
3 files changed, 75 insertions(+), 100 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 901b7a5f7777..d2b3d75c57d1 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -37,33 +37,45 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
{
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
struct pi_desc old, new;
+ unsigned long flags;
unsigned int dest;
/*
- * To simplify hot-plug and dynamic toggling of APICv, keep PI.NDST and
- * PI.SN up-to-date even if there is no assigned device or if APICv is
+ * To simplify hot-plug and dynamic toggling of APICv, keep PI.NDST
+ * up-to-date even if there is no assigned device or if APICv is
* deactivated due to a dynamic inhibit bit, e.g. for Hyper-V's SyncIC.
*/
if (!enable_apicv || !lapic_in_kernel(vcpu))
return;
- /* Nothing to do if PI.SN==0 and the vCPU isn't being migrated. */
- if (!pi_test_sn(pi_desc) && vcpu->cpu == cpu)
+ /*
+ * If the vCPU wasn't on the wakeup list and wasn't migrated, then the
+ * full update can be skipped as neither the vector nor the destination
+ * needs to be changed.
+ */
+ if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
+ /*
+ * Clear SN if it was set due to being preempted. Again, do
+ * this even if there is no assigned device for simplicity.
+ */
+ if (pi_test_and_clear_sn(pi_desc))
+ goto after_clear_sn;
return;
+ }
+
+ local_irq_save(flags);
/*
- * If the 'nv' field is POSTED_INTR_WAKEUP_VECTOR, do not change
- * PI.NDST: pi_post_block is the one expected to change PID.NDST and the
- * wakeup handler expects the vCPU to be on the blocked_vcpu_list that
- * matches PI.NDST. Otherwise, a vcpu may not be able to be woken up
- * correctly.
+ * If the vCPU was waiting for wakeup, remove the vCPU from the wakeup
+ * list of the _previous_ pCPU, which will not be the same as the
+ * current pCPU if the task was migrated.
*/
- if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR || vcpu->cpu == cpu) {
- pi_clear_sn(pi_desc);
- goto after_clear_sn;
+ if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
+ spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
+ list_del(&vcpu->blocked_vcpu_list);
+ spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
}
- /* The full case. Set the new destination and clear SN. */
dest = cpu_physical_id(cpu);
if (!x2apic_mode)
dest = (dest << 8) & 0xFF00;
@@ -71,11 +83,23 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
do {
old.control = new.control = READ_ONCE(pi_desc->control);
+ /*
+ * Clear SN (as above) and refresh the destination APIC ID to
+ * handle task migration (@cpu != vcpu->cpu).
+ */
new.ndst = dest;
new.sn = 0;
+
+ /*
+ * Restore the notification vector; in the blocking case, the
+ * descriptor was modified on "put" to use the wakeup vector.
+ */
+ new.nv = POSTED_INTR_VECTOR;
} while (cmpxchg64(&pi_desc->control, old.control,
new.control) != old.control);
+ local_irq_restore(flags);
+
after_clear_sn:
/*
@@ -90,88 +114,24 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
pi_set_on(pi_desc);
}
-void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
-{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-
- if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
- !irq_remapping_cap(IRQ_POSTING_CAP) ||
- !kvm_vcpu_apicv_active(vcpu))
- return;
-
- /* Set SN when the vCPU is preempted */
- if (vcpu->preempted)
- pi_set_sn(pi_desc);
-}
-
-static void __pi_post_block(struct kvm_vcpu *vcpu)
-{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- struct pi_desc old, new;
- unsigned int dest;
-
- WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
- "Wakeup handler not enabled while the vCPU was blocking");
-
- /*
- * Remove the vCPU from the wakeup list of the _previous_ pCPU, which
- * will not be the same as the current pCPU if the task was migrated.
- */
- spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
- list_del(&vcpu->blocked_vcpu_list);
- spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
-
- dest = cpu_physical_id(vcpu->cpu);
- if (!x2apic_mode)
- dest = (dest << 8) & 0xFF00;
-
- do {
- old.control = new.control = READ_ONCE(pi_desc->control);
-
- new.ndst = dest;
-
- /* set 'NV' to 'notification vector' */
- new.nv = POSTED_INTR_VECTOR;
- } while (cmpxchg64(&pi_desc->control, old.control,
- new.control) != old.control);
-
- vcpu->pre_pcpu = -1;
-}
-
/*
- * This routine does the following things for vCPU which is going
- * to be blocked if VT-d PI is enabled.
- * - Store the vCPU to the wakeup list, so when interrupts happen
- * we can find the right vCPU to wake up.
- * - Change the Posted-interrupt descriptor as below:
- * 'NV' <-- POSTED_INTR_WAKEUP_VECTOR
- * - If 'ON' is set during this process, which means at least one
- * interrupt is posted for this vCPU, we cannot block it, in
- * this case, return 1, otherwise, return 0.
- *
+ * Put the vCPU on this pCPU's list of vCPUs that need to be awakened and set
+ * WAKEUP as the notification vector in the PI descriptor.
*/
-int pi_pre_block(struct kvm_vcpu *vcpu)
+static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
{
- struct pi_desc old, new;
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+ struct pi_desc old, new;
unsigned long flags;
- if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
- !irq_remapping_cap(IRQ_POSTING_CAP) ||
- !kvm_vcpu_apicv_active(vcpu) ||
- vmx_interrupt_blocked(vcpu))
- return 0;
-
local_irq_save(flags);
- vcpu->pre_pcpu = vcpu->cpu;
spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
list_add_tail(&vcpu->blocked_vcpu_list,
&per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
- WARN(pi_desc->sn == 1,
- "Posted Interrupt Suppress Notification set before blocking");
+ WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
do {
old.control = new.control = READ_ONCE(pi_desc->control);
@@ -181,24 +141,40 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
} while (cmpxchg64(&pi_desc->control, old.control,
new.control) != old.control);
- /* We should not block the vCPU if an interrupt is posted for it. */
- if (pi_test_on(pi_desc))
- __pi_post_block(vcpu);
+ /*
+ * Send a wakeup IPI to this CPU if an interrupt may have been posted
+ * before the notification vector was updated, in which case the IRQ
+ * will arrive on the non-wakeup vector. An IPI is needed as calling
+ * try_to_wake_up() from ->sched_out() isn't allowed (IRQs are not
+ * enabled until it is safe to call try_to_wake_up() on the task being
+ * scheduled out).
+ */
+ if (pi_test_on(&new))
+ apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
local_irq_restore(flags);
- return (vcpu->pre_pcpu == -1);
}
-void pi_post_block(struct kvm_vcpu *vcpu)
+void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
{
- unsigned long flags;
+ struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- if (vcpu->pre_pcpu == -1)
+ if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
+ !irq_remapping_cap(IRQ_POSTING_CAP) ||
+ !kvm_vcpu_apicv_active(vcpu))
return;
- local_irq_save(flags);
- __pi_post_block(vcpu);
- local_irq_restore(flags);
+ if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+ pi_enable_wakeup_handler(vcpu);
+
+ /*
+ * Set SN when the vCPU is preempted. Note, the vCPU can both be seen
+ * as blocking and preempted, e.g. if it's preempted between setting
+ * its wait state and manually scheduling out. In that case, KVM will
+ * do the PID update twice, which is rare and harmless.
+ */
+ if (vcpu->preempted)
+ pi_set_sn(pi_desc);
}
/*
@@ -239,7 +215,7 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
* Bail out of the block loop if the VM has an assigned
* device, but the blocking vCPU didn't reconfigure the
* PI.NV to the wakeup vector, i.e. the assigned device
- * came along after the initial check in pi_pre_block().
+ * came along after the initial check in vmx_vcpu_pi_put().
*/
void vmx_pi_start_assignment(struct kvm *kvm)
{
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 36ae035f14aa..eb14e76b84ef 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -40,6 +40,12 @@ static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
}
+static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
+{
+ return test_and_clear_bit(POSTED_INTR_SN,
+ (unsigned long *)&pi_desc->control);
+}
+
static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
{
return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
@@ -88,8 +94,6 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
-int pi_pre_block(struct kvm_vcpu *vcpu);
-void pi_post_block(struct kvm_vcpu *vcpu);
void pi_wakeup_handler(void);
void __init pi_init_cpu(int cpu);
bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5517893f12fc..26ed8cd1a1f2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7462,9 +7462,6 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
static int vmx_pre_block(struct kvm_vcpu *vcpu)
{
- if (pi_pre_block(vcpu))
- return 1;
-
if (kvm_lapic_hv_timer_in_use(vcpu))
kvm_lapic_switch_to_sw_timer(vcpu);
@@ -7475,8 +7472,6 @@ static void vmx_post_block(struct kvm_vcpu *vcpu)
{
if (kvm_x86_ops.set_hv_timer)
kvm_lapic_switch_to_hv_timer(vcpu);
-
- pi_post_block(vcpu);
}
static void vmx_setup_mce(struct kvm_vcpu *vcpu)
--
2.33.0.882.g93a45727a2-goog
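For reference, a condensed sketch of the vcpu_put() side after this patch; the
!APICv and assigned-device checks are elided and the function name is invented
for illustration.

static void vmx_pi_put_sketch(struct kvm_vcpu *vcpu)
{
        struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

        /* Arm the wakeup handler only if the vCPU is actually blocking and
         * an IRQ is allowed to wake it. */
        if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
                pi_enable_wakeup_handler(vcpu);

        /* A preempted vCPU merely suppresses notifications; NDST and the
         * notification vector are refreshed in vmx_vcpu_pi_load() on reload. */
        if (vcpu->preempted)
                pi_set_sn(pi_desc);
}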
Unexport switch_to_{hv,sw}_timer() now that common x86 handles the
transitions.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0cd7ed21b205..cfb64bd4a1c1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1948,7 +1948,6 @@ void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu)
{
restart_apic_timer(vcpu->arch.apic);
}
-EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer);
void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
{
@@ -1960,7 +1959,6 @@ void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
start_sw_timer(apic);
preempt_enable();
}
-EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_sw_timer);
void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu)
{
--
2.33.0.882.g93a45727a2-goog
Move the fallback "wake_up" path into the helper that triggers the posted
interrupt, now that the nested and non-nested paths are identical.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f505fee3cf5c..b0d97cf18c34 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3927,7 +3927,7 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
pt_update_intercept_for_msr(vcpu);
}
-static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
int pi_vec)
{
#ifdef CONFIG_SMP
@@ -3958,10 +3958,15 @@ static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
*/
apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
- return true;
+ return;
}
#endif
- return false;
+ /*
+ * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
+ * otherwise do nothing as KVM will grab the highest priority pending
+ * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
+ */
+ kvm_vcpu_wake_up(vcpu);
}
static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
@@ -3984,8 +3989,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
*/
kvm_make_request(KVM_REQ_EVENT, vcpu);
/* the PIR and ON have been set by L1. */
- if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR))
- kvm_vcpu_wake_up(vcpu);
+ kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR);
return 0;
}
return -1;
@@ -4022,9 +4026,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
* guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
* posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
*/
- if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR))
- kvm_vcpu_wake_up(vcpu);
-
+ kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
return 0;
}
--
2.33.0.882.g93a45727a2-goog
Signal the AVIC doorbell iff the vCPU is running in the guest. If the vCPU
is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the
next VMRUN, which unconditionally processes the vIRR.
Add comments to document the logic.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/avic.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 208c5c71e827..cbf02e7e20d0 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -674,7 +674,12 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
kvm_lapic_set_irr(vec, vcpu->arch.apic);
smp_mb__after_atomic();
- if (avic_vcpu_is_running(vcpu)) {
+ /*
+ * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
+ * is in the guest. If the vCPU is not in the guest, hardware will
+ * automatically process AVIC interrupts at VMRUN.
+ */
+ if (vcpu->mode == IN_GUEST_MODE) {
int cpu = READ_ONCE(vcpu->cpu);
/*
@@ -687,8 +692,13 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
if (cpu != get_cpu())
wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
put_cpu();
- } else
+ } else {
+ /*
+ * Wake the vCPU if it was blocking. KVM will then detect the
+ * pending IRQ when checking if the vCPU has a wake event.
+ */
kvm_vcpu_wake_up(vcpu);
+ }
return 0;
}
--
2.33.0.882.g93a45727a2-goog
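Condensed, the delivery decision is now the sketch below (illustrative only;
the function name is invented and the memory-ordering comments from the real
code are trimmed).

static void avic_kick_vcpu_sketch(struct kvm_vcpu *vcpu)
{
        if (vcpu->mode == IN_GUEST_MODE) {
                int cpu = READ_ONCE(vcpu->cpu);

                /* The vCPU is in the guest: ring its doorbell so hardware
                 * injects the IRQ without a VM-Exit.  The doorbell is only
                 * needed if the target is running on a different pCPU. */
                if (cpu != get_cpu())
                        wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
                put_cpu();
        } else {
                /* Not in the guest: wake the vCPU if it's blocking; the
                 * pending vIRR bit is processed at the next VMRUN. */
                kvm_vcpu_wake_up(vcpu);
        }
}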
Drop a check that guards triggering a posted interrupt on the currently
running vCPU, and more importantly guards waking the target vCPU if
triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
The "do nothing" logic when "vcpu == running_vcpu" works only because KVM
doesn't have a path to ->deliver_posted_interrupt() from asynchronous
context, e.g. if apic_timer_expired() were changed to always go down the
posted interrupt path for APICv, or if the IN_GUEST_MODE check in
kvm_use_posted_timer_interrupt() were dropped, and the hrtimer fired in
kvm_vcpu_block() after the final kvm_vcpu_check_block() check, the vCPU
would be scheduled out without being awakened, i.e. would "miss" the
timer interrupt.
One could argue that invoking kvm_apic_local_deliver() from (soft) IRQ
context for the current running vCPU should be illegal, but nothing in
KVM actually enforces that rule. There's also no strong obvious benefit
to making such behavior illegal, e.g. checking IN_GUEST_MODE and calling
kvm_vcpu_wake_up() is at worst marginally more costly than querying the
current running vCPU.
Lastly, this aligns the non-nested and nested usage of triggering posted
interrupts, and will allow for additional cleanups.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44d760dde0f9..78c8bc7f1b3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4024,8 +4024,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
* guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
* posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
*/
- if (vcpu != kvm_get_running_vcpu() &&
- !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
+ if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
kvm_vcpu_wake_up(vcpu);
return 0;
--
2.33.0.882.g93a45727a2-goog
On Sat, Oct 9, 2021 at 7:43 AM Sean Christopherson <[email protected]> wrote:
>
> Drop kvm_arch_vcpu_block_finish() now that all arch implementations are
> nops.
>
> No functional change intended.
>
> Acked-by: Christian Borntraeger <[email protected]>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
For KVM RISC-V:
Acked-by: Anup Patel <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/arm64/include/asm/kvm_host.h | 1 -
> arch/mips/include/asm/kvm_host.h | 1 -
> arch/powerpc/include/asm/kvm_host.h | 1 -
> arch/riscv/include/asm/kvm_host.h | 1 -
> arch/s390/include/asm/kvm_host.h | 2 --
> arch/s390/kvm/kvm-s390.c | 5 -----
> arch/x86/include/asm/kvm_host.h | 2 --
> virt/kvm/kvm_main.c | 1 -
> 8 files changed, 14 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 369c30e28301..fe4dec96d1c3 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -716,7 +716,6 @@ void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);
> static inline void kvm_arch_hardware_unsetup(void) {}
> static inline void kvm_arch_sync_events(struct kvm *kvm) {}
> static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> -static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>
> void kvm_arm_init_debug(void);
> void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 696f6b009377..72b90d45a46e 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -897,7 +897,6 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
> static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> -static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>
> #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB
> int kvm_arch_flush_remote_tlb(struct kvm *kvm);
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 876c10803cda..4a195c161592 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -865,6 +865,5 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> static inline void kvm_arch_exit(void) {}
> static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> -static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>
> #endif /* __POWERPC_KVM_HOST_H__ */
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index d7e1696cd2ec..b3f0c3773603 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -209,7 +209,6 @@ struct kvm_vcpu_arch {
> static inline void kvm_arch_hardware_unsetup(void) {}
> static inline void kvm_arch_sync_events(struct kvm *kvm) {}
> static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> -static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
>
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index a604d51acfc8..a22c9266ea05 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -1010,6 +1010,4 @@ static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
>
> -void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu);
> -
> #endif
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 08ed68639a21..17fabb260c35 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -5080,11 +5080,6 @@ static inline unsigned long nonhyp_mask(int i)
> return 0x0000ffffffffffffUL >> (nonhyp_fai << 4);
> }
>
> -void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu)
> -{
> -
> -}
> -
> static int __init kvm_s390_init(void)
> {
> int i;
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88f0326c184a..7aafc27ce7a9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1926,8 +1926,6 @@ static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
> static_call_cond(kvm_x86_vcpu_unblocking)(vcpu);
> }
>
> -static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
> -
> static inline int kvm_cpu_get_apicid(int mps_cpu)
> {
> #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1292c7876d3f..f90b3ed05628 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3304,7 +3304,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> }
>
> trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu));
> - kvm_arch_vcpu_block_finish(vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_block);
>
> --
> 2.33.0.882.g93a45727a2-goog
>
> _______________________________________________
> kvmarm mailing list
> [email protected]
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
On Sat, Oct 9, 2021 at 7:43 AM Sean Christopherson <[email protected]> wrote:
>
> Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting
> the actual "block" sequences into a separate helper (to be named
> kvm_vcpu_block()). x86 will use the standalone block-only path to handle
> non-halt cases where the vCPU is not runnable.
>
> Rename block_ns to halt_ns to match the new function name.
>
> No functional change intended.
>
> Reviewed-by: David Matlack <[email protected]>
> Reviewed-by: Christian Borntraeger <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
For KVM RISC-V:
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> arch/arm64/kvm/arch_timer.c | 2 +-
> arch/arm64/kvm/arm.c | 2 +-
> arch/arm64/kvm/handle_exit.c | 2 +-
> arch/arm64/kvm/psci.c | 2 +-
> arch/mips/kvm/emulate.c | 2 +-
> arch/powerpc/kvm/book3s_pr.c | 2 +-
> arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
> arch/powerpc/kvm/booke.c | 2 +-
> arch/powerpc/kvm/powerpc.c | 2 +-
> arch/riscv/kvm/vcpu_exit.c | 2 +-
> arch/s390/kvm/interrupt.c | 2 +-
> arch/x86/kvm/x86.c | 11 +++++++++--
> include/linux/kvm_host.h | 2 +-
> virt/kvm/kvm_main.c | 20 +++++++++-----------
> 14 files changed, 30 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 3df67c127489..7e8396f74010 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -467,7 +467,7 @@ static void timer_save_state(struct arch_timer_context *ctx)
> }
>
> /*
> - * Schedule the background timer before calling kvm_vcpu_block, so that this
> + * Schedule the background timer before calling kvm_vcpu_halt, so that this
> * thread is removed from its waitqueue and made runnable when there's a timer
> * interrupt to handle.
> */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 1346f81b34df..268b1e7bf700 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -672,7 +672,7 @@ void kvm_vcpu_wfi(struct kvm_vcpu *vcpu)
> vgic_v4_put(vcpu, true);
> preempt_enable();
>
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> kvm_clear_request(KVM_REQ_UNHALT, vcpu);
>
> preempt_disable();
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 4794563a506b..6d0baf71aa67 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -82,7 +82,7 @@ static int handle_no_fpsimd(struct kvm_vcpu *vcpu)
> *
> * WFE: Yield the CPU and come back to this vcpu when the scheduler
> * decides to.
> - * WFI: Simply call kvm_vcpu_block(), which will halt execution of
> + * WFI: Simply call kvm_vcpu_halt(), which will halt execution of
> * world-switches and schedule other host processes until there is an
> * incoming IRQ or FIQ to the VM.
> */
> diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> index 74c47d420253..e275b2ca08b9 100644
> --- a/arch/arm64/kvm/psci.c
> +++ b/arch/arm64/kvm/psci.c
> @@ -46,7 +46,7 @@ static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu)
> * specification (ARM DEN 0022A). This means all suspend states
> * for KVM will preserve the register state.
> */
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> kvm_clear_request(KVM_REQ_UNHALT, vcpu);
>
> return PSCI_RET_SUCCESS;
> diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
> index 22e745e49b0a..b494d8d39290 100644
> --- a/arch/mips/kvm/emulate.c
> +++ b/arch/mips/kvm/emulate.c
> @@ -952,7 +952,7 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu)
> if (!vcpu->arch.pending_exceptions) {
> kvm_vz_lose_htimer(vcpu);
> vcpu->arch.wait = 1;
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
>
> /*
> * We we are runnable, then definitely go off to user space to
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 6bc9425acb32..0ced1b16f0e5 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -492,7 +492,7 @@ static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr)
>
> if (msr & MSR_POW) {
> if (!vcpu->arch.pending_exceptions) {
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> vcpu->stat.generic.halt_wakeup++;
>
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index ac14239f3424..1f10e7dfcdd0 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -376,7 +376,7 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
> return kvmppc_h_pr_stuff_tce(vcpu);
> case H_CEDE:
> kvmppc_set_msr_fast(vcpu, kvmppc_get_msr(vcpu) | MSR_EE);
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> vcpu->stat.generic.halt_wakeup++;
> return EMULATE_DONE;
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 977801c83aff..12abffa40cd9 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -718,7 +718,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
>
> if (vcpu->arch.shared->msr & MSR_WE) {
> local_irq_enable();
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> hard_irq_disable();
>
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index be22da157569..6a94545b99fc 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -236,7 +236,7 @@ int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
> break;
> case EV_HCALL_TOKEN(EV_IDLE):
> r = EV_SUCCESS;
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> break;
> default:
> diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
> index 13bbc3f73713..949bb9828aa5 100644
> --- a/arch/riscv/kvm/vcpu_exit.c
> +++ b/arch/riscv/kvm/vcpu_exit.c
> @@ -146,7 +146,7 @@ static int system_opcode_insn(struct kvm_vcpu *vcpu,
> vcpu->stat.wfi_exit_stat++;
> if (!kvm_arch_vcpu_runnable(vcpu)) {
> srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> }
> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
> index 520450a7956f..10bd648170b7 100644
> --- a/arch/s390/kvm/interrupt.c
> +++ b/arch/s390/kvm/interrupt.c
> @@ -1335,7 +1335,7 @@ int kvm_s390_handle_wait(struct kvm_vcpu *vcpu)
> VCPU_EVENT(vcpu, 4, "enabled wait: %llu ns", sltime);
> no_timer:
> srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> vcpu->valid_wakeup = false;
> __unset_cpu_idle(vcpu);
> vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9c23ae1d483d..e6c17bbed25c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8651,6 +8651,13 @@ void kvm_arch_exit(void)
>
> static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
> {
> + /*
> + * The vCPU has halted, e.g. executed HLT. Update the run state if the
> + * local APIC is in-kernel, the run loop will detect the non-runnable
> + * state and halt the vCPU. Exit to userspace if the local APIC is
> + * managed by userspace, in which case userspace is responsible for
> + * handling wake events.
> + */
> ++vcpu->stat.halt_exits;
> if (lapic_in_kernel(vcpu)) {
> vcpu->arch.mp_state = state;
> @@ -9892,7 +9899,7 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
> if (!kvm_arch_vcpu_runnable(vcpu) &&
> (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>
> if (kvm_x86_ops.post_block)
> @@ -10126,7 +10133,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> r = -EINTR;
> goto out;
> }
> - kvm_vcpu_block(vcpu);
> + kvm_vcpu_halt(vcpu);
> if (kvm_apic_accept_events(vcpu) < 0) {
> r = 0;
> goto out;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1ced2914d9ca..c2ea4004553a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -967,7 +967,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
> void kvm_sigset_activate(struct kvm_vcpu *vcpu);
> void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
>
> -void kvm_vcpu_block(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
> void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
> void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
> bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 227f6bbe0716..c13bf3367fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3223,17 +3223,14 @@ static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
> }
> }
>
> -/*
> - * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> - */
> -void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> +void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> {
> struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
> bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
> bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
> ktime_t start, cur, poll_end;
> bool waited = false;
> - u64 block_ns;
> + u64 halt_ns;
>
> start = cur = poll_end = ktime_get();
> if (do_halt_poll) {
> @@ -3275,7 +3272,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> ktime_to_ns(cur) - ktime_to_ns(poll_end));
> }
> out:
> - block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
> + /* The total time the vCPU was "halted", including polling time. */
> + halt_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>
> /*
> * Note, halt-polling is considered successful so long as the vCPU was
> @@ -3289,24 +3287,24 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> if (!vcpu_valid_wakeup(vcpu)) {
> shrink_halt_poll_ns(vcpu);
> } else if (vcpu->kvm->max_halt_poll_ns) {
> - if (block_ns <= vcpu->halt_poll_ns)
> + if (halt_ns <= vcpu->halt_poll_ns)
> ;
> /* we had a long block, shrink polling */
> else if (vcpu->halt_poll_ns &&
> - block_ns > vcpu->kvm->max_halt_poll_ns)
> + halt_ns > vcpu->kvm->max_halt_poll_ns)
> shrink_halt_poll_ns(vcpu);
> /* we had a short halt and our poll time is too small */
> else if (vcpu->halt_poll_ns < vcpu->kvm->max_halt_poll_ns &&
> - block_ns < vcpu->kvm->max_halt_poll_ns)
> + halt_ns < vcpu->kvm->max_halt_poll_ns)
> grow_halt_poll_ns(vcpu);
> } else {
> vcpu->halt_poll_ns = 0;
> }
> }
>
> - trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu));
> + trace_kvm_vcpu_wakeup(halt_ns, waited, vcpu_valid_wakeup(vcpu));
> }
> -EXPORT_SYMBOL_GPL(kvm_vcpu_block);
> +EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
>
> bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
> {
> --
> 2.33.0.882.g93a45727a2-goog
>
> _______________________________________________
> kvmarm mailing list
> [email protected]
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
On 09/10/21 04:12, Sean Christopherson wrote:
> Drop sanity checks on the validity of the previous pCPU when handling
> vCPU block/unlock for posted interrupts. Barring a code bug or memory
> corruption, the sanity checks will never fire, and any code bug that does
> trip the WARN is all but guaranteed to completely break posted interrupts,
> i.e. should never get anywhere near production.
>
> This is the first of several steps toward eliminating kvm_vcpu.pre_pcpu.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 24 ++++++++++--------------
> 1 file changed, 10 insertions(+), 14 deletions(-)
The idea here is to avoid making things worse by not making the list
inconsistent. But that's impossible to do if pre_pcpu goes away, so
fair enough.
Paolo
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 67cbe6ab8f66..6c2110d91b06 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -118,12 +118,10 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> } while (cmpxchg64(&pi_desc->control, old.control,
> new.control) != old.control);
>
> - if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - list_del(&vcpu->blocked_vcpu_list);
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - vcpu->pre_pcpu = -1;
> - }
> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + list_del(&vcpu->blocked_vcpu_list);
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + vcpu->pre_pcpu = -1;
> }
>
> /*
> @@ -153,14 +151,12 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
>
> WARN_ON(irqs_disabled());
> local_irq_disable();
> - if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
> - vcpu->pre_pcpu = vcpu->cpu;
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - list_add_tail(&vcpu->blocked_vcpu_list,
> - &per_cpu(blocked_vcpu_on_cpu,
> - vcpu->pre_pcpu));
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - }
> +
> + vcpu->pre_pcpu = vcpu->cpu;
> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + list_add_tail(&vcpu->blocked_vcpu_list,
> + &per_cpu(blocked_vcpu_on_cpu, vcpu->pre_pcpu));
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
>
> WARN(pi_desc->sn == 1,
> "Posted Interrupt Suppress Notification set before blocking");
>
On 09/10/21 04:12, Sean Christopherson wrote:
> Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the
> pre-block path, as NDST is guaranteed to be up-to-date. The comment
> about the vCPU being preempted during the update is simply wrong, as the
> update path runs with IRQs disabled (from before snapshotting vcpu->cpu,
> until after the update completes).
Right, it didn't as of commit bf9f6ac8d74969690df1485b33b7c238ca9f2269
(when VT-d posted interrupts were introduced).
The interrupt disable/enable pair was added in the same commit that
motivated the introduction of the sanity checks:
commit 8b306e2f3c41939ea528e6174c88cfbfff893ce1
Author: Paolo Bonzini <[email protected]>
Date: Tue Jun 6 12:57:05 2017 +0200
KVM: VMX: avoid double list add with VT-d posted interrupts
In some cases, for example involving hot-unplug of assigned
devices, pi_post_block can forget to remove the vCPU from the
blocked_vcpu_list. When this happens, the next call to
pi_pre_block corrupts the list.
Fix this in two ways. First, check vcpu->pre_pcpu in pi_pre_block
and WARN instead of adding the element twice in the list. Second,
always do the list removal in pi_post_block if vcpu->pre_pcpu is
set (not -1).
The new code keeps interrupts disabled for the whole duration of
pi_pre_block/pi_post_block. This is not strictly necessary, but
easier to follow. For the same reason, PI.ON is checked only
after the cmpxchg, and to handle it we just call the post-block
code. This removes duplication of the list removal code.
At the time, I didn't notice the now useless NDST update.
Paolo
> The vCPU can get preempted _before_ the update starts, but not during.
> And if the vCPU is preempted before, vmx_vcpu_pi_load() is responsible
> for updating NDST when the vCPU is scheduled back in. In that case, the
> check against the wakeup vector in vmx_vcpu_pi_load() cannot be true as
> that would require the notification vector to have been set to the wakeup
> vector _before_ blocking.
>
> Opportunistically switch to using vcpu->cpu for the list/lock lookups,
> which presumably used pre_pcpu only for some phantom preemption logic.
On 09/10/21 04:12, Sean Christopherson wrote:
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 7e8396f74010..addd53b6eba6 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -649,7 +649,6 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> {
> struct arch_timer_cpu *timer = vcpu_timer(vcpu);
> struct timer_map map;
> - struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
>
> if (unlikely(!timer->enabled))
> return;
> @@ -672,7 +671,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> if (map.emul_ptimer)
> soft_timer_cancel(&map.emul_ptimer->hrtimer);
>
> - if (rcuwait_active(wait))
> + if (kvm_vcpu_is_blocking(vcpu))
> kvm_timer_blocking(vcpu);
>
> /*
So this trick is what you're applying to x86 too instead of using
vmx_pre_block, I see.
Paolo
On 09/10/21 04:11, Sean Christopherson wrote:
> This is basically two series smushed into one. The first "half" aims
> to differentiate between "halt" and a more generic "block", where "halt"
> aligns with x86's HLT instruction, the halt-polling mechanisms, and
> associated stats, and "block" means any guest action that causes the vCPU
> to block/wait.
>
> The second "half" overhauls x86's APIC virtualization code (Posted
> Interrupts on Intel VMX, AVIC on AMD SVM) to do their updates in response
> to vCPU (un)blocking in the vcpu_load/put() paths, keying off of the
> vCPU's rcuwait status to determine when a blocking vCPU is being put and
> reloaded. This idea comes from arm64's kvm_timer_vcpu_put(), which I
> stumbled across when diving into the history of arm64's (un)blocking hooks.
>
> The x86 APICv overhaul allows for killing off several sets of hooks in
> common KVM and in x86 KVM (to the vendor code). Moving everything to
> vcpu_put/load() also realizes nice cleanups, especially for the Posted
> Interrupt code, which required some impressive mental gymnastics to
> understand how vCPU task migration interacted with vCPU blocking.
>
> Non-x86 folks, sorry for the noise. I'm hoping the common parts can get
> applied without much fuss so that future versions can be x86-only.
>
> v2:
> - Collect reviews. [Christian, David]
> - Add patch to move arm64 WFI functionality out of hooks. [Marc]
> - Add RISC-V to the fun.
> - Add all the APICv fun.
>
> v1: https://lkml.kernel.org/r/[email protected]
>
> Jing Zhang (1):
> KVM: stats: Add stat to detect if vcpu is currently blocking
>
> Sean Christopherson (42):
> KVM: VMX: Don't unblock vCPU w/ Posted IRQ if IRQs are disabled in
> guest
> KVM: SVM: Ensure target pCPU is read once when signalling AVIC
> doorbell
> KVM: s390: Ensure kvm_arch_no_poll() is read once when blocking vCPU
> KVM: Force PPC to define its own rcuwait object
> KVM: Update halt-polling stats if and only if halt-polling was
> attempted
> KVM: Refactor and document halt-polling stats update helper
> KVM: Reconcile discrepancies in halt-polling stats
> KVM: s390: Clear valid_wakeup in kvm_s390_handle_wait(), not in arch
> hook
> KVM: Drop obsolete kvm_arch_vcpu_block_finish()
> KVM: arm64: Move vGIC v4 handling for WFI out arch callback hook
> KVM: Don't block+unblock when halt-polling is successful
> KVM: x86: Tweak halt emulation helper names to free up kvm_vcpu_halt()
> KVM: Rename kvm_vcpu_block() => kvm_vcpu_halt()
> KVM: Split out a kvm_vcpu_block() helper from kvm_vcpu_halt()
> KVM: Don't redo ktime_get() when calculating halt-polling
> stop/deadline
> KVM: x86: Directly block (instead of "halting") UNINITIALIZED vCPUs
> KVM: x86: Invoke kvm_vcpu_block() directly for non-HALTED wait states
> KVM: Add helpers to wake/query blocking vCPU
> KVM: VMX: Skip Posted Interrupt updates if APICv is hard disabled
> KVM: VMX: Clean up PI pre/post-block WARNs
> KVM: VMX: Drop unnecessary PI logic to handle impossible conditions
> KVM: VMX: Use boolean returns for Posted Interrupt "test" helpers
> KVM: VMX: Drop pointless PI.NDST update when blocking
> KVM: VMX: Save/restore IRQs (instead of CLI/STI) during PI pre/post
> block
> KVM: VMX: Read Posted Interrupt "control" exactly once per loop
> iteration
> KVM: VMX: Move Posted Interrupt ndst computation out of write loop
> KVM: VMX: Remove vCPU from PI wakeup list before updating PID.NV
> KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
> KVM: Drop unused kvm_vcpu.pre_pcpu field
> KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
> KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
> KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
> KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
> KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
> KVM: SVM: Don't bother checking for "running" AVIC when kicking for
> IPIs
> KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with
> APICv)
> KVM: Drop defunct kvm_arch_vcpu_(un)blocking() hooks
> KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
> KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
> vCPU
> KVM: VMX: Pass desired vector instead of bool for triggering posted
> IRQ
> KVM: VMX: Fold fallback path into triggering posted IRQ helper
> KVM: VMX: Don't do full kick when handling posted interrupt wakeup
>
> arch/arm64/include/asm/kvm_emulate.h | 2 +
> arch/arm64/include/asm/kvm_host.h | 1 -
> arch/arm64/kvm/arch_timer.c | 5 +-
> arch/arm64/kvm/arm.c | 60 +++---
> arch/arm64/kvm/handle_exit.c | 5 +-
> arch/arm64/kvm/psci.c | 2 +-
> arch/mips/include/asm/kvm_host.h | 3 -
> arch/mips/kvm/emulate.c | 2 +-
> arch/powerpc/include/asm/kvm_host.h | 4 +-
> arch/powerpc/kvm/book3s_pr.c | 2 +-
> arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
> arch/powerpc/kvm/booke.c | 2 +-
> arch/powerpc/kvm/powerpc.c | 5 +-
> arch/riscv/include/asm/kvm_host.h | 1 -
> arch/riscv/kvm/vcpu_exit.c | 2 +-
> arch/s390/include/asm/kvm_host.h | 4 -
> arch/s390/kvm/interrupt.c | 3 +-
> arch/s390/kvm/kvm-s390.c | 7 +-
> arch/x86/include/asm/kvm-x86-ops.h | 4 -
> arch/x86/include/asm/kvm_host.h | 29 +--
> arch/x86/kvm/lapic.c | 4 +-
> arch/x86/kvm/svm/avic.c | 95 ++++-----
> arch/x86/kvm/svm/svm.c | 8 -
> arch/x86/kvm/svm/svm.h | 14 --
> arch/x86/kvm/vmx/nested.c | 2 +-
> arch/x86/kvm/vmx/posted_intr.c | 279 ++++++++++++---------------
> arch/x86/kvm/vmx/posted_intr.h | 14 +-
> arch/x86/kvm/vmx/vmx.c | 63 +++---
> arch/x86/kvm/vmx/vmx.h | 3 +
> arch/x86/kvm/x86.c | 55 ++++--
> include/linux/kvm_host.h | 27 ++-
> include/linux/kvm_types.h | 1 +
> virt/kvm/async_pf.c | 2 +-
> virt/kvm/kvm_main.c | 138 +++++++------
> 34 files changed, 413 insertions(+), 437 deletions(-)
>
Queued 1-20 and 22-28. Initially I skipped 21 because I didn't receive
it, but I have to think more about whether I agree with it.
In reality the CMPXCHG loops can really fail just once, because they
only race with the processor setting ON=1. But if the warnings were to
trigger at all, it would mean that something iffy is happening in the
pi_desc->control state machine, and having the check on every iteration
is (very marginally) more effective.
It's all theoretical, granted.
Paolo
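To make that concrete, the loops under discussion all have the shape below (an
illustrative reconstruction, not a specific hunk): software is the only writer
of NV/NDST/SN, so the only concurrent change that can make the cmpxchg fail is
hardware setting ON (or PIR bits), and a single retry converges.

static void pi_set_wakeup_vector_sketch(struct pi_desc *pi_desc)
{
        struct pi_desc old, new;

        do {
                old.control = new.control = READ_ONCE(pi_desc->control);

                /* Software-owned field; hardware only flips ON and PIR. */
                new.nv = POSTED_INTR_WAKEUP_VECTOR;
        } while (cmpxchg64(&pi_desc->control, old.control,
                           new.control) != old.control);
}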
On 09/10/21 04:12, Sean Christopherson wrote:
> + */
> + if (vcpu->mode == IN_GUEST_MODE) {
> int cpu = READ_ONCE(vcpu->cpu);
>
> /*
> @@ -687,8 +692,13 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> if (cpu != get_cpu())
> wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> put_cpu();
> - } else
> + } else {
> + /*
> + * Wake the vCPU if it was blocking. KVM will then detect the
> + * pending IRQ when checking if the vCPU has a wake event.
> + */
> kvm_vcpu_wake_up(vcpu);
> + }
>
Does this still need to check the "running" flag? That should be a
strict superset of vcpu->mode == IN_GUEST_MODE.
Paolo
On 09/10/21 04:12, Sean Christopherson wrote:
> + /*
> + * The smp_wmb() in kvm_make_request() pairs with the smp_mb_*()
> + * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU
> + * is guaranteed to see the event request if triggering a posted
> + * interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
This explanation doesn't make much sense to me. This is just the usual
request/kick pattern explained in
Documentation/virt/kvm/vcpu-requests.rst; except that we don't bother
with a "kick" out of guest mode because the entry always goes through
kvm_check_request (in the nVMX case) or sync_pir_to_irr (if non-nested)
and completes the delivery itself.
In other word, it is a similar idea as patch 43/43.
What this smp_wmb() pairs with is the smp_mb__after_atomic() in
kvm_check_request(KVM_REQ_EVENT, vcpu). Setting the interrupt in the
PIR orders before kvm_make_request in this thread, and orders after
kvm_make_request in the vCPU thread.
Here, instead:
> + /*
> + * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
> + * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
> + * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
> + * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
> + */
> if (vcpu != kvm_get_running_vcpu() &&
> !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
> - kvm_vcpu_kick(vcpu);
> + kvm_vcpu_wake_up(vcpu);
>
it pairs with the smp_mb__after_atomic in vmx_sync_pir_to_irr(). As
explained again in vcpu-requests.rst, the ON bit has the same function
as vcpu->request in the previous case.
Paolo
> + */
> kvm_make_request(KVM_REQ_EVENT, vcpu);
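Spelled out in code form, the non-nested sender looks roughly like the sketch
below (the RVI/APICv eligibility checks are elided; helper names follow the
VMX posted-interrupt code discussed in this thread).

static int deliver_posted_interrupt_sketch(struct kvm_vcpu *vcpu, int vector)
{
        struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

        if (pi_test_and_set_pir(vector, pi_desc))
                return 0;

        /* The locked op that sets ON is a full barrier; it pairs with the
         * smp_mb_*() after vcpu->mode is set in vcpu_enter_guest(), so
         * either the IPI below lands while the vCPU is in the guest, or
         * vmx_sync_pir_to_irr() observes ON=1 on the way back in. */
        if (pi_test_and_set_on(pi_desc))
                return 0;

        if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
                kvm_vcpu_wake_up(vcpu);

        return 0;
}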
On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> On 09/10/21 04:12, Sean Christopherson wrote:
> > + /* TODO: Document why the unblocking path checks for updates. */
>
> Is that a riddle or what? :)
Yes? I haven't been able to figure out why the unblocking path explicitly
checks and handles an APICv update.
On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> On 25/10/21 17:48, Sean Christopherson wrote:
> > On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> > > On 09/10/21 04:12, Sean Christopherson wrote:
> > > > + /* TODO: Document why the unblocking path checks for updates. */
> > >
> > > Is that a riddle or what? :)
> >
> > Yes? I haven't been able to figure out why the unblocking path explicitly
> > checks and handles an APICv update.
> >
>
> Challenge accepted. In the original code, it was because without it
> avic_vcpu_load would do nothing, and nothing would update the IS_RUNNING
> flag.
>
> It shouldn't be necessary anymore since commit df7e4827c549 ("KVM: SVM: call
> avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC", 2021-08-20),
> where svm_refresh_apicv_exec_ctrl takes care of the avic_vcpu_load on the
> next VMRUN.
Aha! Thanks, I'll work in a removal for the next version.
On 09/10/21 04:11, Sean Christopherson wrote:
> + * point, which could result in signalling the wrong/previous
> + * pCPU. But if that happens the vCPU is guaranteed to do a
> + * VMRUN (after being migrated) and thus will process pending
> + * interrupts, i.e. a doorbell is not needed (and the spurious)
... one is harmless, I suppose.
Paolo
On 09/10/21 04:12, Sean Christopherson wrote:
> Move the put and reload of the vGIC out of the block/unblock callbacks
> and into a dedicated WFI helper. Functionally, this is nearly a nop as
> the block hook is called at the very beginning of kvm_vcpu_block(), and
> the only code in kvm_vcpu_block() after the unblock hook is to update the
> halt-polling controls, i.e. can only affect the next WFI.
>
> Back when the arch (un)blocking hooks were added by commits 3217f7c25bca
> ("KVM: Add kvm_arch_vcpu_{un}blocking callbacks) and d35268da6687
> ("arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_block"),
> the hooks were invoked only when KVM was about to "block", i.e. schedule
> out the vCPU. The use case at the time was to schedule a timer in the
> host based on the earliest timer in the guest in order to wake the
> blocking vCPU when the emulated guest timer fired. Commit accb99bcd0ca
> ("KVM: arm/arm64: Simplify bg_timer programming") reworked the timer
> logic to be even more precise, by waiting until the vCPU was actually
> scheduled out, and so move the timer logic from the (un)blocking hooks to
> vcpu_load/put.
>
> In the meantime, the hooks gained usage for enabling vGIC v4 doorbells in
> commit df9ba95993b9 ("KVM: arm/arm64: GICv4: Use the doorbell interrupt
> as an unblocking source"), and added related logic for the VMCR in commit
> 5eeaf10eec39 ("KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block").
>
> Finally, commit 07ab0f8d9a12 ("KVM: Call kvm_arch_vcpu_blocking early
> into the blocking sequence") hoisted the (un)blocking hooks so that they
> wrapped KVM's halt-polling logic in addition to the core "block" logic.
>
> In other words, the original need for arch hooks to take action _only_
> in the block path is long since gone.
>
> Cc: Oliver Upton <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
This needs a word on why kvm_psci_vcpu_suspend does not need the hooks.
Or it needs to be changed to also use kvm_vcpu_wfi in the PSCI code, I
don't know.
Marc, can you review and/or advise?
Thanks,
Paolo
> ---
> arch/arm64/include/asm/kvm_emulate.h | 2 ++
> arch/arm64/kvm/arm.c | 52 +++++++++++++++++++---------
> arch/arm64/kvm/handle_exit.c | 3 +-
> 3 files changed, 38 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index fd418955e31e..de8b4f5922b7 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -41,6 +41,8 @@ void kvm_inject_vabt(struct kvm_vcpu *vcpu);
> void kvm_inject_dabt(struct kvm_vcpu *vcpu, unsigned long addr);
> void kvm_inject_pabt(struct kvm_vcpu *vcpu, unsigned long addr);
>
> +void kvm_vcpu_wfi(struct kvm_vcpu *vcpu);
> +
> static __always_inline bool vcpu_el1_is_32bit(struct kvm_vcpu *vcpu)
> {
> return !(vcpu->arch.hcr_el2 & HCR_RW);
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 7838e9fb693e..1346f81b34df 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -359,27 +359,12 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
>
> void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
> {
> - /*
> - * If we're about to block (most likely because we've just hit a
> - * WFI), we need to sync back the state of the GIC CPU interface
> - * so that we have the latest PMR and group enables. This ensures
> - * that kvm_arch_vcpu_runnable has up-to-date data to decide
> - * whether we have pending interrupts.
> - *
> - * For the same reason, we want to tell GICv4 that we need
> - * doorbells to be signalled, should an interrupt become pending.
> - */
> - preempt_disable();
> - kvm_vgic_vmcr_sync(vcpu);
> - vgic_v4_put(vcpu, true);
> - preempt_enable();
> +
> }
>
> void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
> {
> - preempt_disable();
> - vgic_v4_load(vcpu);
> - preempt_enable();
> +
> }
>
> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> @@ -662,6 +647,39 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
> smp_rmb();
> }
>
> +/**
> + * kvm_vcpu_wfi - emulate Wait-For-Interrupt behavior
> + * @vcpu: The VCPU pointer
> + *
> + * Suspend execution of a vCPU until a valid wake event is detected, i.e. until
> + * the vCPU is runnable. The vCPU may or may not be scheduled out, depending
> + * on when a wake event arrives, e.g. there may already be a pending wake event.
> + */
> +void kvm_vcpu_wfi(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * Sync back the state of the GIC CPU interface so that we have
> + * the latest PMR and group enables. This ensures that
> + * kvm_arch_vcpu_runnable has up-to-date data to decide whether
> + * we have pending interrupts, e.g. when determining if the
> + * vCPU should block.
> + *
> + * For the same reason, we want to tell GICv4 that we need
> + * doorbells to be signalled, should an interrupt become pending.
> + */
> + preempt_disable();
> + kvm_vgic_vmcr_sync(vcpu);
> + vgic_v4_put(vcpu, true);
> + preempt_enable();
> +
> + kvm_vcpu_block(vcpu);
> + kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> +
> + preempt_disable();
> + vgic_v4_load(vcpu);
> + preempt_enable();
> +}
> +
> static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
> {
> return vcpu->arch.target >= 0;
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 275a27368a04..4794563a506b 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -95,8 +95,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
> } else {
> trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
> vcpu->stat.wfi_exit_stat++;
> - kvm_vcpu_block(vcpu);
> - kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> + kvm_vcpu_wfi(vcpu);
> }
>
> kvm_incr_pc(vcpu);
>
On 09/10/21 04:12, Sean Christopherson wrote:
> + /* Nothing to do if PI.SN==0 and the vCPU isn't being migrated. */
> if (!pi_test_sn(pi_desc) && vcpu->cpu == cpu)
> return;
This does not quite say "why", so:
/* Nothing to do if PI.SN and PI.NDST both have the desired value. */
Paolo
On 09/10/21 04:12, Sean Christopherson wrote:
> When waking vCPUs in the posted interrupt wakeup handling, do exactly
> that and no more. There is no need to kick the vCPU as the wakeup
> handler just need to get the vCPU task running, and if it's in the guest
> then it's definitely running.
And more importantly, the transition from blocking to running will have
gone through sync_pir_to_irr, thus checking ON and manually moving the
vector from PIR to RVI.
Paolo
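For reference, the reload path in question looks roughly like this (sketch
only; the max_irr bookkeeping for nested and the RVI update are omitted).

static int sync_pir_to_irr_sketch(struct kvm_vcpu *vcpu)
{
        struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
        int max_irr = -1;

        if (pi_test_on(pi_desc)) {
                pi_clear_on(pi_desc);
                /* Pairs with the barrier on the sender side: any PIR bit set
                 * before ON was set is guaranteed to be visible here. */
                smp_mb__after_atomic();
                kvm_apic_update_irr(vcpu, pi_desc->pir, &max_irr);
        }

        return max_irr;
}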
On 09/10/21 04:12, Sean Christopherson wrote:
> Calculate the halt-polling "stop" time using "cur" instead of redoing
> ktime_get(). In the happy case where hardware correctly predicts
> do_halt_poll, "cur" is only a few cycles old. And if the branch is
> mispredicted, arguably that extra latency should count toward the
> halt-polling time.
>
> In all likelihood, the numbers involved are in the noise and either
> approach is perfectly ok.
Using "start" makes the change even more obvious, so:
Calculate the halt-polling "stop" time using "start" instead of redoing
ktime_get(). In practice, the numbers involved are in the noise (e.g.,
in the happy case where hardware correctly predicts do_halt_poll and
there are no interrupts, "start" is probably only a few cycles old)
and either approach is perfectly ok. But it's more precise to count
any extra latency toward the halt-polling time.
Paolo
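For context, a stripped-down sketch of the timing in question; stats, the
actual wait, and the adaptive grow/shrink logic are all omitted, and the
function name is invented.

static u64 halt_poll_window_sketch(struct kvm_vcpu *vcpu, bool do_halt_poll)
{
        ktime_t start, cur, stop;

        start = cur = ktime_get();
        if (do_halt_poll) {
                /* Deadline derived from "start", no second ktime_get(). */
                stop = ktime_add_ns(start, vcpu->halt_poll_ns);

                do {
                        if (kvm_vcpu_check_block(vcpu) < 0)
                                break;
                        cpu_relax();
                        cur = ktime_get();
                } while (ktime_before(cur, stop));
        }

        /* Total time "halted", including any polling. */
        return ktime_to_ns(cur) - ktime_to_ns(start);
}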
On 09/10/21 04:12, Sean Christopherson wrote:
>
> Lastly, this aligns the non-nested and nested usage of triggering posted
> interrupts, and will allow for additional cleanups.
It also aligns with SVM a little bit more (especially given patch 35),
doesn't it?
Paolo
On 09/10/21 04:12, Sean Christopherson wrote:
> + /* TODO: Document why the unblocking path checks for updates. */
Is that a riddle or what? :)
Paolo
> + if (kvm_vcpu_is_blocking(vcpu) &&
> + kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) {
> + kvm_vcpu_update_apicv(vcpu);
> +
> + if (!kvm_vcpu_apicv_active(vcpu))
> + return;
> + }
> +
On 25/10/21 17:48, Sean Christopherson wrote:
> On Mon, Oct 25, 2021, Paolo Bonzini wrote:
>> On 09/10/21 04:12, Sean Christopherson wrote:
>>> + /* TODO: Document why the unblocking path checks for updates. */
>>
>> Is that a riddle or what? :)
>
> Yes? I haven't been able to figure out why the unblocking path explicitly
> checks and handles an APICv update.
>
Challenge accepted. In the original code, it was because without it
avic_vcpu_load would do nothing, and nothing would update the IS_RUNNING
flag.
It shouldn't be necessary anymore since commit df7e4827c549 ("KVM: SVM:
call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC",
2021-08-20), where svm_refresh_apicv_exec_ctrl takes care of the
avic_vcpu_load on the next VMRUN.
Paolo
Am 09.10.21 um 04:11 schrieb Sean Christopherson:
> This is basically two series smushed into one. The first "half" aims
> to differentiate between "halt" and a more generic "block", where "halt"
> aligns with x86's HLT instruction, the halt-polling mechanisms, and
> associated stats, and "block" means any guest action that causes the vCPU
> to block/wait.
>
> The second "half" overhauls x86's APIC virtualization code (Posted
> Interrupts on Intel VMX, AVIC on AMD SVM) to do their updates in response
> to vCPU (un)blocking in the vcpu_load/put() paths, keying off of the
> vCPU's rcuwait status to determine when a blocking vCPU is being put and
> reloaded. This idea comes from arm64's kvm_timer_vcpu_put(), which I
> stumbled across when diving into the history of arm64's (un)blocking hooks.
>
> The x86 APICv overhaul allows for killing off several sets of hooks in
> common KVM and in x86 KVM (to the vendor code). Moving everything to
> vcpu_put/load() also realizes nice cleanups, especially for the Posted
> Interrupt code, which required some impressive mental gymnastics to
> understand how vCPU task migration interacted with vCPU blocking.
>
> Non-x86 folks, sorry for the noise. I'm hoping the common parts can get
> applied without much fuss so that future versions can be x86-only.
>
> v2:
> - Collect reviews. [Christian, David]
> - Add patch to move arm64 WFI functionality out of hooks. [Marc]
> - Add RISC-V to the fun.
> - Add all the APICv fun.
Have we actually followed up on the regression regarding halt_poll_ns=0 no longer disabling
polling for running systems?
>
> v1: https://lkml.kernel.org/r/[email protected]
>
> Jing Zhang (1):
> KVM: stats: Add stat to detect if vcpu is currently blocking
>
> Sean Christopherson (42):
> KVM: VMX: Don't unblock vCPU w/ Posted IRQ if IRQs are disabled in
> guest
> KVM: SVM: Ensure target pCPU is read once when signalling AVIC
> doorbell
> KVM: s390: Ensure kvm_arch_no_poll() is read once when blocking vCPU
> KVM: Force PPC to define its own rcuwait object
> KVM: Update halt-polling stats if and only if halt-polling was
> attempted
> KVM: Refactor and document halt-polling stats update helper
> KVM: Reconcile discrepancies in halt-polling stats
> KVM: s390: Clear valid_wakeup in kvm_s390_handle_wait(), not in arch
> hook
> KVM: Drop obsolete kvm_arch_vcpu_block_finish()
> KVM: arm64: Move vGIC v4 handling for WFI out arch callback hook
> KVM: Don't block+unblock when halt-polling is successful
> KVM: x86: Tweak halt emulation helper names to free up kvm_vcpu_halt()
> KVM: Rename kvm_vcpu_block() => kvm_vcpu_halt()
> KVM: Split out a kvm_vcpu_block() helper from kvm_vcpu_halt()
> KVM: Don't redo ktime_get() when calculating halt-polling
> stop/deadline
> KVM: x86: Directly block (instead of "halting") UNINITIALIZED vCPUs
> KVM: x86: Invoke kvm_vcpu_block() directly for non-HALTED wait states
> KVM: Add helpers to wake/query blocking vCPU
> KVM: VMX: Skip Posted Interrupt updates if APICv is hard disabled
> KVM: VMX: Clean up PI pre/post-block WARNs
> KVM: VMX: Drop unnecessary PI logic to handle impossible conditions
> KVM: VMX: Use boolean returns for Posted Interrupt "test" helpers
> KVM: VMX: Drop pointless PI.NDST update when blocking
> KVM: VMX: Save/restore IRQs (instead of CLI/STI) during PI pre/post
> block
> KVM: VMX: Read Posted Interrupt "control" exactly once per loop
> iteration
> KVM: VMX: Move Posted Interrupt ndst computation out of write loop
> KVM: VMX: Remove vCPU from PI wakeup list before updating PID.NV
> KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
> KVM: Drop unused kvm_vcpu.pre_pcpu field
> KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
> KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
> KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
> KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
> KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
> KVM: SVM: Don't bother checking for "running" AVIC when kicking for
> IPIs
> KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with
> APICv)
> KVM: Drop defunct kvm_arch_vcpu_(un)blocking() hooks
> KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
> KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
> vCPU
> KVM: VMX: Pass desired vector instead of bool for triggering posted
> IRQ
> KVM: VMX: Fold fallback path into triggering posted IRQ helper
> KVM: VMX: Don't do full kick when handling posted interrupt wakeup
>
> arch/arm64/include/asm/kvm_emulate.h | 2 +
> arch/arm64/include/asm/kvm_host.h | 1 -
> arch/arm64/kvm/arch_timer.c | 5 +-
> arch/arm64/kvm/arm.c | 60 +++---
> arch/arm64/kvm/handle_exit.c | 5 +-
> arch/arm64/kvm/psci.c | 2 +-
> arch/mips/include/asm/kvm_host.h | 3 -
> arch/mips/kvm/emulate.c | 2 +-
> arch/powerpc/include/asm/kvm_host.h | 4 +-
> arch/powerpc/kvm/book3s_pr.c | 2 +-
> arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
> arch/powerpc/kvm/booke.c | 2 +-
> arch/powerpc/kvm/powerpc.c | 5 +-
> arch/riscv/include/asm/kvm_host.h | 1 -
> arch/riscv/kvm/vcpu_exit.c | 2 +-
> arch/s390/include/asm/kvm_host.h | 4 -
> arch/s390/kvm/interrupt.c | 3 +-
> arch/s390/kvm/kvm-s390.c | 7 +-
> arch/x86/include/asm/kvm-x86-ops.h | 4 -
> arch/x86/include/asm/kvm_host.h | 29 +--
> arch/x86/kvm/lapic.c | 4 +-
> arch/x86/kvm/svm/avic.c | 95 ++++-----
> arch/x86/kvm/svm/svm.c | 8 -
> arch/x86/kvm/svm/svm.h | 14 --
> arch/x86/kvm/vmx/nested.c | 2 +-
> arch/x86/kvm/vmx/posted_intr.c | 279 ++++++++++++---------------
> arch/x86/kvm/vmx/posted_intr.h | 14 +-
> arch/x86/kvm/vmx/vmx.c | 63 +++---
> arch/x86/kvm/vmx/vmx.h | 3 +
> arch/x86/kvm/x86.c | 55 ++++--
> include/linux/kvm_host.h | 27 ++-
> include/linux/kvm_types.h | 1 +
> virt/kvm/async_pf.c | 2 +-
> virt/kvm/kvm_main.c | 138 +++++++------
> 34 files changed, 413 insertions(+), 437 deletions(-)
>
On Tue, Oct 26, 2021, Christian Borntraeger wrote:
> Am 09.10.21 um 04:11 schrieb Sean Christopherson:
> > This is basically two series smushed into one. The first "half" aims
> > to differentiate between "halt" and a more generic "block", where "halt"
> > aligns with x86's HLT instruction, the halt-polling mechanisms, and
> > associated stats, and "block" means any guest action that causes the vCPU
> > to block/wait.
> >
> > The second "half" overhauls x86's APIC virtualization code (Posted
> > Interrupts on Intel VMX, AVIC on AMD SVM) to do their updates in response
> > to vCPU (un)blocking in the vcpu_load/put() paths, keying off of the
> > vCPU's rcuwait status to determine when a blocking vCPU is being put and
> > reloaded. This idea comes from arm64's kvm_timer_vcpu_put(), which I
> > stumbled across when diving into the history of arm64's (un)blocking hooks.
> >
> > The x86 APICv overhaul allows for killing off several sets of hooks in
> > common KVM and in x86 KVM (to the vendor code). Moving everything to
> > vcpu_put/load() also realizes nice cleanups, especially for the Posted
> > Interrupt code, which required some impressive mental gymnastics to
> > understand how vCPU task migration interacted with vCPU blocking.
> >
> > Non-x86 folks, sorry for the noise. I'm hoping the common parts can get
> > applied without much fuss so that future versions can be x86-only.
> >
> > v2:
> > - Collect reviews. [Christian, David]
> > - Add patch to move arm64 WFI functionality out of hooks. [Marc]
> > - Add RISC-V to the fun.
> > - Add all the APICv fun.
>
> Have we actually followed up on the regression regarding halt_poll_ns=0 no longer disabling
> polling for running systems?
No, I have that conversation flagged but haven't gotten back to it. I still like
the idea of special casing halt_poll_ns=0 to override the capability. I can send
a proper patch for that unless there's a different/better idea?
On Mon, 25 Oct 2021 14:31:48 +0100,
Paolo Bonzini <[email protected]> wrote:
>
> On 09/10/21 04:12, Sean Christopherson wrote:
> > Move the put and reload of the vGIC out of the block/unblock callbacks
> > and into a dedicated WFI helper. Functionally, this is nearly a nop as
> > the block hook is called at the very beginning of kvm_vcpu_block(), and
> > the only code in kvm_vcpu_block() after the unblock hook is to update the
> > halt-polling controls, i.e. can only affect the next WFI.
> >
> > Back when the arch (un)blocking hooks were added by commits 3217f7c25bca
> > ("KVM: Add kvm_arch_vcpu_{un}blocking callbacks) and d35268da6687
> > ("arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_block"),
> > the hooks were invoked only when KVM was about to "block", i.e. schedule
> > out the vCPU. The use case at the time was to schedule a timer in the
> > host based on the earliest timer in the guest in order to wake the
> > blocking vCPU when the emulated guest timer fired. Commit accb99bcd0ca
> > ("KVM: arm/arm64: Simplify bg_timer programming") reworked the timer
> > logic to be even more precise, by waiting until the vCPU was actually
> > scheduled out, and so move the timer logic from the (un)blocking hooks to
> > vcpu_load/put.
> >
> > In the meantime, the hooks gained usage for enabling vGIC v4 doorbells in
> > commit df9ba95993b9 ("KVM: arm/arm64: GICv4: Use the doorbell interrupt
> > as an unblocking source"), and added related logic for the VMCR in commit
> > 5eeaf10eec39 ("KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block").
> >
> > Finally, commit 07ab0f8d9a12 ("KVM: Call kvm_arch_vcpu_blocking early
> > into the blocking sequence") hoisted the (un)blocking hooks so that they
> > wrapped KVM's halt-polling logic in addition to the core "block" logic.
> >
> > In other words, the original need for arch hooks to take action _only_
> > in the block path is long since gone.
> >
> > Cc: Oliver Upton <[email protected]>
> > Cc: Marc Zyngier <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> This needs a word on why kvm_psci_vcpu_suspend does not need the
> hooks. Or it needs to be changed to also use kvm_vcpu_wfi in the PSCI
> code, I don't know.
>
> Marc, can you review and/or advise?
I was looking at that over the weekend, and that's a pre-existing
bug. I would have addressed it independently, but it looks like you
already have queued the patch.
I guess I'll have to revisit this once the whole thing lands
somewhere.
M.
--
Without deviation from the norm, progress is not possible.
On 26/10/21 17:41, Marc Zyngier wrote:
>> This needs a word on why kvm_psci_vcpu_suspend does not need the
>> hooks. Or it needs to be changed to also use kvm_vcpu_wfi in the PSCI
>> code, I don't know.
>>
>> Marc, can you review and/or advise?
> I was looking at that over the weekend, and that's a pre-existing
> bug. I would have addressed it independently, but it looks like you
> already have queued the patch.
I have "queued" it, but that's just my queue - it's not on kernel.org
and it's not going to be in 5.16, at least not in the first batch.
There's plenty of time for me to rebase on top of a fix, if you want to
send the fix through your kvm-arm pull request. Just Cc me so that I
understand what's going on.
Thanks,
Paolo
Am 26.10.21 um 16:48 schrieb Sean Christopherson:
> On Tue, Oct 26, 2021, Christian Borntraeger wrote:
>> Am 09.10.21 um 04:11 schrieb Sean Christopherson:
>>> This is basically two series smushed into one. The first "half" aims
>>> to differentiate between "halt" and a more generic "block", where "halt"
>>> aligns with x86's HLT instruction, the halt-polling mechanisms, and
>>> associated stats, and "block" means any guest action that causes the vCPU
>>> to block/wait.
>>>
>>> The second "half" overhauls x86's APIC virtualization code (Posted
>>> Interrupts on Intel VMX, AVIC on AMD SVM) to do their updates in response
>>> to vCPU (un)blocking in the vcpu_load/put() paths, keying off of the
>>> vCPU's rcuwait status to determine when a blocking vCPU is being put and
>>> reloaded. This idea comes from arm64's kvm_timer_vcpu_put(), which I
>>> stumbled across when diving into the history of arm64's (un)blocking hooks.
>>>
>>> The x86 APICv overhaul allows for killing off several sets of hooks in
>>> common KVM and in x86 KVM (to the vendor code). Moving everything to
>>> vcpu_put/load() also realizes nice cleanups, especially for the Posted
>>> Interrupt code, which required some impressive mental gymnastics to
>>> understand how vCPU task migration interacted with vCPU blocking.
>>>
>>> Non-x86 folks, sorry for the noise. I'm hoping the common parts can get
>>> applied without much fuss so that future versions can be x86-only.
>>>
>>> v2:
>>> - Collect reviews. [Christian, David]
>>> - Add patch to move arm64 WFI functionality out of hooks. [Marc]
>>> - Add RISC-V to the fun.
>>> - Add all the APICv fun.
>>
>> Have we actually followed up on the regression regarding halt_poll_ns=0 no longer disabling
>> polling for running systems?
>
> No, I have that conversation flagged but haven't gotten back to it. I still like
> the idea of special casing halt_poll_ns=0 to override the capability. I can send
> a proper patch for that unless there's a different/better idea?
I think I would prefer a variant that uses the halt_poll_ns value AS IS for all
guests that have not opted in to the per-guest feature.
And then MAYBE have 0 as a special case to disable polling also for the opted-in
VMs.
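Something along these lines would capture that policy (purely a sketch; the helper
name and the 'override_halt_poll_ns' flag are made up here to stand for "the VM
opted in via KVM_CAP_HALT_POLL"):

    static unsigned int kvm_vcpu_max_halt_poll_ns(struct kvm_vcpu *vcpu)
    {
            struct kvm *kvm = vcpu->kvm;

            /* Module-wide halt_poll_ns=0 disables polling for everyone. */
            if (!halt_poll_ns)
                    return 0;

            /* VMs that opted in via KVM_CAP_HALT_POLL keep their own value. */
            if (kvm->override_halt_poll_ns)
                    return kvm->max_halt_poll_ns;

            /* Everyone else follows the module parameter AS IS. */
            return halt_poll_ns;
    }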
On Fri, 2021-10-08 at 19:11 -0700, Sean Christopherson wrote:
> Ensure vcpu->cpu is read once when signalling the AVIC doorbell. If the
> compiler rereads the field and the vCPU is migrated between the check and
> writing the doorbell, KVM would signal the wrong physical CPU.
Since vcpu->cpu can change at any moment anyway, I think adding READ_ONCE() can't really fix
anything, but I do agree that it makes this more readable.
Reviewed-by: Maxim Levitsky <[email protected]>
>
> Functionally, signalling the wrong CPU in this case is not an issue as
> task migration means the vCPU has exited and will pick up any pending
> interrupts on the next VMRUN. Add the READ_ONCE() purely to clean up the
> code.
>
> Opportunistically add a comment explaining the task migration behavior,
> and rename cpuid=>cpu to avoid conflating the CPU number with KVM's more
> common usage of CPUID.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/avic.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 8052d92069e0..208c5c71e827 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -675,10 +675,17 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> smp_mb__after_atomic();
>
> if (avic_vcpu_is_running(vcpu)) {
> - int cpuid = vcpu->cpu;
> + int cpu = READ_ONCE(vcpu->cpu);
>
> - if (cpuid != get_cpu())
> - wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpuid));
> + /*
> + * Note, the vCPU could get migrated to a different pCPU at any
> + * point, which could result in signalling the wrong/previous
> + * pCPU. But if that happens the vCPU is guaranteed to do a
> + * VMRUN (after being migrated) and thus will process pending
> + * interrupts, i.e. a doorbell is not needed (and the spurious one is harmless).
> + */
> + if (cpu != get_cpu())
> + wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> put_cpu();
> } else
> kvm_vcpu_wake_up(vcpu);
On Fri, 2021-10-08 at 19:11 -0700, Sean Christopherson wrote:
> Add a comment to document that halt-polling is considered successful even
> if the polling loop itself didn't detect a wake event, i.e. if a wake
> event was detect in the final kvm_vcpu_check_block(). Invert the param
> to update helper so that the helper is a dumb function that is "told"
> whether or not polling was successful, as opposed to determining success
> based on blocking behavior.
>
> Opportunistically tweak the params to the update helper to reduce the
> line length for the call site so that it fits on a single line, and so
> that the prototype conforms to the more traditional kernel style.
>
> No functional change intended.
>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/kvm_main.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6156719bcbbc..4dfcd736b274 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3201,13 +3201,15 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> return ret;
> }
>
> -static inline void
> -update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
> +static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
> + ktime_t end, bool success)
> {
> - if (waited)
> - vcpu->stat.generic.halt_poll_fail_ns += poll_ns;
> - else
> + u64 poll_ns = ktime_to_ns(ktime_sub(end, start));
> +
> + if (success)
> vcpu->stat.generic.halt_poll_success_ns += poll_ns;
> + else
> + vcpu->stat.generic.halt_poll_fail_ns += poll_ns;
> }
>
> /*
> @@ -3277,9 +3279,13 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> kvm_arch_vcpu_unblocking(vcpu);
> block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>
> + /*
> + * Note, halt-polling is considered successful so long as the vCPU was
> + * never actually scheduled out, i.e. even if the wake event arrived
> + * after the halt-polling loop itself, but before the full wait.
> + */
> if (do_halt_poll)
> - update_halt_poll_stats(
> - vcpu, ktime_to_ns(ktime_sub(poll_end, start)), waited);
> + update_halt_poll_stats(vcpu, start, poll_end, !waited);
>
> if (halt_poll_allowed) {
> if (!vcpu_valid_wakeup(vcpu)) {
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Move the halt-polling "success" and histogram stats update into the
> dedicated helper to fix a discrepancy where the success/fail "time" stats
> consider polling successful so long as the wait is avoided, but the main
> "success" and histogram stats consider polling successful if and only if
> a wake event was detected by the halt-polling loop.
>
> Move halt_attempted_poll to the helper as well so that all the stats are
> updated in a single location. While it's a bit odd to update the stat
> well after the fact, practically speaking there's no meaningful advantage
> to updating before polling.
>
> Note, there is a functional change in addition to the success vs. fail
> change. The histogram updates previously called ktime_get() instead of
> using "cur". But that change is desirable as it means all the stats are
> now updated with the same polling time, and avoids the extra ktime_get(),
> which isn't expensive but isn't free either.
>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/kvm_main.c | 35 ++++++++++++++++-------------------
> 1 file changed, 16 insertions(+), 19 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4dfcd736b274..1292c7876d3f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3204,12 +3204,23 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
> ktime_t end, bool success)
> {
> + struct kvm_vcpu_stat_generic *stats = &vcpu->stat.generic;
> u64 poll_ns = ktime_to_ns(ktime_sub(end, start));
>
> - if (success)
> - vcpu->stat.generic.halt_poll_success_ns += poll_ns;
> - else
> - vcpu->stat.generic.halt_poll_fail_ns += poll_ns;
> + ++vcpu->stat.generic.halt_attempted_poll;
> +
> + if (success) {
> + ++vcpu->stat.generic.halt_successful_poll;
> +
> + if (!vcpu_valid_wakeup(vcpu))
> + ++vcpu->stat.generic.halt_poll_invalid;
> +
> + stats->halt_poll_success_ns += poll_ns;
> + KVM_STATS_LOG_HIST_UPDATE(stats->halt_poll_success_hist, poll_ns);
> + } else {
> + stats->halt_poll_fail_ns += poll_ns;
> + KVM_STATS_LOG_HIST_UPDATE(stats->halt_poll_fail_hist, poll_ns);
> + }
> }
>
> /*
> @@ -3230,30 +3241,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> if (do_halt_poll) {
> ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>
> - ++vcpu->stat.generic.halt_attempted_poll;
> do {
> /*
> * This sets KVM_REQ_UNHALT if an interrupt
> * arrives.
> */
> - if (kvm_vcpu_check_block(vcpu) < 0) {
> - ++vcpu->stat.generic.halt_successful_poll;
> - if (!vcpu_valid_wakeup(vcpu))
> - ++vcpu->stat.generic.halt_poll_invalid;
> -
> - KVM_STATS_LOG_HIST_UPDATE(
> - vcpu->stat.generic.halt_poll_success_hist,
> - ktime_to_ns(ktime_get()) -
> - ktime_to_ns(start));
> + if (kvm_vcpu_check_block(vcpu) < 0)
> goto out;
> - }
> cpu_relax();
> poll_end = cur = ktime_get();
> } while (kvm_vcpu_can_poll(cur, stop));
> -
> - KVM_STATS_LOG_HIST_UPDATE(
> - vcpu->stat.generic.halt_poll_fail_hist,
> - ktime_to_ns(ktime_get()) - ktime_to_ns(start));
> }
>
>
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Invoke the arch hooks for block+unblock if and only if KVM actually
> attempts to block the vCPU. The only non-nop implementation is on x86,
> specifically SVM's AVIC, and there is no need to put the AVIC prior to
> halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR
> to find pending IRQs regardless of whether the AVIC is loaded/"running".
>
> The primary motivation is to allow future cleanup to split out "block"
> from "halt", but this is also likely a small performance boost on x86 SVM
> when halt-polling is successful.
>
> Adjust the post-block path to update "cur" after unblocking, i.e. include
> AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior
> is consistent. Moving just the pre-block arch hook would result in only
> the AVIC put latency being included in the halt_wait stats. There is no
> obvious evidence that one way or the other is correct, so just ensure KVM
> is consistent.
>
> Note, x86 has two separate paths for handling APICv with respect to vCPU
> blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
> hooks in kvm_vcpu_block(). Prior to this patch, the two paths were more
> or less functionally identical. That is very much not the case after
> this patch, as the hooks used by VMX _must_ fire before halt-polling.
> x86's entire mess will be cleaned up in future patches.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/kvm_main.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f90b3ed05628..227f6bbe0716 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> bool waited = false;
> u64 block_ns;
>
> - kvm_arch_vcpu_blocking(vcpu);
> -
> start = cur = poll_end = ktime_get();
> if (do_halt_poll) {
> ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> @@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> } while (kvm_vcpu_can_poll(cur, stop));
> }
>
> + kvm_arch_vcpu_blocking(vcpu);
>
> prepare_to_rcuwait(wait);
> for (;;) {
> @@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> schedule();
> }
> finish_rcuwait(wait);
> +
> + kvm_arch_vcpu_unblocking(vcpu);
> +
> cur = ktime_get();
> if (waited) {
> vcpu->stat.generic.halt_wait_ns +=
> @@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> ktime_to_ns(cur) - ktime_to_ns(poll_end));
> }
> out:
> - kvm_arch_vcpu_unblocking(vcpu);
> block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>
> /*
Makes sense.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> On 09/10/21 04:11, Sean Christopherson wrote:
> Queued 1-20 and 22-28. Initially I skipped 21 because I didn't receive it,
> but I have to think more about whether I agree with it.
https://lkml.kernel.org/r/[email protected]
> In reality the CMPXCHG loops can really fail just once, because they only
> race with the processor setting ON=1. But if the warnings were to trigger
> at all, it would mean that something iffy is happening in the
> pi_desc->control state machine, and having the check on every iteration is
> (very marginally) more effective.
Yeah, the "very marginally" caveat is essentially my argument. The WARNs are
really there to ensure that the vCPU itself did the correct setup/clean before
and after blocking. Because IRQs are disabled, a failure on iteration>0 but not
iteration=0 would mean that a different CPU or a device modified the PI descriptor.
If that happens, (a) something is wildly wrong and (b) as you noted, the odds of
the WARN firing in the tiny window between iteration=0 and iteration=1 are really,
really low.
The other thing I don't like about having the WARN in the loop is that it suggests
that something other than the vCPU can modify the NDST and SN fields, which is
wrong and confusing (for me). The WARNs in the loops made more sense when the
loops ran with IRQs enabled prior to commit 8b306e2f3c41 ("KVM: VMX: avoid
double list add with VT-d posted interrupts"). Then it would be at least plausible
that a vCPU could mess up its own descriptor while being scheduled out/in.
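Concretely, that gives a structure roughly like this (just a sketch, not the exact
patch):

    struct pi_desc old, new;

    /* Only this vCPU can change SN and NV/NDST, so check them once. */
    WARN_ON_ONCE(pi_desc->sn);
    WARN_ON_ONCE(pi_desc->nv != POSTED_INTR_VECTOR);

    do {
            old.control = new.control = READ_ONCE(pi_desc->control);

            /* Only ON can change under our feet (IOMMU or another vCPU). */
            new.nv = POSTED_INTR_WAKEUP_VECTOR;
    } while (cmpxchg64(&pi_desc->control, old.control, new.control) != old.control);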
On Wed, 2021-10-27 at 17:10 +0300, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Rename a variety of HLT-related helpers to free up the function name
> > "kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
> > between "block" and "halt".
> >
> > No functional change intended.
> >
> > Reviewed-by: David Matlack <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 +-
> > arch/x86/kvm/vmx/nested.c | 2 +-
> > arch/x86/kvm/vmx/vmx.c | 4 ++--
> > arch/x86/kvm/x86.c | 13 +++++++------
> > 4 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 7aafc27ce7a9..328103a520d3 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1689,7 +1689,7 @@ int kvm_emulate_monitor(struct kvm_vcpu *vcpu);
> > int kvm_fast_pio(struct kvm_vcpu *vcpu, int size, unsigned short port, int in);
> > int kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
> > int kvm_emulate_halt(struct kvm_vcpu *vcpu);
> > -int kvm_vcpu_halt(struct kvm_vcpu *vcpu);
> > +int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu);
> > int kvm_emulate_ap_reset_hold(struct kvm_vcpu *vcpu);
> > int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index af1bbb73430a..d0237a441feb 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -3619,7 +3619,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> > !(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) &&
> > (vmcs12->guest_rflags & X86_EFLAGS_IF))) {
> > vmx->nested.nested_run_pending = 0;
> > - return kvm_vcpu_halt(vcpu);
> > + return kvm_emulate_halt_noskip(vcpu);
> > }
> > break;
> > case GUEST_ACTIVITY_WAIT_SIPI:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 1c8b2b6e7ed9..5517893f12fc 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -4741,7 +4741,7 @@ static int handle_rmode_exception(struct kvm_vcpu *vcpu,
> > if (kvm_emulate_instruction(vcpu, 0)) {
> > if (vcpu->arch.halt_request) {
> > vcpu->arch.halt_request = 0;
> > - return kvm_vcpu_halt(vcpu);
> > + return kvm_emulate_halt_noskip(vcpu);
>
> Could you elaborate on why you chose the _noskip suffix?
>
> As far as I can see, kvm_vcpu_halt just calls __kvm_vcpu_halt with a new vCPU run state/exit
> reason, which is used only when the local APIC is not in the kernel (which is not a
> well-supported configuration these days).
>
> The other user of __kvm_vcpu_halt is something SEV-related.
>
> Best regards,
> Maxim Levitsky
>
>
> > }
> > return 1;
> > }
> > @@ -5415,7 +5415,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> >
> > if (vcpu->arch.halt_request) {
> > vcpu->arch.halt_request = 0;
> > - return kvm_vcpu_halt(vcpu);
> > + return kvm_emulate_halt_noskip(vcpu);
> > }
> >
> > /*
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4a52a08707de..9c23ae1d483d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8649,7 +8649,7 @@ void kvm_arch_exit(void)
> > #endif
> > }
> >
> > -static int __kvm_vcpu_halt(struct kvm_vcpu *vcpu, int state, int reason)
> > +static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
> > {
> > ++vcpu->stat.halt_exits;
> > if (lapic_in_kernel(vcpu)) {
> > @@ -8661,11 +8661,11 @@ static int __kvm_vcpu_halt(struct kvm_vcpu *vcpu, int state, int reason)
> > }
> > }
> >
> > -int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> > +int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu)
> > {
> > - return __kvm_vcpu_halt(vcpu, KVM_MP_STATE_HALTED, KVM_EXIT_HLT);
> > + return __kvm_emulate_halt(vcpu, KVM_MP_STATE_HALTED, KVM_EXIT_HLT);
> > }
> > -EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
> > +EXPORT_SYMBOL_GPL(kvm_emulate_halt_noskip);
> >
> > int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> > {
> > @@ -8674,7 +8674,7 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> > * TODO: we might be squashing a GUESTDBG_SINGLESTEP-triggered
> > * KVM_EXIT_DEBUG here.
> > */
> > - return kvm_vcpu_halt(vcpu) && ret;
> > + return kvm_emulate_halt_noskip(vcpu) && ret;
> > }
> > EXPORT_SYMBOL_GPL(kvm_emulate_halt);
> >
> > @@ -8682,7 +8682,8 @@ int kvm_emulate_ap_reset_hold(struct kvm_vcpu *vcpu)
> > {
> > int ret = kvm_skip_emulated_instruction(vcpu);
> >
> > - return __kvm_vcpu_halt(vcpu, KVM_MP_STATE_AP_RESET_HOLD, KVM_EXIT_AP_RESET_HOLD) && ret;
> > + return __kvm_emulate_halt(vcpu, KVM_MP_STATE_AP_RESET_HOLD,
> > + KVM_EXIT_AP_RESET_HOLD) && ret;
> > }
> > EXPORT_SYMBOL_GPL(kvm_emulate_ap_reset_hold);
> >
Also, while at it, why not use, say, '__kvm_emulate_hlt' ('hlt' instead of 'halt') to
put emphasis on the fact that we are emulating a CPU instruction?
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Go directly to kvm_vcpu_block() when handling the case where userspace
> attempts to run an UNINITIALIZED vCPU. The vCPU is not halted, nor is it
> likely that halt-polling will be successful in this case.
>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/x86.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e6c17bbed25c..cd51f100e906 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10133,7 +10133,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> r = -EINTR;
> goto out;
> }
> - kvm_vcpu_halt(vcpu);
> + kvm_vcpu_block(vcpu);
> if (kvm_apic_accept_events(vcpu) < 0) {
> r = 0;
> goto out;
Makes sense.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On 27/10/21 16:41, Sean Christopherson wrote:
> The other thing I don't like about having the WARN in the loop is that it suggests
> that something other than the vCPU can modify the NDST and SN fields, which is
> wrong and confusing (for me).
Yeah, I can agree with that. Can you add it in a comment above the
cmpxchg loop, it can be as simple as
/* The processor can set ON concurrently. */
when you respin patch 21 and the rest of the series?
Paolo
On Wed, Oct 27, 2021, Paolo Bonzini wrote:
> On 27/10/21 17:06, Sean Christopherson wrote:
> > > Does this still need to check the "running" flag? That should be a strict
> > > superset of vcpu->mode == IN_GUEST_MODE.
> >
> > No. Signalling the doorbell when "running" is set but the vCPU is not in the
> > guest is just an expensive nop. So even if KVM were to rework its handling of
> > "running" to set the flag immediately before VMRUN and clear it immediately after,
> > keying off IN_GUEST_MODE and not "running" would not be wrong, just sub-optimal.
> >
> > I doubt KVM will ever make the "running" flag super precise, because keeping the
> > flag set when the vCPU is loaded avoids VM-Exits on other vCPUs due to undelivered
> > IPIs.
>
> Right, so should we drop the "if (running)" check in this patch, at the same
> time as it's adding the IN_GUEST_MODE check?
LOL, I think we have a Three^WTwo Stooges routine going on. This patch does
remove avic_vcpu_is_running() and replaces it with the vcpu->mode check. Or am
I completely misunderstanding what you're referring to?
- if (avic_vcpu_is_running(vcpu)) {
+ /*
+ * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
+ * is in the guest. If the vCPU is not in the guest, hardware will
+ * automatically process AVIC interrupts at VMRUN.
+ */
+ if (vcpu->mode == IN_GUEST_MODE) {
int cpu = READ_ONCE(vcpu->cpu);
On 27/10/21 18:08, Sean Christopherson wrote:
>> Right, so should we drop the "if (running)" check in this patch, at the same
>> time as it's adding the IN_GUEST_MODE check?
> LOL, I think we have a Three^WTwo Stooges routine going on. This patch does
> remove avic_vcpu_is_running() and replaces it with the vcpu->mode check. Or am
> I completely misunderstanding what you're referring to?
>
> - if (avic_vcpu_is_running(vcpu)) {
> + /*
> + * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
> + * is in the guest. If the vCPU is not in the guest, hardware will
> + * automatically process AVIC interrupts at VMRUN.
> + */
> + if (vcpu->mode == IN_GUEST_MODE) {
> int cpu = READ_ONCE(vcpu->cpu);
Nevermind, I confused svm_deliver_avic_intr with avic_kick_target_vcpus,
which anyway you are handling in patch 36.
Paolo
On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> On 09/10/21 04:12, Sean Christopherson wrote:
> >
> > Lastly, this aligns the non-nested and nested usage of triggering posted
> > interrupts, and will allow for additional cleanups.
>
> It also aligns with SVM a little bit more (especially given patch 35),
> doesn't it?
Yes, aligning VMX and SVM APICv behavior as much as possible is definitely a goal
of this series, though I suspect I failed to state that anywhere.
On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> On 09/10/21 04:12, Sean Christopherson wrote:
> > + /*
> > + * The smp_wmb() in kvm_make_request() pairs with the smp_mb_*()
> > + * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU
> > + * is guaranteed to see the event request if triggering a posted
> > + * interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
>
> This explanation doesn't make much sense to me. This is just the usual
> request/kick pattern explained in Documentation/virt/kvm/vcpu-requests.rst;
> except that we don't bother with a "kick" out of guest mode because the
> entry always goes through kvm_check_request (in the nVMX case) or
> sync_pir_to_irr (if non-nested) and completes the delivery itself.
>
> In other words, it is a similar idea to patch 43/43.
>
> What this smp_wmb() pairs with is the smp_mb__after_atomic in
> kvm_check_request(KVM_REQ_EVENT, vcpu).
I don't think that's correct. There is no kvm_check_request() in the relevant path.
kvm_vcpu_exit_request() uses kvm_request_pending(), which is just a READ_ONCE()
without a barrier. The smp_mb__after_atomic ensures that any assets that were
modified prior to making the request are seen by the vCPU handling the request.
It does not provide any guarantees for a different vCPU/task making a request
and checking vcpu->mode versus the target vCPU setting vcpu->mode and checking
for a pending request.
> Setting the interrupt in the PIR orders before kvm_make_request in this
> thread, and orders after kvm_make_request in the vCPU thread.
>
> Here, instead:
>
> > + /*
> > + * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
> > + * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
> > + * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
> > + * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
> > + */
> > if (vcpu != kvm_get_running_vcpu() &&
> > !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
> > - kvm_vcpu_kick(vcpu);
> > + kvm_vcpu_wake_up(vcpu);
>
> it pairs with the smp_mb__after_atomic in vmx_sync_pir_to_irr(). As
> explained again in vcpu-requests.rst, the ON bit has the same function as
> vcpu->request in the previous case.
Same as above, I don't think that's correct. The smp_mb__after_atomic() ensures
that there's no race between the IOMMU writing vIRR and setting ON, and KVM
clearing ON and processing the vIRR.
pi_test_on() is not an atomic operation, and there's no memory barrier if ON=0.
It's the same behavior as kvm_check_request(), but again the ordering with respect
to vcpu->mode isn't being handled by PID.ON/kvm_check_request().
AIUI, this is the barrier that's paired with the PI barriers. This is even called
out in (2).
vcpu->mode = IN_GUEST_MODE;
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
/*
* 1) We should set ->mode before checking ->requests. Please see
* the comment in kvm_vcpu_exiting_guest_mode().
*
* 2) For APICv, we should set ->mode before checking PID.ON. This
* pairs with the memory barrier implicit in pi_test_and_set_on
* (see vmx_deliver_posted_interrupt).
*
* 3) This also orders the write to mode from any reads to the page
* tables done while the VCPU is running. Please see the comment
* in kvm_flush_remote_tlbs.
*/
smp_mb__after_srcu_read_unlock();
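For reference, pi_test_on() is just a plain test_bit() on the descriptor, i.e. there is
no barrier implied on that read; a sketch from memory:

    static inline int pi_test_on(struct pi_desc *pi_desc)
    {
            return test_bit(POSTED_INTR_ON,
                            (unsigned long *)&pi_desc->control);
    }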
On 27/10/21 17:06, Sean Christopherson wrote:
>> Does this still need to check the "running" flag? That should be a strict
>> superset of vcpu->mode == IN_GUEST_MODE.
>
> No. Signalling the doorbell when "running" is set but the vCPU is not in the
> guest is just an expensive nop. So even if KVM were to rework its handling of
> "running" to set the flag immediately before VMRUN and clear it immediately after,
> keying off IN_GUEST_MODE and not "running" would not be wrong, just sub-optimal.
>
> I doubt KVM will ever make the "running" flag super precise, because keeping the
> flag set when the vCPU is loaded avoids VM-Exits on other vCPUs due to undelivered
> IPIs.
Right, so should we drop the "if (running)" check in this patch, at the
same time as it's adding the IN_GUEST_MODE check?
Paolo
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Rename a variety of HLT-related helpers to free up the function name
> "kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
> between "block" and "halt".
>
> No functional change intended.
>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/vmx/nested.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 4 ++--
> arch/x86/kvm/x86.c | 13 +++++++------
> 4 files changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7aafc27ce7a9..328103a520d3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1689,7 +1689,7 @@ int kvm_emulate_monitor(struct kvm_vcpu *vcpu);
> int kvm_fast_pio(struct kvm_vcpu *vcpu, int size, unsigned short port, int in);
> int kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
> int kvm_emulate_halt(struct kvm_vcpu *vcpu);
> -int kvm_vcpu_halt(struct kvm_vcpu *vcpu);
> +int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu);
> int kvm_emulate_ap_reset_hold(struct kvm_vcpu *vcpu);
> int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index af1bbb73430a..d0237a441feb 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3619,7 +3619,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> !(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) &&
> (vmcs12->guest_rflags & X86_EFLAGS_IF))) {
> vmx->nested.nested_run_pending = 0;
> - return kvm_vcpu_halt(vcpu);
> + return kvm_emulate_halt_noskip(vcpu);
> }
> break;
> case GUEST_ACTIVITY_WAIT_SIPI:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 1c8b2b6e7ed9..5517893f12fc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4741,7 +4741,7 @@ static int handle_rmode_exception(struct kvm_vcpu *vcpu,
> if (kvm_emulate_instruction(vcpu, 0)) {
> if (vcpu->arch.halt_request) {
> vcpu->arch.halt_request = 0;
> - return kvm_vcpu_halt(vcpu);
> + return kvm_emulate_halt_noskip(vcpu);
Could you elaborate on why you chose the _noskip suffix?
As far as I can see, kvm_vcpu_halt just calls __kvm_vcpu_halt with a new vCPU run state/exit
reason, which is used only when the local APIC is not in the kernel (which is not a
well-supported configuration these days).
The other user of __kvm_vcpu_halt is something SEV-related.
Best regards,
Maxim Levitsky
> }
> return 1;
> }
> @@ -5415,7 +5415,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>
> if (vcpu->arch.halt_request) {
> vcpu->arch.halt_request = 0;
> - return kvm_vcpu_halt(vcpu);
> + return kvm_emulate_halt_noskip(vcpu);
> }
>
> /*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4a52a08707de..9c23ae1d483d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8649,7 +8649,7 @@ void kvm_arch_exit(void)
> #endif
> }
>
> -static int __kvm_vcpu_halt(struct kvm_vcpu *vcpu, int state, int reason)
> +static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
> {
> ++vcpu->stat.halt_exits;
> if (lapic_in_kernel(vcpu)) {
> @@ -8661,11 +8661,11 @@ static int __kvm_vcpu_halt(struct kvm_vcpu *vcpu, int state, int reason)
> }
> }
>
> -int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> +int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu)
> {
> - return __kvm_vcpu_halt(vcpu, KVM_MP_STATE_HALTED, KVM_EXIT_HLT);
> + return __kvm_emulate_halt(vcpu, KVM_MP_STATE_HALTED, KVM_EXIT_HLT);
> }
> -EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
> +EXPORT_SYMBOL_GPL(kvm_emulate_halt_noskip);
>
> int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> {
> @@ -8674,7 +8674,7 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> * TODO: we might be squashing a GUESTDBG_SINGLESTEP-triggered
> * KVM_EXIT_DEBUG here.
> */
> - return kvm_vcpu_halt(vcpu) && ret;
> + return kvm_emulate_halt_noskip(vcpu) && ret;
> }
> EXPORT_SYMBOL_GPL(kvm_emulate_halt);
>
> @@ -8682,7 +8682,8 @@ int kvm_emulate_ap_reset_hold(struct kvm_vcpu *vcpu)
> {
> int ret = kvm_skip_emulated_instruction(vcpu);
>
> - return __kvm_vcpu_halt(vcpu, KVM_MP_STATE_AP_RESET_HOLD, KVM_EXIT_AP_RESET_HOLD) && ret;
> + return __kvm_emulate_halt(vcpu, KVM_MP_STATE_AP_RESET_HOLD,
> + KVM_EXIT_AP_RESET_HOLD) && ret;
> }
> EXPORT_SYMBOL_GPL(kvm_emulate_ap_reset_hold);
>
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Call kvm_vcpu_block() directly for all wait states except HALTED so that
> kvm_vcpu_halt() is no longer a misnomer on x86.
>
> Functionally, this means KVM will never attempt halt-polling or adjust
> vcpu->halt_poll_ns for INIT_RECEIVED (a.k.a. Wait-For-SIPI (WFS)) or
> AP_RESET_HOLD; UNINITIALIZED is handled in kvm_arch_vcpu_ioctl_run(),
> and x86 doesn't use any other "wait" states.
>
> As mentioned above, the motivation of this is purely so that "halt" isn't
> overloaded on x86, e.g. in KVM's stats. Skipping halt-polling for WFS
> (and RESET_HOLD) has no meaningful effect on guest performance as there
> are typically single-digit numbers of INIT-SIPI sequences per AP vCPU,
> per boot, versus thousands of HLTs just to boot to console.
>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/x86.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cd51f100e906..e0219acfd9cf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9899,7 +9899,10 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
> if (!kvm_arch_vcpu_runnable(vcpu) &&
> (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> - kvm_vcpu_halt(vcpu);
> + if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
> + kvm_vcpu_halt(vcpu);
> + else
> + kvm_vcpu_block(vcpu);
> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>
> if (kvm_x86_ops.post_block)
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> On 09/10/21 04:12, Sean Christopherson wrote:
> > + */
> > + if (vcpu->mode == IN_GUEST_MODE) {
> > int cpu = READ_ONCE(vcpu->cpu);
> > /*
> > @@ -687,8 +692,13 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> > if (cpu != get_cpu())
> > wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> > put_cpu();
> > - } else
> > + } else {
> > + /*
> > + * Wake the vCPU if it was blocking. KVM will then detect the
> > + * pending IRQ when checking if the vCPU has a wake event.
> > + */
> > kvm_vcpu_wake_up(vcpu);
> > + }
>
> Does this still need to check the "running" flag? That should be a strict
> superset of vcpu->mode == IN_GUEST_MODE.
No. Signalling the doorbell when "running" is set but the vCPU is not in the
guest is just an expensive nop. So even if KVM were to rework its handling of
"running" to set the flag immediately before VMRUN and clear it immediately after,
keying off IN_GUEST_MODE and not "running" would not be wrong, just sub-optimal.
I doubt KVM will ever make the "running" flag super precise, because keeping the
flag set when the vCPU is loaded avoids VM-Exits on other vCPUs due to undelivered
IPIs. But the flip side is that it means the flag has terrible granularity, and
is arguably inaccurate when viewed from a software perspective. Anyways, if the
treatment of "running" were ever changed, then this code should also be changed
to essentially revert this commit since vcpu->mode would then be redundant.
And IMO, it makes sense to intentionally separate KVM's delivery of interrupts
from hardware's delivery of interrupts. I.e. use the same core rules as
kvm_vcpu_kick() for when to send interrupts and when to wake for the AVIC.
On Wed, Oct 27, 2021, Paolo Bonzini wrote:
> On 27/10/21 16:41, Sean Christopherson wrote:
> > The other thing I don't like about having the WARN in the loop is that it suggests
> > that something other than the vCPU can modify the NDST and SN fields, which is
> > wrong and confusing (for me).
>
> Yeah, I can agree with that. Can you add it in a comment above the cmpxchg
> loop, it can be as simple as
>
> /* The processor can set ON concurrently. */
>
> when you respin patch 21 and the rest of the series?
I can definitely add a comment, but I think that comment is incorrect. AIUI,
the CPU is the one thing in the system that _doesn't_ set ON, at least not without
IPI virtualization (haven't read that spec yet). KVM (software) sets it when
emulating IPIs, and the IOMMU (hardware) sets it for "real" posted interrupts,
but the CPU (sans IPI virtualization) only clears ON when processing an IRQ on
the notification vector.
So something like this?
/* ON can be set concurrently by a different vCPU or by hardware. */
On Mon, 2021-10-25 at 16:26 +0200, Paolo Bonzini wrote:
> On 09/10/21 04:12, Sean Christopherson wrote:
> > Calculate the halt-polling "stop" time using "cur" instead of redoing
> > ktime_get(). In the happy case where hardware correctly predicts
> > do_halt_poll, "cur" is only a few cycles old. And if the branch is
> > mispredicted, arguably that extra latency should count toward the
> > halt-polling time.
> >
> > In all likelihood, the numbers involved are in the noise and either
> > approach is perfectly ok.
>
> Using "start" makes the change even more obvious, so:
>
> Calculate the halt-polling "stop" time using "start" instead of redoing
> ktime_get(). In practice, the numbers involved are in the noise (e.g.,
> in the happy case where hardware correctly predicts do_halt_poll and
> there are no interrupts, "start" is probably only a few cycles old)
> and either approach is perfectly ok. But it's more precise to count
> any extra latency toward the halt-polling time.
>
> Paolo
>
Agreed.
Reviewed-by: Maxim Levitsky <[email protected]>
On 27/10/21 17:28, Sean Christopherson wrote:
> On Wed, Oct 27, 2021, Paolo Bonzini wrote:
>> On 27/10/21 16:41, Sean Christopherson wrote:
>>> The other thing I don't like about having the WARN in the loop is that it suggests
>>> that something other than the vCPU can modify the NDST and SN fields, which is
>>> wrong and confusing (for me).
>>
>> Yeah, I can agree with that. Can you add it in a comment above the cmpxchg
>> loop, it can be as simple as
>>
>> /* The processor can set ON concurrently. */
>>
>> when you respin patch 21 and the rest of the series?
>
> I can definitely add a comment, but I think that comment is incorrect.
It's completely backwards indeed. I first had "the hardware" and then
shut down my brain for a second to replace it.
> So something like this?
>
> /* ON can be set concurrently by a different vCPU or by hardware. */
Yes, of course.
Paolo
On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> On 09/10/21 04:12, Sean Christopherson wrote:
> > Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the
> > pre-block path, as NDST is guaranteed to be up-to-date. The comment
> > about the vCPU being preempted during the update is simply wrong, as the
> > update path runs with IRQs disabled (from before snapshotting vcpu->cpu,
> > until after the update completes).
>
> Right, it didn't as of commit bf9f6ac8d74969690df1485b33b7c238ca9f2269 (when
> VT-d posted interrupts were introduced).
>
> The interrupt disable/enable pair was added in the same commit that
> motivated the introduction of the sanity checks:
Ya, I found that commit when digging around for a different commit in the series
and forgot to come back to this changelog. I'll incorporate this info into the
next version.
> commit 8b306e2f3c41939ea528e6174c88cfbfff893ce1
> Author: Paolo Bonzini <[email protected]>
> Date: Tue Jun 6 12:57:05 2017 +0200
>
> KVM: VMX: avoid double list add with VT-d posted interrupts
>
> In some cases, for example involving hot-unplug of assigned
> devices, pi_post_block can forget to remove the vCPU from the
> blocked_vcpu_list. When this happens, the next call to
> pi_pre_block corrupts the list.
>
> Fix this in two ways. First, check vcpu->pre_pcpu in pi_pre_block
> and WARN instead of adding the element twice in the list. Second,
> always do the list removal in pi_post_block if vcpu->pre_pcpu is
> set (not -1).
>
> The new code keeps interrupts disabled for the whole duration of
> pi_pre_block/pi_post_block. This is not strictly necessary, but
> easier to follow. For the same reason, PI.ON is checked only
> after the cmpxchg, and to handle it we just call the post-block
> code. This removes duplication of the list removal code.
>
> At the time, I didn't notice the now useless NDST update.
>
> Paolo
>
> > The vCPU can get preempted _before_ the update starts, but not during.
> > And if the vCPU is preempted before, vmx_vcpu_pi_load() is responsible
> > for updating NDST when the vCPU is scheduled back in. In that case, the
> > check against the wakeup vector in vmx_vcpu_pi_load() cannot be true as
> > that would require the notification vector to have been set to the wakeup
> > vector _before_ blocking.
> >
> > Opportunistically switch to using vcpu->cpu for the list/lock lookups,
> > which presumably used pre_pcpu only for some phantom preemption logic.
>
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Add helpers to wake and query a blocking vCPU. In addition to providing
> nice names, the helpers reduce the probability of KVM neglecting to use
> kvm_arch_vcpu_get_wait().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/arm64/kvm/arch_timer.c | 3 +--
> arch/arm64/kvm/arm.c | 2 +-
> arch/x86/kvm/lapic.c | 2 +-
> include/linux/kvm_host.h | 14 ++++++++++++++
> virt/kvm/async_pf.c | 2 +-
> virt/kvm/kvm_main.c | 8 ++------
> 6 files changed, 20 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 7e8396f74010..addd53b6eba6 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -649,7 +649,6 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> {
> struct arch_timer_cpu *timer = vcpu_timer(vcpu);
> struct timer_map map;
> - struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
>
> if (unlikely(!timer->enabled))
> return;
> @@ -672,7 +671,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> if (map.emul_ptimer)
> soft_timer_cancel(&map.emul_ptimer->hrtimer);
>
> - if (rcuwait_active(wait))
> + if (kvm_vcpu_is_blocking(vcpu))
> kvm_timer_blocking(vcpu);
>
> /*
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 268b1e7bf700..9ff0e85a9f16 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -622,7 +622,7 @@ void kvm_arm_resume_guest(struct kvm *kvm)
>
> kvm_for_each_vcpu(i, vcpu, kvm) {
> vcpu->arch.pause = false;
> - rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
> + __kvm_vcpu_wake_up(vcpu);
> }
> }
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 76fb00921203..0cd7ed21b205 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1931,7 +1931,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
> /* If the preempt notifier has already run, it also called apic_timer_expired */
> if (!apic->lapic_timer.hv_timer_in_use)
> goto out;
> - WARN_ON(rcuwait_active(&vcpu->wait));
> + WARN_ON(kvm_vcpu_is_blocking(vcpu));
> apic_timer_expired(apic, false);
> cancel_hv_timer(apic);
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index bdaa0e70b060..1fa38dc00b87 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1151,6 +1151,20 @@ static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
> #endif
> }
>
> +/*
> + * Wake a vCPU if necessary, but don't do any stats/metadata updates. Returns
> + * true if the vCPU was blocking and was awakened, false otherwise.
> + */
> +static inline bool __kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
> +{
> + return !!rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
> +}
> +
> +static inline bool kvm_vcpu_is_blocking(struct kvm_vcpu *vcpu)
> +{
> + return rcuwait_active(kvm_arch_vcpu_get_wait(vcpu));
> +}
> +
> #ifdef __KVM_HAVE_ARCH_INTC_INITIALIZED
> /*
> * returns true if the virtual interrupt controller is initialized and
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index ccb35c22785e..9bfe1d6f6529 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -85,7 +85,7 @@ static void async_pf_execute(struct work_struct *work)
>
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
> - rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
> + __kvm_vcpu_wake_up(vcpu);
>
> mmput(mm);
> kvm_put_kvm(vcpu->kvm);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 481e8178b43d..c870cae7e776 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3332,10 +3332,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
>
> bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
> {
> - struct rcuwait *waitp;
> -
> - waitp = kvm_arch_vcpu_get_wait(vcpu);
> - if (rcuwait_wake_up(waitp)) {
> + if (__kvm_vcpu_wake_up(vcpu)) {
> WRITE_ONCE(vcpu->ready, true);
> ++vcpu->stat.generic.halt_wakeup;
> return true;
> @@ -3490,8 +3487,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> continue;
> if (vcpu == me)
> continue;
> - if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) &&
> - !vcpu_dy_runnable(vcpu))
> + if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> continue;
> if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> !kvm_arch_dy_has_pending_interrupt(vcpu) &&
Reviewed-by: Maxim Levitsky <[email protected]>
On Fri, 2021-10-08 at 19:11 -0700, Sean Christopherson wrote:
> Don't update halt-polling stats if halt-polling wasn't attempted. This
> is a nop as @poll_ns is guaranteed to be '0' (poll_end == start), but it
> will allow a future patch to move the histogram stats into the helper to
> resolve a discrepancy in what is considered a "successful" halt-poll.
>
> No functional change intended.
>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/kvm_main.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5d4a90032277..6156719bcbbc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3217,6 +3217,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> {
> struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
> bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
> + bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
> ktime_t start, cur, poll_end;
> bool waited = false;
> u64 block_ns;
> @@ -3224,7 +3225,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> kvm_arch_vcpu_blocking(vcpu);
>
> start = cur = poll_end = ktime_get();
> - if (vcpu->halt_poll_ns && halt_poll_allowed) {
> + if (do_halt_poll) {
> ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>
> ++vcpu->stat.generic.halt_attempted_poll;
> @@ -3276,8 +3277,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> kvm_arch_vcpu_unblocking(vcpu);
> block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>
> - update_halt_poll_stats(
> - vcpu, ktime_to_ns(ktime_sub(poll_end, start)), waited);
> + if (do_halt_poll)
> + update_halt_poll_stats(
> + vcpu, ktime_to_ns(ktime_sub(poll_end, start)), waited);
>
> if (halt_poll_allowed) {
> if (!vcpu_valid_wakeup(vcpu)) {
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On 27/10/21 18:04, Sean Christopherson wrote:
>>> + /*
>>> + * The smp_wmb() in kvm_make_request() pairs with the smp_mb_*()
>>> + * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU
>>> + * is guaranteed to see the event request if triggering a posted
>>> + * interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
>>
>> What this smp_wmb() pair with, is the smp_mb__after_atomic in
>> kvm_check_request(KVM_REQ_EVENT, vcpu).
>
> I don't think that's correct. There is no kvm_check_request() in the relevant path.
> kvm_vcpu_exit_request() uses kvm_request_pending(), which is just a READ_ONCE()
> without a barrier.
Ok, we are talking about two different sets of barriers. This is mine:
- smp_wmb() in kvm_make_request() pairs with the smp_mb__after_atomic() in
kvm_check_request(); it ensures that everything before the request
(in this case, pi_pending = true) is seen by inject_pending_event.
- pi_test_and_set_on() orders the write to ON after the write to PIR,
pairing with vmx_sync_pir_to_irr and ensuring that the bit in the PIR is
seen.
And this is yours:
- pi_test_and_set_on() _also_ orders the write to ON before the read of
vcpu->mode, pairing with vcpu_enter_guest()
- kvm_make_request() however does _not_ order the write to
vcpu->requests before the read of vcpu->mode, even though it's needed.
Usually that's handled by kvm_vcpu_exiting_guest_mode(), but in this case
vcpu->mode is read in kvm_vcpu_trigger_posted_interrupt.
So vmx_deliver_nested_posted_interrupt() is missing a smp_mb__after_atomic().
It's documentation only for x86, but still easily done in v3.
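To make the ordering concrete, here is a rough sketch (not the actual v3 change) of where the missing barrier would land in vmx_deliver_nested_posted_interrupt():

	vmx->nested.pi_pending = true;
	kvm_make_request(KVM_REQ_EVENT, vcpu);

	/*
	 * Order the pi_pending write and the request bit before the read of
	 * vcpu->mode in kvm_vcpu_trigger_posted_interrupt().  On x86 this is
	 * documentation only, as atomic RMW ops already imply a full barrier.
	 */
	smp_mb__after_atomic();

	if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
		kvm_vcpu_kick(vcpu);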
Paolo
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Return bools instead of ints for the posted interrupt "test" helpers.
> The bit position of the flag being tested does not matter to the callers,
> and is in fact lost by virtue of test_bit() itself returning a bool.
>
> Returning ints is potentially dangerous, e.g. "pi_test_on(pi_desc) == 1"
> is safe-ish because ON is bit 0 and thus any sane implementation of
> pi_test_on() will work, but for SN (bit 1), checking "== 1" would rely on
> pi_test_sn() to return 0 or 1, a.k.a. bools, as opposed to 0 or 2 (the
> positive bit position).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 4 ++--
> arch/x86/kvm/vmx/posted_intr.h | 6 +++---
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 6c2110d91b06..1688f8dc535a 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -185,7 +185,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> new.control) != old.control);
>
> /* We should not block the vCPU if an interrupt is posted for it. */
> - if (pi_test_on(pi_desc) == 1)
> + if (pi_test_on(pi_desc))
> __pi_post_block(vcpu);
>
> local_irq_enable();
> @@ -216,7 +216,7 @@ void pi_wakeup_handler(void)
> blocked_vcpu_list) {
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>
> - if (pi_test_on(pi_desc) == 1)
> + if (pi_test_on(pi_desc))
> kvm_vcpu_kick(vcpu);
> }
> spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index 7f7b2326caf5..36ae035f14aa 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -40,7 +40,7 @@ static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
> (unsigned long *)&pi_desc->control);
> }
>
> -static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
> +static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
> {
> return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
> }
> @@ -74,13 +74,13 @@ static inline void pi_clear_sn(struct pi_desc *pi_desc)
> (unsigned long *)&pi_desc->control);
> }
>
> -static inline int pi_test_on(struct pi_desc *pi_desc)
> +static inline bool pi_test_on(struct pi_desc *pi_desc)
> {
> return test_bit(POSTED_INTR_ON,
> (unsigned long *)&pi_desc->control);
> }
>
> -static inline int pi_test_sn(struct pi_desc *pi_desc)
> +static inline bool pi_test_sn(struct pi_desc *pi_desc)
> {
> return test_bit(POSTED_INTR_SN,
> (unsigned long *)&pi_desc->control);
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Explicitly skip posted interrupt updates if APICv is disabled in all of
> KVM, or if the guest doesn't have an in-kernel APIC. The PI descriptor
> is kept up-to-date if APICv is inhibited, e.g. so that re-enabling APICv
> doesn't require a bunch of updates, but neither the module param nor the
> APIC type can be changed on-the-fly.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 3263056784f5..351666c41bbc 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -28,11 +28,14 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> unsigned int dest;
>
> /*
> - * In case of hot-plug or hot-unplug, we may have to undo
> - * vmx_vcpu_pi_put even if there is no assigned device. And we
> - * always keep PI.NDST up to date for simplicity: it makes the
> - * code easier, and CPU migration is not a fast path.
> + * To simplify hot-plug and dynamic toggling of APICv, keep PI.NDST and
> + * PI.SN up-to-date even if there is no assigned device or if APICv is
> + * deactivated due to a dynamic inhibit bit, e.g. for Hyper-V's SyncIC.
> */
> + if (!enable_apicv || !lapic_in_kernel(vcpu))
> + return;
> +
> + /* Nothing to do if PI.SN==0 and the vCPU isn't being migrated. */
> if (!pi_test_sn(pi_desc) && vcpu->cpu == cpu)
> return;
>
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Move the WARN sanity checks out of the PI descriptor update loop so as
> not to spam the kernel log if the condition is violated and the update
> takes multiple attempts due to another writer. This also eliminates a
> few extra uops from the retry path.
>
> Technically not checking every attempt could mean KVM will now fail to
> WARN in a scenario that would have failed before, but any such failure
> would be inherently racy as some other agent (CPU or device) would have
> to concurrently modify the PI descriptor.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 351666c41bbc..67cbe6ab8f66 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -100,10 +100,11 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> struct pi_desc old, new;
> unsigned int dest;
>
> + WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
> + "Wakeup handler not enabled while the vCPU was blocking");
> +
> do {
> old.control = new.control = pi_desc->control;
> - WARN(old.nv != POSTED_INTR_WAKEUP_VECTOR,
> - "Wakeup handler not enabled while the VCPU is blocked\n");
>
> dest = cpu_physical_id(vcpu->cpu);
>
> @@ -161,13 +162,12 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> }
>
> + WARN(pi_desc->sn == 1,
> + "Posted Interrupt Suppress Notification set before blocking");
> +
> do {
> old.control = new.control = pi_desc->control;
>
> - WARN((pi_desc->sn == 1),
> - "Warning: SN field of posted-interrupts "
> - "is set before blocking\n");
> -
> /*
> * Since vCPU can be preempted during this process,
> * vcpu->cpu could be different with pre_pcpu, we
I don't know for sure if this is desired. I would just use WARN_ON_ONCE instead
if the warning spams the log.
If there is a race, I would rather catch it even if it is rare.
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the
> pre-block path, as NDST is guaranteed to be up-to-date. The comment
> about the vCPU being preempted during the update is simply wrong, as the
> update path runs with IRQs disabled (from before snapshotting vcpu->cpu,
> until after the update completes).
>
> The vCPU can get preempted _before_ the update starts, but not during.
> And if the vCPU is preempted before, vmx_vcpu_pi_load() is responsible
> for updating NDST when the vCPU is scheduled back in. In that case, the
> check against the wakeup vector in vmx_vcpu_pi_load() cannot be true as
> that would require the notification vector to have been set to the wakeup
> vector _before_ blocking.
>
> Opportunistically switch to using vcpu->cpu for the list/lock lookups,
> which presumably used pre_pcpu only for some phantom preemption logic.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 23 +++--------------------
> 1 file changed, 3 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 1688f8dc535a..239e0e72a0dd 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -130,7 +130,6 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> * - Store the vCPU to the wakeup list, so when interrupts happen
> * we can find the right vCPU to wake up.
> * - Change the Posted-interrupt descriptor as below:
> - * 'NDST' <-- vcpu->pre_pcpu
> * 'NV' <-- POSTED_INTR_WAKEUP_VECTOR
> * - If 'ON' is set during this process, which means at least one
> * interrupt is posted for this vCPU, we cannot block it, in
> @@ -139,7 +138,6 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> */
> int pi_pre_block(struct kvm_vcpu *vcpu)
> {
> - unsigned int dest;
> struct pi_desc old, new;
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>
> @@ -153,10 +151,10 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> local_irq_disable();
>
> vcpu->pre_pcpu = vcpu->cpu;
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> list_add_tail(&vcpu->blocked_vcpu_list,
> - &per_cpu(blocked_vcpu_on_cpu, vcpu->pre_pcpu));
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
>
> WARN(pi_desc->sn == 1,
> "Posted Interrupt Suppress Notification set before blocking");
> @@ -164,21 +162,6 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> do {
> old.control = new.control = pi_desc->control;
>
> - /*
> - * Since vCPU can be preempted during this process,
> - * vcpu->cpu could be different with pre_pcpu, we
> - * need to set pre_pcpu as the destination of wakeup
> - * notification event, then we can find the right vCPU
> - * to wakeup in wakeup handler if interrupts happen
> - * when the vCPU is in blocked state.
> - */
> - dest = cpu_physical_id(vcpu->pre_pcpu);
> -
> - if (x2apic_mode)
> - new.ndst = dest;
> - else
> - new.ndst = (dest << 8) & 0xFF00;
> -
> /* set 'NV' to 'wakeup vector' */
> new.nv = POSTED_INTR_WAKEUP_VECTOR;
> } while (cmpxchg64(&pi_desc->control, old.control,
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Save/restore IRQs when disabling IRQs in posted interrupt pre/post block
> in preparation for moving the code into vcpu_put/load(), where it may be
> called with IRQs already disabled.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 239e0e72a0dd..414ea6972b5c 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -140,6 +140,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> {
> struct pi_desc old, new;
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + unsigned long flags;
>
> if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> !irq_remapping_cap(IRQ_POSTING_CAP) ||
> @@ -147,8 +148,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> vmx_interrupt_blocked(vcpu))
> return 0;
>
> - WARN_ON(irqs_disabled());
> - local_irq_disable();
> + local_irq_save(flags);
>
> vcpu->pre_pcpu = vcpu->cpu;
> spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> @@ -171,19 +171,20 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> if (pi_test_on(pi_desc))
> __pi_post_block(vcpu);
>
> - local_irq_enable();
> + local_irq_restore(flags);
> return (vcpu->pre_pcpu == -1);
> }
>
> void pi_post_block(struct kvm_vcpu *vcpu)
> {
> + unsigned long flags;
> +
> if (vcpu->pre_pcpu == -1)
> return;
>
> - WARN_ON(irqs_disabled());
> - local_irq_disable();
> + local_irq_save(flags);
> __pi_post_block(vcpu);
> - local_irq_enable();
> + local_irq_restore(flags);
> }
>
> /*
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Use READ_ONCE() when loading the posted interrupt descriptor control
> field to ensure "old" and "new" have the same base value. If the
> compiler emits separate loads, and loads into "new" before "old", KVM
> could theoretically drop the ON bit if it were set between the loads.
>
> Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 414ea6972b5c..fea343dcc011 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -53,7 +53,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>
> /* The full case. */
> do {
> - old.control = new.control = pi_desc->control;
> + old.control = new.control = READ_ONCE(pi_desc->control);
>
> dest = cpu_physical_id(cpu);
>
> @@ -104,7 +104,7 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> "Wakeup handler not enabled while the vCPU was blocking");
>
> do {
> - old.control = new.control = pi_desc->control;
> + old.control = new.control = READ_ONCE(pi_desc->control);
>
> dest = cpu_physical_id(vcpu->cpu);
>
> @@ -160,7 +160,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> "Posted Interrupt Suppress Notification set before blocking");
>
> do {
> - old.control = new.control = pi_desc->control;
> + old.control = new.control = READ_ONCE(pi_desc->control);
>
> /* set 'NV' to 'wakeup vector' */
> new.nv = POSTED_INTR_WAKEUP_VECTOR;
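For context, a minimal illustration (not part of the patch) of the double-load hazard the changelog describes:

	/*
	 * Without READ_ONCE() the compiler may load pi_desc->control twice,
	 * e.g. once for "new" and once for "old".  If another agent sets ON
	 * between the two loads, cmpxchg64() can succeed against the second
	 * value even though "new" was built from the first, silently dropping
	 * ON.  A single shared load rules that out:
	 */
	old.control = new.control = READ_ONCE(pi_desc->control);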
I wish there were a way to mark fields in a struct as requiring 'READ_ONCE' on them,
so that the compiler would complain if this isn't done, or would automatically apply
'READ_ONCE' logic.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor
> out of the loop to write the descriptor, preemption is disabled so the
> CPU won't change, and if the APIC ID changes KVM has bigger problems.
>
> No functional change intended.
Is preemption always disabled in vmx_vcpu_pi_load()? vmx_vcpu_pi_load() is called from vmx_vcpu_load(),
which is called indirectly from vcpu_load(), which in turn is invoked from many ioctls
issued by userspace. I don't think preemption is disabled at those call sites.
Best regards,
Maxim Levitsky
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 25 +++++++++++--------------
> 1 file changed, 11 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index fea343dcc011..2b2206339174 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -51,17 +51,15 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> goto after_clear_sn;
> }
>
> - /* The full case. */
> + /* The full case. Set the new destination and clear SN. */
> + dest = cpu_physical_id(cpu);
> + if (!x2apic_mode)
> + dest = (dest << 8) & 0xFF00;
> +
> do {
> old.control = new.control = READ_ONCE(pi_desc->control);
>
> - dest = cpu_physical_id(cpu);
> -
> - if (x2apic_mode)
> - new.ndst = dest;
> - else
> - new.ndst = (dest << 8) & 0xFF00;
> -
> + new.ndst = dest;
> new.sn = 0;
> } while (cmpxchg64(&pi_desc->control, old.control,
> new.control) != old.control);
> @@ -103,15 +101,14 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
> "Wakeup handler not enabled while the vCPU was blocking");
>
> + dest = cpu_physical_id(vcpu->cpu);
> + if (!x2apic_mode)
> + dest = (dest << 8) & 0xFF00;
> +
> do {
> old.control = new.control = READ_ONCE(pi_desc->control);
>
> - dest = cpu_physical_id(vcpu->cpu);
> -
> - if (x2apic_mode)
> - new.ndst = dest;
> - else
> - new.ndst = (dest << 8) & 0xFF00;
> + new.ndst = dest;
>
> /* set 'NV' to 'notification vector' */
> new.nv = POSTED_INTR_VECTOR;
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Remove the vCPU from the wakeup list before updating the notification
> vector in the posted interrupt post-block helper. There is no need to
> wake the current vCPU as it is by definition not blocking. Practically
> speaking this is a nop as it only shaves a few meager cycles in the
> unlikely case that the vCPU was migrated and the previous pCPU gets a
> wakeup IRQ right before PID.NV is updated. The real motivation is to
> allow for more readable code in the future, when post-block is merged
> with vmx_vcpu_pi_load(), at which point removal from the list will be
> conditional on the old notification vector.
>
> Opportunistically add comments to document why KVM has a per-CPU spinlock
> that, at first glance, appears to be taken only on the owning CPU.
> Explicitly call out that the spinlock must be taken with IRQs disabled, a
> detail that was "lost" when KVM switched from spin_lock_irqsave() to
> spin_lock(), with IRQs disabled for the entirety of the relevant path.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 49 +++++++++++++++++++++++-----------
> 1 file changed, 33 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 2b2206339174..901b7a5f7777 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -10,10 +10,22 @@
> #include "vmx.h"
>
> /*
> - * We maintain a per-CPU linked-list of vCPU, so in wakeup_handler() we
> - * can find which vCPU should be waken up.
> + * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
Nit: While at it, it would be nice to rename this to pi_wakeup_handler() so that it can be
found more easily.
> + * when a WAKEUP_VECTOR interrupted is posted. vCPUs are added to the list when
> + * the vCPU is scheduled out and is blocking (e.g. in HLT) with IRQs enabled.
s/interrupted/interrupt ?
Isn't that comment incorrect? As I see it, the PI hardware is set up to use the WAKEUP_VECTOR
when a vCPU blocks (in pi_pre_block), and then that vCPU is added to the list.
pi_wakeup_handler() just walks the list and wakes up all vCPUs on it.
> + * The vCPUs posted interrupt descriptor is updated at the same time to set its
> + * notification vector to WAKEUP_VECTOR, so that posted interrupt from devices
> + * wake the target vCPUs. vCPUs are removed from the list and the notification
> + * vector is reset when the vCPU is scheduled in.
> */
> static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
Also while at it, why not rename this to 'blocked_vcpu_list',
to explain that this is a list of blocked vCPUs? It's a per-CPU variable,
so the 'on_cpu' suffix isn't needed IMHO.
> +/*
> + * Protect the per-CPU list with a per-CPU spinlock to handle task migration.
> + * When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
> + * ->sched_in() path will need to take the vCPU off the list of the _previous_
> + * CPU. IRQs must be disabled when taking this lock, otherwise deadlock will
> + * occur if a wakeup IRQ arrives and attempts to acquire the lock.
> + */
> static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
>
> static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> @@ -101,23 +113,28 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
> "Wakeup handler not enabled while the vCPU was blocking");
>
> - dest = cpu_physical_id(vcpu->cpu);
> - if (!x2apic_mode)
> - dest = (dest << 8) & 0xFF00;
> -
> - do {
> - old.control = new.control = READ_ONCE(pi_desc->control);
> -
> - new.ndst = dest;
> -
> - /* set 'NV' to 'notification vector' */
> - new.nv = POSTED_INTR_VECTOR;
> - } while (cmpxchg64(&pi_desc->control, old.control,
> - new.control) != old.control);
> -
> + /*
> + * Remove the vCPU from the wakeup list of the _previous_ pCPU, which
> + * will not be the same as the current pCPU if the task was migrated.
> + */
> spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> list_del(&vcpu->blocked_vcpu_list);
> spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> +
> + dest = cpu_physical_id(vcpu->cpu);
> + if (!x2apic_mode)
> + dest = (dest << 8) & 0xFF00;
It would be nice to have a helper for this; the same computation appears in this file twice.
Maybe there is already a function for it somewhere? (A possible shape is sketched after the
quoted diff below.)
> +
> + do {
> + old.control = new.control = READ_ONCE(pi_desc->control);
> +
> + new.ndst = dest;
> +
> + /* set 'NV' to 'notification vector' */
> + new.nv = POSTED_INTR_VECTOR;
> + } while (cmpxchg64(&pi_desc->control, old.control,
> + new.control) != old.control);
> +
> vcpu->pre_pcpu = -1;
> }
>
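One possible shape for the helper suggested above; pi_apic_dest() is a hypothetical name and is not part of this series or the tree:

/* Hypothetical helper (name and placement are assumptions). */
static u32 pi_apic_dest(int cpu)
{
	u32 dest = cpu_physical_id(cpu);

	return x2apic_mode ? dest : (dest << 8) & 0xFF00;
}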
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Drop sanity checks on the validity of the previous pCPU when handling
> vCPU block/unlock for posted interrupts. Barring a code bug or memory
> corruption, the sanity checks will never fire, and any code bug that does
> trip the WARN is all but guaranteed to completely break posted interrupts,
> i.e. should never get anywhere near production.
>
> This is the first of several steps toward eliminating kvm_vcpu.pre_pcpu.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 24 ++++++++++--------------
> 1 file changed, 10 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 67cbe6ab8f66..6c2110d91b06 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -118,12 +118,10 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> } while (cmpxchg64(&pi_desc->control, old.control,
> new.control) != old.control);
>
> - if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - list_del(&vcpu->blocked_vcpu_list);
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - vcpu->pre_pcpu = -1;
> - }
> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + list_del(&vcpu->blocked_vcpu_list);
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + vcpu->pre_pcpu = -1;
> }
>
> /*
> @@ -153,14 +151,12 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
>
> WARN_ON(irqs_disabled());
> local_irq_disable();
> - if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
> - vcpu->pre_pcpu = vcpu->cpu;
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - list_add_tail(&vcpu->blocked_vcpu_list,
> - &per_cpu(blocked_vcpu_on_cpu,
> - vcpu->pre_pcpu));
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - }
> +
> + vcpu->pre_pcpu = vcpu->cpu;
> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> + list_add_tail(&vcpu->blocked_vcpu_list,
> + &per_cpu(blocked_vcpu_on_cpu, vcpu->pre_pcpu));
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
>
> WARN(pi_desc->sn == 1,
> "Posted Interrupt Suppress Notification set before blocking");
Reviewed-by: Maxim Levitsky <[email protected]>
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Move the posted interrupt pre/post_block logic into vcpu_put/load
> respectively, using the kvm_vcpu_is_blocking() to determining whether or
> not the wakeup handler needs to be set (and unset). This avoids updating
> the PI descriptor if halt-polling is successful, reduces the number of
> touchpoints for updating the descriptor, and eliminates the confusing
> behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
> is scheduled back in after preemption.
>
> The downside is that KVM will do the PID update twice if the vCPU is
> preempted after prepare_to_rcuwait() but before schedule(), but that's a
> rare case (and non-existent on !PREEMPT kernels).
>
> The notable wart is the need to send a self-IPI on the wakeup vector if
> an outstanding notification is pending after configuring the wakeup
> vector. Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
> but the scheduler doesn't support waking a task from its preemption
> notifier callback, i.e. while the task is smack dab in the middle of
> being scheduled out.
>
> Note, setting the wakeup vector before halt-polling is not necessary as
> the pending IRQ will be recorded in the PIR and detected as a blocking-
> breaking condition by kvm_vcpu_has_events() -> vmx_sync_pir_to_irr().
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 162 ++++++++++++++-------------------
> arch/x86/kvm/vmx/posted_intr.h | 8 +-
> arch/x86/kvm/vmx/vmx.c | 5 -
> 3 files changed, 75 insertions(+), 100 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 901b7a5f7777..d2b3d75c57d1 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -37,33 +37,45 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> struct pi_desc old, new;
> + unsigned long flags;
> unsigned int dest;
>
> /*
> - * To simplify hot-plug and dynamic toggling of APICv, keep PI.NDST and
> - * PI.SN up-to-date even if there is no assigned device or if APICv is
> + * To simplify hot-plug and dynamic toggling of APICv, keep PI.NDST
> + * up-to-date even if there is no assigned device or if APICv is
> * deactivated due to a dynamic inhibit bit, e.g. for Hyper-V's SyncIC.
> */
> if (!enable_apicv || !lapic_in_kernel(vcpu))
> return;
>
> - /* Nothing to do if PI.SN==0 and the vCPU isn't being migrated. */
> - if (!pi_test_sn(pi_desc) && vcpu->cpu == cpu)
> + /*
> + * If the vCPU wasn't on the wakeup list and wasn't migrated, then the
> + * full update can be skipped as neither the vector nor the destination
> + * needs to be changed.
> + */
> + if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
> + /*
> + * Clear SN if it was set due to being preempted. Again, do
> + * this even if there is no assigned device for simplicity.
> + */
> + if (pi_test_and_clear_sn(pi_desc))
> + goto after_clear_sn;
> return;
> + }
> +
> + local_irq_save(flags);
>
> /*
> - * If the 'nv' field is POSTED_INTR_WAKEUP_VECTOR, do not change
> - * PI.NDST: pi_post_block is the one expected to change PID.NDST and the
> - * wakeup handler expects the vCPU to be on the blocked_vcpu_list that
> - * matches PI.NDST. Otherwise, a vcpu may not be able to be woken up
> - * correctly.
> + * If the vCPU was waiting for wakeup, remove the vCPU from the wakeup
> + * list of the _previous_ pCPU, which will not be the same as the
> + * current pCPU if the task was migrated.
> */
> - if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR || vcpu->cpu == cpu) {
> - pi_clear_sn(pi_desc);
> - goto after_clear_sn;
> + if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> + list_del(&vcpu->blocked_vcpu_list);
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> }
>
> - /* The full case. Set the new destination and clear SN. */
> dest = cpu_physical_id(cpu);
> if (!x2apic_mode)
> dest = (dest << 8) & 0xFF00;
> @@ -71,11 +83,23 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> do {
> old.control = new.control = READ_ONCE(pi_desc->control);
>
> + /*
> + * Clear SN (as above) and refresh the destination APIC ID to
> + * handle task migration (@cpu != vcpu->cpu).
> + */
> new.ndst = dest;
> new.sn = 0;
> +
> + /*
> + * Restore the notification vector; in the blocking case, the
> + * descriptor was modified on "put" to use the wakeup vector.
> + */
> + new.nv = POSTED_INTR_VECTOR;
> } while (cmpxchg64(&pi_desc->control, old.control,
> new.control) != old.control);
>
> + local_irq_restore(flags);
> +
> after_clear_sn:
>
> /*
> @@ -90,88 +114,24 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> pi_set_on(pi_desc);
> }
>
> -void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> -{
> - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> -
> - if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> - !irq_remapping_cap(IRQ_POSTING_CAP) ||
> - !kvm_vcpu_apicv_active(vcpu))
> - return;
> -
> - /* Set SN when the vCPU is preempted */
> - if (vcpu->preempted)
> - pi_set_sn(pi_desc);
> -}
> -
> -static void __pi_post_block(struct kvm_vcpu *vcpu)
> -{
> - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> - struct pi_desc old, new;
> - unsigned int dest;
> -
> - WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
> - "Wakeup handler not enabled while the vCPU was blocking");
> -
> - /*
> - * Remove the vCPU from the wakeup list of the _previous_ pCPU, which
> - * will not be the same as the current pCPU if the task was migrated.
> - */
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> - list_del(&vcpu->blocked_vcpu_list);
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> -
> - dest = cpu_physical_id(vcpu->cpu);
> - if (!x2apic_mode)
> - dest = (dest << 8) & 0xFF00;
> -
> - do {
> - old.control = new.control = READ_ONCE(pi_desc->control);
> -
> - new.ndst = dest;
> -
> - /* set 'NV' to 'notification vector' */
> - new.nv = POSTED_INTR_VECTOR;
> - } while (cmpxchg64(&pi_desc->control, old.control,
> - new.control) != old.control);
> -
> - vcpu->pre_pcpu = -1;
> -}
> -
> /*
> - * This routine does the following things for vCPU which is going
> - * to be blocked if VT-d PI is enabled.
> - * - Store the vCPU to the wakeup list, so when interrupts happen
> - * we can find the right vCPU to wake up.
> - * - Change the Posted-interrupt descriptor as below:
> - * 'NV' <-- POSTED_INTR_WAKEUP_VECTOR
> - * - If 'ON' is set during this process, which means at least one
> - * interrupt is posted for this vCPU, we cannot block it, in
> - * this case, return 1, otherwise, return 0.
> - *
> + * Put the vCPU on this pCPU's list of vCPUs that needs to be awakened and set
> + * WAKEUP as the notification vector in the PI descriptor.
> */
> -int pi_pre_block(struct kvm_vcpu *vcpu)
> +static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> {
> - struct pi_desc old, new;
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + struct pi_desc old, new;
> unsigned long flags;
>
> - if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> - !irq_remapping_cap(IRQ_POSTING_CAP) ||
> - !kvm_vcpu_apicv_active(vcpu) ||
> - vmx_interrupt_blocked(vcpu))
> - return 0;
> -
> local_irq_save(flags);
>
> - vcpu->pre_pcpu = vcpu->cpu;
> spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> list_add_tail(&vcpu->blocked_vcpu_list,
> &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
> spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
>
> - WARN(pi_desc->sn == 1,
> - "Posted Interrupt Suppress Notification set before blocking");
> + WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
>
> do {
> old.control = new.control = READ_ONCE(pi_desc->control);
> @@ -181,24 +141,40 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> } while (cmpxchg64(&pi_desc->control, old.control,
> new.control) != old.control);
>
> - /* We should not block the vCPU if an interrupt is posted for it. */
> - if (pi_test_on(pi_desc))
> - __pi_post_block(vcpu);
> + /*
> + * Send a wakeup IPI to this CPU if an interrupt may have been posted
> + * before the notification vector was updated, in which case the IRQ
> + * will arrive on the non-wakeup vector. An IPI is needed as calling
> + * try_to_wake_up() from ->sched_out() isn't allowed (IRQs are not
> + * enabled until it is safe to call try_to_wake_up() on the task being
> + * scheduled out).
> + */
> + if (pi_test_on(&new))
> + apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
>
> local_irq_restore(flags);
> - return (vcpu->pre_pcpu == -1);
> }
>
> -void pi_post_block(struct kvm_vcpu *vcpu)
> +void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> {
> - unsigned long flags;
> + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>
> - if (vcpu->pre_pcpu == -1)
> + if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> + !irq_remapping_cap(IRQ_POSTING_CAP) ||
> + !kvm_vcpu_apicv_active(vcpu))
> return;
>
> - local_irq_save(flags);
> - __pi_post_block(vcpu);
> - local_irq_restore(flags);
> + if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> + pi_enable_wakeup_handler(vcpu);
> +
> + /*
> + * Set SN when the vCPU is preempted. Note, the vCPU can both be seen
> + * as blocking and preempted, e.g. if it's preempted between setting
> + * its wait state and manually scheduling out. In that case, KVM will
> + * update
> + */
> + if (vcpu->preempted)
> + pi_set_sn(pi_desc);
> }
>
> /*
> @@ -239,7 +215,7 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
> * Bail out of the block loop if the VM has an assigned
> * device, but the blocking vCPU didn't reconfigure the
> * PI.NV to the wakeup vector, i.e. the assigned device
> - * came along after the initial check in pi_pre_block().
> + * came along after the initial check in vmx_vcpu_pi_put().
> */
> void vmx_pi_start_assignment(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index 36ae035f14aa..eb14e76b84ef 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -40,6 +40,12 @@ static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
> (unsigned long *)&pi_desc->control);
> }
>
> +static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
> +{
> + return test_and_clear_bit(POSTED_INTR_SN,
> + (unsigned long *)&pi_desc->control);
> +}
> +
> static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
> {
> return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
> @@ -88,8 +94,6 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
>
> void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
> -int pi_pre_block(struct kvm_vcpu *vcpu);
> -void pi_post_block(struct kvm_vcpu *vcpu);
> void pi_wakeup_handler(void);
> void __init pi_init_cpu(int cpu);
> bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5517893f12fc..26ed8cd1a1f2 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7462,9 +7462,6 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
>
> static int vmx_pre_block(struct kvm_vcpu *vcpu)
> {
> - if (pi_pre_block(vcpu))
> - return 1;
> -
> if (kvm_lapic_hv_timer_in_use(vcpu))
> kvm_lapic_switch_to_sw_timer(vcpu);
>
> @@ -7475,8 +7472,6 @@ static void vmx_post_block(struct kvm_vcpu *vcpu)
> {
> if (kvm_x86_ops.set_hv_timer)
> kvm_lapic_switch_to_hv_timer(vcpu);
> -
> - pi_post_block(vcpu);
> }
>
> static void vmx_setup_mce(struct kvm_vcpu *vcpu)
Looks OK to me, and IMHO it is a very good step toward simplifying that code,
but the logic is far from simple, so I might have missed something.
In particular, this should be tested with nested APICv, which I don't yet know well
enough to say whether this can break it or not.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Remove kvm_vcpu.pre_pcpu as it no longer has any users. No functional
> change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> include/linux/kvm_host.h | 1 -
> virt/kvm/kvm_main.c | 1 -
> 2 files changed, 2 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1fa38dc00b87..87996b22e681 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -304,7 +304,6 @@ struct kvm_vcpu {
> u64 requests;
> unsigned long guest_debug;
>
> - int pre_pcpu;
> struct list_head blocked_vcpu_list;
>
> struct mutex mutex;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c870cae7e776..2bbf5c9d410f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -426,7 +426,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> #endif
> kvm_async_pf_vcpu_init(vcpu);
>
> - vcpu->pre_pcpu = -1;
> INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
>
> kvm_vcpu_set_in_spin_loop(vcpu, false);
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Move the seemingly generic blocked_vcpu_list from kvm_vcpu to vcpu_vmx, and
> rename the list and all associated variables to clarify that it tracks
> the set of vCPUs that need to be poked on a posted interrupt to the wakeup
> vector. The list is not used to track _all_ vCPUs that are blocking, and
> the term "blocked" can be misleading as it may refer to a blocking
> condition in the host or the guest, whereas the PI wakeup case is
> specifically for the vCPUs that are actively blocking from within the
> guest.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 39 +++++++++++++++++-----------------
> arch/x86/kvm/vmx/vmx.c | 2 ++
> arch/x86/kvm/vmx/vmx.h | 3 +++
> include/linux/kvm_host.h | 2 --
> virt/kvm/kvm_main.c | 2 --
> 5 files changed, 25 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index d2b3d75c57d1..f1bcf8c32b6d 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -18,7 +18,7 @@
> * wake the target vCPUs. vCPUs are removed from the list and the notification
> * vector is reset when the vCPU is scheduled in.
> */
> -static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> +static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
This looks good, you can disregard my comment on this variable from previous patch
where I nitpicked about it.
> /*
> * Protect the per-CPU list with a per-CPU spinlock to handle task migration.
> * When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
> @@ -26,7 +26,7 @@ static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> * CPU. IRQs must be disabled when taking this lock, otherwise deadlock will
> * occur if a wakeup IRQ arrives and attempts to acquire the lock.
> */
> -static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> +static DEFINE_PER_CPU(spinlock_t, wakeup_vcpus_on_cpu_lock);
>
> static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> {
> @@ -36,6 +36,7 @@ static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> struct pi_desc old, new;
> unsigned long flags;
> unsigned int dest;
> @@ -71,9 +72,9 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> * current pCPU if the task was migrated.
> */
> if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> - list_del(&vcpu->blocked_vcpu_list);
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> + spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> + list_del(&vmx->pi_wakeup_list);
> + spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> }
>
> dest = cpu_physical_id(cpu);
> @@ -121,15 +122,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> {
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> struct pi_desc old, new;
> unsigned long flags;
>
> local_irq_save(flags);
>
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> - list_add_tail(&vcpu->blocked_vcpu_list,
> - &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
> + spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> + list_add_tail(&vmx->pi_wakeup_list,
> + &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
> + spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
>
> WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
>
> @@ -182,24 +184,23 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> */
> void pi_wakeup_handler(void)
> {
> - struct kvm_vcpu *vcpu;
> int cpu = smp_processor_id();
> + struct vcpu_vmx *vmx;
>
> - spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> - list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> - blocked_vcpu_list) {
> - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
> + list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_on_cpu, cpu),
> + pi_wakeup_list) {
>
> - if (pi_test_on(pi_desc))
> - kvm_vcpu_kick(vcpu);
> + if (pi_test_on(&vmx->pi_desc))
> + kvm_vcpu_kick(&vmx->vcpu);
> }
> - spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> + spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
> }
>
> void __init pi_init_cpu(int cpu)
> {
> - INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> - spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> + INIT_LIST_HEAD(&per_cpu(wakeup_vcpus_on_cpu, cpu));
> + spin_lock_init(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
> }
>
> bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 26ed8cd1a1f2..b3bb2031a7ac 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6848,6 +6848,8 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
> BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
> vmx = to_vmx(vcpu);
>
> + INIT_LIST_HEAD(&vmx->pi_wakeup_list);
> +
> err = -ENOMEM;
>
> vmx->vpid = allocate_vpid();
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 592217fd7d92..d1a720be9a64 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -298,6 +298,9 @@ struct vcpu_vmx {
> /* Posted interrupt descriptor */
> struct pi_desc pi_desc;
>
> + /* Used if this vCPU is waiting for PI notification wakeup. */
> + struct list_head pi_wakeup_list;
> +
> /* Support for a guest hypervisor (nested VMX) */
> struct nested_vmx nested;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 87996b22e681..c5961a361c73 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -304,8 +304,6 @@ struct kvm_vcpu {
> u64 requests;
> unsigned long guest_debug;
>
> - struct list_head blocked_vcpu_list;
> -
> struct mutex mutex;
> struct kvm_run *run;
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2bbf5c9d410f..c1850b60f38b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -426,8 +426,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> #endif
> kvm_async_pf_vcpu_init(vcpu);
>
> - INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
> -
> kvm_vcpu_set_in_spin_loop(vcpu, false);
> kvm_vcpu_set_dy_eligible(vcpu, false);
> vcpu->preempted = false;
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Move the WARN sanity checks out of the PI descriptor update loop so as
> > not to spam the kernel log if the condition is violated and the update
> > takes multiple attempts due to another writer. This also eliminates a
> > few extra uops from the retry path.
> >
> > Technically not checking every attempt could mean KVM will now fail to
> > WARN in a scenario that would have failed before, but any such failure
> > would be inherently racy as some other agent (CPU or device) would have
> > to concurrently modify the PI descriptor.
...
> Don't know for sure if this is desired. I'll would just use WARN_ON_ONCE instead
> if the warning spams the log.
>
> If there is a race I would rather want to catch it even if rare.
Paolo had similar concerns[*]. I copied the most relevant part of the discussion
below; let me know if you object to the outcome.
Thanks for the reviews!
[*] https://lore.kernel.org/all/[email protected]/T/#u
On Wed, Oct 27, 2021 at 8:38 AM Paolo Bonzini <[email protected]> wrote:
> On 27/10/21 17:28, Sean Christopherson wrote:
> > On Wed, Oct 27, 2021, Paolo Bonzini wrote:
> > > On 27/10/21 16:41, Sean Christopherson wrote:
> > > > The other thing I don't like about having the WARN in the loop is that it suggests
> > > > that something other than the vCPU can modify the NDST and SN fields, which is
> > > > wrong and confusing (for me).
> > >
> > > Yeah, I can agree with that. Can you add it in a comment above the cmpxchg
> > > loop, it can be as simple as
> > >
> > > /* The processor can set ON concurrently. */
> > >
> > > when you respin patch 21 and the rest of the series?
> >
> > I can definitely add a comment, but I think that comment is incorrect.
>
> It's completely backwards indeed. I first had "the hardware" and then
> shut down my brain for a second to replace it.
>
> > So something like this?
> >
> > /* ON can be set concurrently by a different vCPU or by hardware. */
>
> Yes, of course.
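For reference, a sketch of where the agreed-upon comment would land, using the vmx_vcpu_pi_load() update loop shown earlier in the thread as the example (the exact v3 placement is an assumption):

	/* ON can be set concurrently by a different vCPU or by hardware. */
	do {
		old.control = new.control = READ_ONCE(pi_desc->control);

		new.ndst = dest;
		new.sn = 0;
		new.nv = POSTED_INTR_VECTOR;
	} while (cmpxchg64(&pi_desc->control, old.control,
			   new.control) != old.control);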
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Handle the switch to/from the hypervisor/software timer when a vCPU is
> blocking in common x86 instead of in VMX. Even though VMX is the only
> user of a hypervisor timer, the logic and all functions involved are
> generic x86 (unless future CPUs do something completely different and
> implement a hypervisor timer that runs regardless of mode).
>
> Handling the switch in common x86 will allow for the elimination of the
> pre/post_blocks hooks, and also lets KVM switch back to the hypervisor
> timer if and only if it was in use (without additional params). Add a
> comment explaining why the switch cannot be deferred to kvm_sched_out()
> or kvm_vcpu_block().
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 6 +-----
> arch/x86/kvm/x86.c | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b3bb2031a7ac..a24f19874716 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7464,16 +7464,12 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
>
> static int vmx_pre_block(struct kvm_vcpu *vcpu)
> {
> - if (kvm_lapic_hv_timer_in_use(vcpu))
> - kvm_lapic_switch_to_sw_timer(vcpu);
> -
> return 0;
> }
>
> static void vmx_post_block(struct kvm_vcpu *vcpu)
> {
> - if (kvm_x86_ops.set_hv_timer)
> - kvm_lapic_switch_to_hv_timer(vcpu);
> +
> }
>
> static void vmx_setup_mce(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e0219acfd9cf..909e932a7ae7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9896,8 +9896,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
> static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
> {
> + bool hv_timer;
> +
> if (!kvm_arch_vcpu_runnable(vcpu) &&
> (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
> + /*
> + * Switch to the software timer before halt-polling/blocking as
> + * the guest's timer may be a break event for the vCPU, and the
> + * hypervisor timer runs only when the CPU is in guest mode.
> + * Switch before halt-polling so that KVM recognizes an expired
> + * timer before blocking.
> + */
I didn't know about this until now, but it all makes sense. The comment is very good.
> + hv_timer = kvm_lapic_hv_timer_in_use(vcpu);
> + if (hv_timer)
> + kvm_lapic_switch_to_sw_timer(vcpu);
> +
> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
> kvm_vcpu_halt(vcpu);
> @@ -9905,6 +9918,9 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
> kvm_vcpu_block(vcpu);
> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>
> + if (hv_timer)
> + kvm_lapic_switch_to_hv_timer(vcpu);
> +
> if (kvm_x86_ops.post_block)
> static_call(kvm_x86_post_block)(vcpu);
>
> @@ -10136,6 +10152,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> r = -EINTR;
> goto out;
> }
> + /*
> + * It should be impossible for the hypervisor timer to be in
> + * use before KVM has ever run the vCPU.
> + */
> + WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
> kvm_vcpu_block(vcpu);
> if (kvm_apic_accept_events(vcpu) < 0) {
> r = 0;
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Unexport switch_to_{hv,sw}_timer() now that common x86 handles the
> transitions.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0cd7ed21b205..cfb64bd4a1c1 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1948,7 +1948,6 @@ void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu)
> {
> restart_apic_timer(vcpu->arch.apic);
> }
> -EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer);
>
> void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
> {
> @@ -1960,7 +1959,6 @@ void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
> start_sw_timer(apic);
> preempt_enable();
> }
> -EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_sw_timer);
>
> void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu)
> {
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 2 --
> arch/x86/include/asm/kvm_host.h | 12 ------------
> arch/x86/kvm/vmx/vmx.c | 13 -------------
> arch/x86/kvm/x86.c | 6 +-----
> 4 files changed, 1 insertion(+), 32 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index cefe1d81e2e8..c2b007171abd 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -96,8 +96,6 @@ KVM_X86_OP(handle_exit_irqoff)
> KVM_X86_OP_NULL(request_immediate_exit)
> KVM_X86_OP(sched_in)
> KVM_X86_OP_NULL(update_cpu_dirty_logging)
> -KVM_X86_OP_NULL(pre_block)
> -KVM_X86_OP_NULL(post_block)
> KVM_X86_OP_NULL(vcpu_blocking)
> KVM_X86_OP_NULL(vcpu_unblocking)
> KVM_X86_OP_NULL(update_pi_irte)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 328103a520d3..76a8dddc1a48 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1445,18 +1445,6 @@ struct kvm_x86_ops {
> const struct kvm_pmu_ops *pmu_ops;
> const struct kvm_x86_nested_ops *nested_ops;
>
> - /*
> - * Architecture specific hooks for vCPU blocking due to
> - * HLT instruction.
> - * Returns for .pre_block():
> - * - 0 means continue to block the vCPU.
> - * - 1 means we cannot block the vCPU since some event
> - * happens during this period, such as, 'ON' bit in
> - * posted-interrupts descriptor is set.
> - */
> - int (*pre_block)(struct kvm_vcpu *vcpu);
> - void (*post_block)(struct kvm_vcpu *vcpu);
> -
> void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
> void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a24f19874716..13e732a818f3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7462,16 +7462,6 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
> secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
> }
>
> -static int vmx_pre_block(struct kvm_vcpu *vcpu)
> -{
> - return 0;
> -}
> -
> -static void vmx_post_block(struct kvm_vcpu *vcpu)
> -{
> -
> -}
> -
> static void vmx_setup_mce(struct kvm_vcpu *vcpu)
> {
> if (vcpu->arch.mcg_cap & MCG_LMCE_P)
> @@ -7665,9 +7655,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
> .cpu_dirty_log_size = PML_ENTITY_NUM,
> .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
>
> - .pre_block = vmx_pre_block,
> - .post_block = vmx_post_block,
> -
> .pmu_ops = &intel_pmu_ops,
> .nested_ops = &vmx_nested_ops,
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 909e932a7ae7..9643f23c28c7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9898,8 +9898,7 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
> {
> bool hv_timer;
>
> - if (!kvm_arch_vcpu_runnable(vcpu) &&
> - (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
> + if (!kvm_arch_vcpu_runnable(vcpu)) {
> /*
> * Switch to the software timer before halt-polling/blocking as
> * the guest's timer may be a break event for the vCPU, and the
> @@ -9921,9 +9920,6 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
> if (hv_timer)
> kvm_lapic_switch_to_hv_timer(vcpu);
>
> - if (kvm_x86_ops.post_block)
> - static_call(kvm_x86_post_block)(vcpu);
> -
> if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
> return 1;
> }
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Use READ_ONCE() when loading the posted interrupt descriptor control
> > field to ensure "old" and "new" have the same base value. If the
> > compiler emits separate loads, and loads into "new" before "old", KVM
> > could theoretically drop the ON bit if it were set between the loads.
> >
> > Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/vmx/posted_intr.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > index 414ea6972b5c..fea343dcc011 100644
> > --- a/arch/x86/kvm/vmx/posted_intr.c
> > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > @@ -53,7 +53,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> >
> > /* The full case. */
> > do {
> > - old.control = new.control = pi_desc->control;
> > + old.control = new.control = READ_ONCE(pi_desc->control);
> >
> > dest = cpu_physical_id(cpu);
> >
> > @@ -104,7 +104,7 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> > "Wakeup handler not enabled while the vCPU was blocking");
> >
> > do {
> > - old.control = new.control = pi_desc->control;
> > + old.control = new.control = READ_ONCE(pi_desc->control);
> >
> > dest = cpu_physical_id(vcpu->cpu);
> >
> > @@ -160,7 +160,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> > "Posted Interrupt Suppress Notification set before blocking");
> >
> > do {
> > - old.control = new.control = pi_desc->control;
> > + old.control = new.control = READ_ONCE(pi_desc->control);
> >
> > /* set 'NV' to 'wakeup vector' */
> > new.nv = POSTED_INTR_WAKEUP_VECTOR;
>
> I wish there was a way to mark fields in a struct as requiring 'READ_ONCE' on them,
> so that the compiler would complain if this isn't done, or automatically use 'READ_ONCE'
> logic.
Hmm, I think you could make an argument that ON and thus the whole "control"
word should be volatile. AFAICT, tagging just "on" as volatile actually works.
There's even a clause in Documentation/process/volatile-considered-harmful.rst
that calls this out as a (potentially) legitimate use case.
- Pointers to data structures in coherent memory which might be modified
by I/O devices can, sometimes, legitimately be volatile.
That said, I think I actually prefer forcing the use of READ_ONCE. The descriptor
requires more protections than what volatile provides, namely that all writes need
to be atomic. So given that volatile alone isn't sufficient, I'd prefer to have
the code itself be more self-documenting.
E.g. this compiles and does not mess up the expected size.
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 7f7b2326caf5..149df3b18789 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -11,9 +11,9 @@ struct pi_desc {
union {
struct {
/* bit 256 - Outstanding Notification */
- u16 on : 1,
+ volatile u16 on : 1;
/* bit 257 - Suppress Notification */
- sn : 1,
+ u16 sn : 1,
/* bit 271:258 - Reserved */
rsvd_1 : 14;
/* bit 279:272 - Notification Vector */
@@ -23,7 +23,7 @@ struct pi_desc {
/* bit 319:288 - Notification Destination */
u32 ndst;
};
- u64 control;
+ volatile u64 control;
};
u32 rsvd[6];
} __aligned(64);
On Thu, 2021-10-28 at 14:28 +0300, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor
> > out of the loop to write the descriptor, preemption is disabled so the
> > CPU won't change, and if the APIC ID changes KVM has bigger problems.
> >
> > No functional change intended.
>
> Is preemption always disabled in vmx_vcpu_pi_load? vmx_vcpu_pi_load is called from vmx_vcpu_load,
> which is called indirectly from vcpu_load which is called from many ioctls,
> which userspace does. In these places I don't think that preemption is disabled.
You can disregard this; I missed the fact that we have 'int cpu = get_cpu();'
which disables preemption in 'vcpu_load'.
Thus,
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
>
> Best regards,
> Maxim Levitsky
>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/vmx/posted_intr.c | 25 +++++++++++--------------
> > 1 file changed, 11 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > index fea343dcc011..2b2206339174 100644
> > --- a/arch/x86/kvm/vmx/posted_intr.c
> > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > @@ -51,17 +51,15 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> > goto after_clear_sn;
> > }
> >
> > - /* The full case. */
> > + /* The full case. Set the new destination and clear SN. */
> > + dest = cpu_physical_id(cpu);
> > + if (!x2apic_mode)
> > + dest = (dest << 8) & 0xFF00;
> > +
> > do {
> > old.control = new.control = READ_ONCE(pi_desc->control);
> >
> > - dest = cpu_physical_id(cpu);
> > -
> > - if (x2apic_mode)
> > - new.ndst = dest;
> > - else
> > - new.ndst = (dest << 8) & 0xFF00;
> > -
> > + new.ndst = dest;
> > new.sn = 0;
> > } while (cmpxchg64(&pi_desc->control, old.control,
> > new.control) != old.control);
> > @@ -103,15 +101,14 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> > WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
> > "Wakeup handler not enabled while the vCPU was blocking");
> >
> > + dest = cpu_physical_id(vcpu->cpu);
> > + if (!x2apic_mode)
> > + dest = (dest << 8) & 0xFF00;
> > +
> > do {
> > old.control = new.control = READ_ONCE(pi_desc->control);
> >
> > - dest = cpu_physical_id(vcpu->cpu);
> > -
> > - if (x2apic_mode)
> > - new.ndst = dest;
> > - else
> > - new.ndst = (dest << 8) & 0xFF00;
> > + new.ndst = dest;
> >
> > /* set 'NV' to 'notification vector' */
> > new.nv = POSTED_INTR_VECTOR;
On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor
> > out of the loop to write the descriptor, preemption is disabled so the
> > CPU won't change, and if the APIC ID changes KVM has bigger problems.
> >
> > No functional change intended.
>
> Is preemption always disabled in vmx_vcpu_pi_load? vmx_vcpu_pi_load is called
> from vmx_vcpu_load, which is called indirectly from vcpu_load which is called
> from many ioctls, which userspace does. In these places I don't think that
> preemption is disabled.
Preemption is disabled in vcpu_load() by the get_cpu(). The "cpu" param that's
passed around the vcpu_load() stack is also why I think it's ok to _not_ assert
that preemption is disabled in vmx_vcpu_pi_load(); if preemption is enabled,
"cpu" is unstable and thus the entire "load" operation is busted.
#define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
#define put_cpu() preempt_enable()
void vcpu_load(struct kvm_vcpu *vcpu)
{
int cpu = get_cpu();
__this_cpu_write(kvm_running_vcpu, vcpu);
preempt_notifier_register(&vcpu->preempt_notifier);
kvm_arch_vcpu_load(vcpu, cpu);
put_cpu();
}
EXPORT_SYMBOL_GPL(vcpu_load);
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Signal the AVIC doorbell iff the vCPU is running in the guest. If the vCPU
> is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the
> next VMRUN, which unconditionally processes the vIRR.
>
> Add comments to document the logic.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/avic.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 208c5c71e827..cbf02e7e20d0 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -674,7 +674,12 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> kvm_lapic_set_irr(vec, vcpu->arch.apic);
> smp_mb__after_atomic();
>
> - if (avic_vcpu_is_running(vcpu)) {
> + /*
> + * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
> + * is in the guest. If the vCPU is not in the guest, hardware will
> + * automatically process AVIC interrupts at VMRUN.
> + */
> + if (vcpu->mode == IN_GUEST_MODE) {
> int cpu = READ_ONCE(vcpu->cpu);
>
> /*
> @@ -687,8 +692,13 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> if (cpu != get_cpu())
> wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> put_cpu();
> - } else
> + } else {
> + /*
> + * Wake the vCPU if it was blocking. KVM will then detect the
> + * pending IRQ when checking if the vCPU has a wake event.
> + */
> kvm_vcpu_wake_up(vcpu);
> + }
>
> return 0;
> }
It makes sense indeed to avoid ringing the doorbell when the vCPU is not in guest mode.
I do wonder if we want to always call kvm_vcpu_wake_up otherwise, as the vCPU might
be just outside of guest mode and not scheduled out. I don't know how expensive
kvm_vcpu_wake_up is in this case.
Before this patch, the avic_vcpu_is_running would only be false when the vCPU is scheduled out
(e.g when vcpu_put was done on it)
Best regards,
Maxim Levitsky
On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Signal the AVIC doorbell iff the vCPU is running in the guest. If the vCPU
> > is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the
> > next VMRUN, which unconditionally processes the vIRR.
> >
> > Add comments to document the logic.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/svm/avic.c | 14 ++++++++++++--
> > 1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index 208c5c71e827..cbf02e7e20d0 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -674,7 +674,12 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> > kvm_lapic_set_irr(vec, vcpu->arch.apic);
> > smp_mb__after_atomic();
> >
> > - if (avic_vcpu_is_running(vcpu)) {
> > + /*
> > + * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
> > + * is in the guest. If the vCPU is not in the guest, hardware will
> > + * automatically process AVIC interrupts at VMRUN.
> > + */
> > + if (vcpu->mode == IN_GUEST_MODE) {
> > int cpu = READ_ONCE(vcpu->cpu);
> >
> > /*
> > @@ -687,8 +692,13 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> > if (cpu != get_cpu())
> > wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> > put_cpu();
> > - } else
> > + } else {
> > + /*
> > + * Wake the vCPU if it was blocking. KVM will then detect the
> > + * pending IRQ when checking if the vCPU has a wake event.
> > + */
> > kvm_vcpu_wake_up(vcpu);
> > + }
> >
> > return 0;
> > }
>
> It makes sense indeed to avoid ringing the doorbell when the vCPU is not in
> the guest mode.
>
> I do wonder if we want to call kvm_vcpu_wake_up always otherwise, as the vCPU
> might be just outside of the guest mode and not scheduled out. I don't know
> how expensive is kvm_vcpu_wake_up in this case.
IIUC, you're asking if we should do something like:
if (vcpu->mode == IN_GUEST_MODE) {
<signal doorbell>
} else if (!is_vcpu_loaded(vcpu)) {
kvm_vcpu_wake_up();
}
The answer is that kvm_vcpu_wake_up(), which is effectively rcuwait_wake_up(),
is very cheap except for specific configurations that may or may not be valid for
production[*]. Practically speaking, is_vcpu_loaded() doesn't exist and should
never exist because it's inherently racy. The closest we have would be
else if (vcpu != kvm_get_running_vcpu()) {
kvm_vcpu_wake_up();
}
but that's extremely unlikely to be a net win because getting the current vCPU
requires atomics to disable/re-enable preemption, especially if rcuwait_wake_up()
is modified to avoid the rcu lock/unlock.
TL;DR: rcuwait_wake_up() is cheap, and if it's too expensive, a better optimization
would be to make it less expensive.
[*] https://lkml.kernel.org/r/[email protected]
> Before this patch, the avic_vcpu_is_running would only be false when the vCPU
> is scheduled out (e.g when vcpu_put was done on it)
>
> Best regards,
> Maxim Levitsky
>
On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Remove the vCPU from the wakeup list before updating the notification
> > vector in the posted interrupt post-block helper. There is no need to
> > wake the current vCPU as it is by definition not blocking. Practically
> > speaking this is a nop as it only shaves a few meager cycles in the
> > unlikely case that the vCPU was migrated and the previous pCPU gets a
> > wakeup IRQ right before PID.NV is updated. The real motivation is to
> > allow for more readable code in the future, when post-block is merged
> > with vmx_vcpu_pi_load(), at which point removal from the list will be
> > conditional on the old notification vector.
> >
> > Opportunistically add comments to document why KVM has a per-CPU spinlock
> > that, at first glance, appears to be taken only on the owning CPU.
> > Explicitly call out that the spinlock must be taken with IRQs disabled, a
> > detail that was "lost" when KVM switched from spin_lock_irqsave() to
> > spin_lock(), with IRQs disabled for the entirety of the relevant path.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/vmx/posted_intr.c | 49 +++++++++++++++++++++++-----------
> > 1 file changed, 33 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > index 2b2206339174..901b7a5f7777 100644
> > --- a/arch/x86/kvm/vmx/posted_intr.c
> > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > @@ -10,10 +10,22 @@
> > #include "vmx.h"
> >
> > /*
> > - * We maintain a per-CPU linked-list of vCPU, so in wakeup_handler() we
> > - * can find which vCPU should be waken up.
> > + * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
> Nit: While at it, it would be nice to rename this to pi_wakeup_handler() so
> that it can be more easily found.
Ah, good catch.
> > + * when a WAKEUP_VECTOR interrupted is posted. vCPUs are added to the list when
> > + * the vCPU is scheduled out and is blocking (e.g. in HLT) with IRQs enabled.
> s/interrupted/interrupt ?
>
> Isn't that comment incorrect? As I see it, the PI hardware is set up to use the WAKEUP_VECTOR
> when the vCPU blocks (in pi_pre_block) and then that vCPU is added to the list.
> The pi_wakeup_handler() just goes over the list and wakes up all vCPUs on the list.
Doh, yes. This patch is predicting the future. The comment becomes correct as of
KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
but as of this patch the "scheduled out" piece doesn't hold true.
> > + * The vCPUs posted interrupt descriptor is updated at the same time to set its
> > + * notification vector to WAKEUP_VECTOR, so that posted interrupt from devices
> > + * wake the target vCPUs. vCPUs are removed from the list and the notification
> > + * vector is reset when the vCPU is scheduled in.
> > */
> > static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> Also while at it, why not rename this to 'blocked_vcpu_list',
> to explain that this is a list of blocked vCPUs? It's a per-cpu variable,
> so the 'on_cpu' suffix isn't needed IMHO.
As you noted, addressed in a future patch.
> > +/*
> > + * Protect the per-CPU list with a per-CPU spinlock to handle task migration.
> > + * When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
> > + * ->sched_in() path will need to take the vCPU off the list of the _previous_
> > + * CPU. IRQs must be disabled when taking this lock, otherwise deadlock will
> > + * occur if a wakeup IRQ arrives and attempts to acquire the lock.
> > + */
> > static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> >
> > static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> > @@ -101,23 +113,28 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> > WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
> > "Wakeup handler not enabled while the vCPU was blocking");
> >
> > - dest = cpu_physical_id(vcpu->cpu);
> > - if (!x2apic_mode)
> > - dest = (dest << 8) & 0xFF00;
> > -
> > - do {
> > - old.control = new.control = READ_ONCE(pi_desc->control);
> > -
> > - new.ndst = dest;
> > -
> > - /* set 'NV' to 'notification vector' */
> > - new.nv = POSTED_INTR_VECTOR;
> > - } while (cmpxchg64(&pi_desc->control, old.control,
> > - new.control) != old.control);
> > -
> > + /*
> > + * Remove the vCPU from the wakeup list of the _previous_ pCPU, which
> > + * will not be the same as the current pCPU if the task was migrated.
> > + */
> > spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> > list_del(&vcpu->blocked_vcpu_list);
> > spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> > +
> > + dest = cpu_physical_id(vcpu->cpu);
> > + if (!x2apic_mode)
> > + dest = (dest << 8) & 0xFF00;
> It would be nice to have a function for this, as it appears in this file twice.
> Maybe there is already a function somewhere?
The second instance does go away by the aforementioned:
KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
I'm inclined to say we don't want a helper because there should only ever be one
path that changes PI.ndst. But a comment would definitely help to explain the
difference between xAPIC and x2APIC IDs.
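As a rough sketch of what such a comment could look like (wording is mine, not
taken from the series):

	/*
	 * In x2APIC mode, PID.NDST holds the full 32-bit APIC ID.  In xAPIC
	 * mode, only the 8-bit ID is valid and it lives in bits 15:8 of the
	 * destination field, hence the shift and mask.
	 */
	dest = cpu_physical_id(vcpu->cpu);
	if (!x2apic_mode)
		dest = (dest << 8) & 0xFF00;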
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Drop the avic_vcpu_is_running() check when waking vCPUs in response to a
> VM-Exit due to incomplete IPI delivery. The check isn't wrong per se, but
> it's not 100% accurate in the sense that it doesn't guarantee that the vCPU
> was one of the vCPUs that didn't receive the IPI.
>
> The check isn't required for correctness as blocking == !running in this
> context.
>
> From a performance perspective, waking a live task is not expensive as the
> only moderately costly operation is a locked operation to temporarily
> disable preemption. And if that is indeed a performance issue,
> kvm_vcpu_is_blocking() would be a better check than poking into the AVIC.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/avic.c | 15 +++++++++------
> arch/x86/kvm/svm/svm.h | 11 -----------
> 2 files changed, 9 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index cbf02e7e20d0..b43b05610ade 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -295,13 +295,16 @@ static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
> struct kvm_vcpu *vcpu;
> int i;
>
> + /*
> + * Wake any target vCPUs that are blocking, i.e. waiting for a wake
> + * event. There's no need to signal doorbells, as hardware has handled
> + * vCPUs that were in guest at the time of the IPI, and vCPUs that have
> + * since entered the guest will have processed pending IRQs at VMRUN.
> + */
> kvm_for_each_vcpu(i, vcpu, kvm) {
> - bool m = kvm_apic_match_dest(vcpu, source,
> - icrl & APIC_SHORT_MASK,
> - GET_APIC_DEST_FIELD(icrh),
> - icrl & APIC_DEST_MASK);
> -
> - if (m && !avic_vcpu_is_running(vcpu))
> + if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
> + GET_APIC_DEST_FIELD(icrh),
> + icrl & APIC_DEST_MASK))
> kvm_vcpu_wake_up(vcpu);
> }
> }
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 0d7bbe548ac3..7f5b01bbee29 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -509,17 +509,6 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
>
> #define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL
>
> -static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu)
> -{
> - struct vcpu_svm *svm = to_svm(vcpu);
> - u64 *entry = svm->avic_physical_id_cache;
> -
> - if (!entry)
> - return false;
> -
> - return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
> -}
> -
> int avic_ga_log_notifier(u32 ga_tag);
> void avic_vm_destroy(struct kvm *kvm);
> int avic_vm_init(struct kvm *kvm);
I guess this makes sense to do, to get rid of avic_vcpu_is_running().
As you explained in the previous patch, waking up a live task isn't that expensive,
so let it be.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Always mark the AVIC as "running" on vCPU load when the AVIC is enabled and
> drop the vcpu_blocking/unblocking hooks that toggle "running". There is no
> harm in keeping the flag set for a wee bit longer when a vCPU is blocking,
> i.e. between the start of blocking and being scheduled out. At worst, an
> agent in the host will unnecessarily signal the doorbell, but that's
> already the status quo in KVM as the "running" flag is set the entire time
> a vCPU is loaded, not just when it's actively running the guest.
>
> In addition to simplifying the code, keeping the "running" flag set longer
> can reduce the number of VM-Exits due to incomplete IPI delivery.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/avic.c | 53 +++++++++++++----------------------------
> arch/x86/kvm/svm/svm.c | 8 -------
> arch/x86/kvm/svm/svm.h | 3 ---
> 3 files changed, 17 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index b43b05610ade..213f5223f63e 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -967,6 +967,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> int h_physical_id = kvm_cpu_get_apicid(cpu);
> struct vcpu_svm *svm = to_svm(vcpu);
>
> + /* TODO: Document why the unblocking path checks for updates. */
> + if (kvm_vcpu_is_blocking(vcpu) &&
> + kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) {
> + kvm_vcpu_update_apicv(vcpu);
> +
> + if (!kvm_vcpu_apicv_active(vcpu))
> + return;
> + }
> +
> /*
> * Since the host physical APIC id is 8 bits,
> * we can support host APIC ID upto 255.
> @@ -974,19 +983,21 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
> return;
>
> + /*
> + * Unconditionally mark the AVIC as "running", even if the vCPU is in
> + * kvm_vcpu_block(). kvm_vcpu_check_block() will detect pending IRQs
> + * and bail out of the block loop, and if not, avic_vcpu_put() will
> + * set the AVIC back to "not running" when the vCPU is scheduled out.
> + */
> entry = READ_ONCE(*(svm->avic_physical_id_cache));
> WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
>
> entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
> entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
> -
> - entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> - if (svm->avic_is_running)
> - entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> + entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
>
> WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
> - avic_update_iommu_vcpu_affinity(vcpu, h_physical_id,
> - svm->avic_is_running);
> + avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
> }
>
> void avic_vcpu_put(struct kvm_vcpu *vcpu)
> @@ -1001,33 +1012,3 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
> entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
> }
> -
> -/*
> - * This function is called during VCPU halt/unhalt.
> - */
> -static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
> -{
> - struct vcpu_svm *svm = to_svm(vcpu);
> -
> - svm->avic_is_running = is_run;
> -
> - if (!kvm_vcpu_apicv_active(vcpu))
> - return;
> -
> - if (is_run)
> - avic_vcpu_load(vcpu, vcpu->cpu);
> - else
> - avic_vcpu_put(vcpu);
> -}
> -
> -void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
> -{
> - avic_set_running(vcpu, false);
> -}
> -
> -void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
> -{
> - if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
> - kvm_vcpu_update_apicv(vcpu);
> - avic_set_running(vcpu, true);
> -}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 89077160d463..a1ca5707f2c8 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1433,12 +1433,6 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
> if (err)
> goto error_free_vmsa_page;
>
> - /* We initialize this flag to true to make sure that the is_running
> - * bit would be set the first time the vcpu is loaded.
> - */
> - if (irqchip_in_kernel(vcpu->kvm) && kvm_apicv_activated(vcpu->kvm))
> - svm->avic_is_running = true;
> -
> svm->msrpm = svm_vcpu_alloc_msrpm();
> if (!svm->msrpm) {
> err = -ENOMEM;
> @@ -4597,8 +4591,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> .prepare_guest_switch = svm_prepare_guest_switch,
> .vcpu_load = svm_vcpu_load,
> .vcpu_put = svm_vcpu_put,
> - .vcpu_blocking = svm_vcpu_blocking,
> - .vcpu_unblocking = svm_vcpu_unblocking,
>
> .update_exception_bitmap = svm_update_exception_bitmap,
> .get_msr_feature = svm_get_msr_feature,
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 7f5b01bbee29..652d71acfb6c 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -169,7 +169,6 @@ struct vcpu_svm {
> u32 dfr_reg;
> struct page *avic_backing_page;
> u64 *avic_physical_id_cache;
> - bool avic_is_running;
>
> /*
> * Per-vcpu list of struct amd_svm_iommu_ir:
> @@ -529,8 +528,6 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec);
> bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
> uint32_t guest_irq, bool set);
> -void svm_vcpu_blocking(struct kvm_vcpu *vcpu);
> -void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
>
> /* sev.c */
>
Looks good. It is nice to get rid of all of this logic that was just making things more complicated.
Something else that would be nice to do here, which I didn't finish back when I worked on AVIC, would be
to rename avic_vcpu_load/avic_vcpu_put, because those are also now run on AVIC inhibit/uninhibit.
Basically, svm_refresh_apicv_exec_ctrl() is the full AVIC activate/deactivate, while
avic_vcpu_load/avic_vcpu_put are the lighter-weight partial AVIC activation/deactivation functions.
So minus the comment from Paolo about updating the AVIC on unblock, which I missed back when I wrote
my patches:
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Remove kvm_arch_vcpu_(un)blocking() now that all implementations are nops.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/arm64/kvm/arm.c | 10 ----------
> arch/mips/include/asm/kvm_host.h | 2 --
> arch/powerpc/include/asm/kvm_host.h | 2 --
> arch/s390/include/asm/kvm_host.h | 2 --
> arch/x86/include/asm/kvm-x86-ops.h | 2 --
> arch/x86/include/asm/kvm_host.h | 13 -------------
> include/linux/kvm_host.h | 2 --
> virt/kvm/kvm_main.c | 4 ----
> 8 files changed, 37 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9ff0e85a9f16..444d6f5a980a 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -357,16 +357,6 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
> return kvm_timer_is_pending(vcpu);
> }
>
> -void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
> -{
> -
> -}
> -
> -void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
> -{
> -
> -}
> -
> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct kvm_s2_mmu *mmu;
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 72b90d45a46e..28110f71089b 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -895,8 +895,6 @@ static inline void kvm_arch_free_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot) {}
> static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
> static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> -static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> -static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
>
> #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB
> int kvm_arch_flush_remote_tlb(struct kvm *kvm);
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 4a195c161592..0dfee6866541 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -863,7 +863,5 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
> static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
> static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> static inline void kvm_arch_exit(void) {}
> -static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> -static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
>
> #endif /* __POWERPC_KVM_HOST_H__ */
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index a22c9266ea05..25ed4ec66f4a 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -1007,7 +1007,5 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
> static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
> static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot) {}
> -static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> -static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
>
> #endif
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index c2b007171abd..f2c38acdcad6 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -96,8 +96,6 @@ KVM_X86_OP(handle_exit_irqoff)
> KVM_X86_OP_NULL(request_immediate_exit)
> KVM_X86_OP(sched_in)
> KVM_X86_OP_NULL(update_cpu_dirty_logging)
> -KVM_X86_OP_NULL(vcpu_blocking)
> -KVM_X86_OP_NULL(vcpu_unblocking)
> KVM_X86_OP_NULL(update_pi_irte)
> KVM_X86_OP_NULL(start_assignment)
> KVM_X86_OP_NULL(apicv_post_state_restore)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 76a8dddc1a48..bebd42926321 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1445,9 +1445,6 @@ struct kvm_x86_ops {
> const struct kvm_pmu_ops *pmu_ops;
> const struct kvm_x86_nested_ops *nested_ops;
>
> - void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
> - void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
> -
> int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
> uint32_t guest_irq, bool set);
> void (*start_assignment)(struct kvm *kvm);
> @@ -1904,16 +1901,6 @@ static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq)
> irq->delivery_mode == APIC_DM_LOWEST);
> }
>
> -static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
> -{
> - static_call_cond(kvm_x86_vcpu_blocking)(vcpu);
> -}
> -
> -static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
> -{
> - static_call_cond(kvm_x86_vcpu_unblocking)(vcpu);
> -}
> -
> static inline int kvm_cpu_get_apicid(int mps_cpu)
> {
> #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c5961a361c73..6a84b020daa6 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -966,8 +966,6 @@ void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
>
> void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
> bool kvm_vcpu_block(struct kvm_vcpu *vcpu);
> -void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
> -void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
> bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
> void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
> int kvm_vcpu_yield_to(struct kvm_vcpu *target);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c1850b60f38b..96de905e26e4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3210,8 +3210,6 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
>
> vcpu->stat.generic.blocking = 1;
>
> - kvm_arch_vcpu_blocking(vcpu);
> -
> prepare_to_rcuwait(wait);
> for (;;) {
> set_current_state(TASK_INTERRUPTIBLE);
> @@ -3224,8 +3222,6 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
> }
> finish_rcuwait(wait);
>
> - kvm_arch_vcpu_unblocking(vcpu);
> -
> vcpu->stat.generic.blocking = 0;
>
> return waited;
Reviewed-by: Maxim Levitsky <[email protected]>
On Thu, 2021-10-28 at 00:09 +0200, Paolo Bonzini wrote:
> On 27/10/21 18:04, Sean Christopherson wrote:
> > > > + /*
> > > > + * The smp_wmb() in kvm_make_request() pairs with the smp_mb_*()
> > > > + * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU
> > > > + * is guaranteed to see the event request if triggering a posted
> > > > + * interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
> > >
> > > What this smp_wmb() pair with, is the smp_mb__after_atomic in
> > > kvm_check_request(KVM_REQ_EVENT, vcpu).
> >
> > I don't think that's correct. There is no kvm_check_request() in the relevant path.
> > kvm_vcpu_exit_request() uses kvm_request_pending(), which is just a READ_ONCE()
> > without a barrier.
>
> Ok, we are talking about two different set of barriers. This is mine:
>
> - smp_wmb() in kvm_make_request() pairs with the smp_mb__after_atomic() in
> kvm_check_request(); it ensures that everything before the request
> (in this case, pi_pending = true) is seen by inject_pending_event.
>
> - pi_test_and_set_on() orders the write to ON after the write to PIR,
> pairing with vmx_sync_pir_to_irr and ensuring that the bit in the PIR is
> seen.
>
> And this is yours:
>
> - pi_test_and_set_on() _also_ orders the write to ON before the read of
> vcpu->mode, pairing with vcpu_enter_guest()
>
> - kvm_make_request() however does _not_ order the write to
> vcpu->requests before the read of vcpu->mode, even though it's needed.
> Usually that's handled by kvm_vcpu_exiting_guest_mode(), but in this case
> vcpu->mode is read in kvm_vcpu_trigger_posted_interrupt.
Yes indeed, kvm_make_request() writes vcpu->requests after the memory barrier,
and then there is no barrier until the read of vcpu->mode in kvm_vcpu_trigger_posted_interrupt().
>
> So vmx_deliver_nested_posted_interrupt() is missing a smp_mb__after_atomic().
> It's documentation only for x86, but still easily done in v3.
>
> Paolo
>
I used this patch as a justification to read Paolo's excellent LWN series of articles on memory barriers,
to refresh my knowledge of memory barriers and understand the above analysis better.
https://lwn.net/Articles/844224/
I agree with the above, but this is something that is so easy to get wrong
that I can't be 100% sure.
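For my own reference, here is a minimal sketch of the barrier placement Paolo
describes (assumed placement in vmx_deliver_nested_posted_interrupt(); the
actual v3 patch may end up looking different):

	kvm_make_request(KVM_REQ_EVENT, vcpu);
	/*
	 * Order the request write before the vcpu->mode read in
	 * kvm_vcpu_trigger_posted_interrupt().  On x86 this is a compiler
	 * barrier only, i.e. documentation.
	 */
	smp_mb__after_atomic();
	/* the PIR and ON have been set by L1. */
	if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
		kvm_vcpu_wake_up(vcpu);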
Best regards,
Maxim Levitsky
On Wed, 2021-10-27 at 15:30 +0000, Sean Christopherson wrote:
> On Mon, Oct 25, 2021, Paolo Bonzini wrote:
> > On 09/10/21 04:12, Sean Christopherson wrote:
> > > Lastly, this aligns the non-nested and nested usage of triggering posted
> > > interrupts, and will allow for additional cleanups.
> >
> > It also aligns with SVM a little bit more (especially given patch 35),
> > doesn't it?
>
> Yes, aligning VMX and SVM APICv behavior as much as possible is definitely a goal
> of this series, though I suspect I failed to state that anywhere.
>
Looks reasonable to me.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Refactor the posted interrupt helper to take the desired notification
> vector instead of a bool so that the callers are self-documenting.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 8 +++-----
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 78c8bc7f1b3b..f505fee3cf5c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3928,11 +3928,9 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
> }
>
> static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
> - bool nested)
> + int pi_vec)
> {
> #ifdef CONFIG_SMP
> - int pi_vec = nested ? POSTED_INTR_NESTED_VECTOR : POSTED_INTR_VECTOR;
> -
> if (vcpu->mode == IN_GUEST_MODE) {
> /*
> * The vector of interrupt to be delivered to vcpu had
> @@ -3986,7 +3984,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
> */
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> /* the PIR and ON have been set by L1. */
> - if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
> + if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR))
> kvm_vcpu_wake_up(vcpu);
> return 0;
> }
> @@ -4024,7 +4022,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
> * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
> * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
> */
> - if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
> + if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR))
> kvm_vcpu_wake_up(vcpu);
>
> return 0;
I both like and don't like this patch.
It is indeed a bit more self-documenting, but it also allows the caller to
pass something other than POSTED_INTR_NESTED_VECTOR/POSTED_INTR_VECTOR, which
would fail.
Maybe add an assert?
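Something like this rough sketch, just to illustrate (my wording, not from the
series):

	WARN_ON_ONCE(pi_vec != POSTED_INTR_VECTOR &&
		     pi_vec != POSTED_INTR_NESTED_VECTOR);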
I won't do any bikeshedding about this though, so
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Move the fallback "wake_up" path into the helper to trigger posted
> interrupt helper now that the nested and non-nested paths are identical.
Nit: I think you refer to patch 41 here, but the nested and non-nested paths were already identical
before, so this patch could also be done without patch 41.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f505fee3cf5c..b0d97cf18c34 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3927,7 +3927,7 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
> pt_update_intercept_for_msr(vcpu);
> }
>
> -static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
> +static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
> int pi_vec)
> {
> #ifdef CONFIG_SMP
> @@ -3958,10 +3958,15 @@ static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
> */
>
> apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
> - return true;
> + return;
> }
> #endif
> - return false;
> + /*
> + * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
> + * otherwise do nothing as KVM will grab the highest priority pending
> + * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
> + */
> + kvm_vcpu_wake_up(vcpu);
> }
>
> static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
> @@ -3984,8 +3989,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
> */
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> /* the PIR and ON have been set by L1. */
> - if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR))
> - kvm_vcpu_wake_up(vcpu);
> + kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR);
> return 0;
> }
> return -1;
> @@ -4022,9 +4026,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
> * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
> * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
> */
> - if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR))
> - kvm_vcpu_wake_up(vcpu);
> -
> + kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
> return 0;
> }
>
On Mon, 2021-10-25 at 16:16 +0200, Paolo Bonzini wrote:
> On 09/10/21 04:12, Sean Christopherson wrote:
> > When waking vCPUs in the posted interrupt wakeup handling, do exactly
> > that and no more. There is no need to kick the vCPU as the wakeup
> > handler just needs to get the vCPU task running, and if it's in the guest
> > then it's definitely running.
>
> And more important, the transition from blocking to running will have
> gone through sync_pir_to_irr, thus checking ON and manually moving the
> vector from PIR to RVI.
>
> Paolo
>
I also think so, and maybe this can be added to the commit message.
Anyway, last one for the series :)
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Thu, 2021-10-28 at 15:55 +0000, Sean Christopherson wrote:
> On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> > On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > > Use READ_ONCE() when loading the posted interrupt descriptor control
> > > field to ensure "old" and "new" have the same base value. If the
> > > compiler emits separate loads, and loads into "new" before "old", KVM
> > > could theoretically drop the ON bit if it were set between the loads.
> > >
> > > Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/posted_intr.c | 6 +++---
> > > 1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > > index 414ea6972b5c..fea343dcc011 100644
> > > --- a/arch/x86/kvm/vmx/posted_intr.c
> > > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > > @@ -53,7 +53,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> > >
> > > /* The full case. */
> > > do {
> > > - old.control = new.control = pi_desc->control;
> > > + old.control = new.control = READ_ONCE(pi_desc->control);
> > >
> > > dest = cpu_physical_id(cpu);
> > >
> > > @@ -104,7 +104,7 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> > > "Wakeup handler not enabled while the vCPU was blocking");
> > >
> > > do {
> > > - old.control = new.control = pi_desc->control;
> > > + old.control = new.control = READ_ONCE(pi_desc->control);
> > >
> > > dest = cpu_physical_id(vcpu->cpu);
> > >
> > > @@ -160,7 +160,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> > > "Posted Interrupt Suppress Notification set before blocking");
> > >
> > > do {
> > > - old.control = new.control = pi_desc->control;
> > > + old.control = new.control = READ_ONCE(pi_desc->control);
> > >
> > > /* set 'NV' to 'wakeup vector' */
> > > new.nv = POSTED_INTR_WAKEUP_VECTOR;
> >
> > I wish there was a way to mark fields in a struct, as requiring 'READ_ONCE' on them
> > so that compiler would complain if this isn't done, or automatically use 'READ_ONCE'
> > logic.
>
> Hmm, I think you could make an argument that ON and thus the whole "control"
> word should be volatile. AFAICT, tagging just "on" as volatile actually works.
> There's even a clause in Documentation/process/volatile-considered-harmful.rst
> that calls this out as a (potentially) legitimate use case.
>
> - Pointers to data structures in coherent memory which might be modified
> by I/O devices can, sometimes, legitimately be volatile.
>
> That said, I think I actually prefer forcing the use of READ_ONCE. The descriptor
> requires more protections than what volatile provides, namely that all writes need
> to be atomic. So given that volatile alone isn't sufficient, I'd prefer to have
> the code itself be more self-documenting.
I took a look at how READ_ONCE/WRITE_ONCE is implemented and indeed they use volatile
(the comment above __READ_ONCE is worth gold...), so there is a bit of a contradiction:
volatile-considered-harmful.rst states not to mark struct members volatile since
you usually need more than that (very true, often), and yet I have also heard that
READ_ONCE/WRITE_ONCE is strongly encouraged for fields that are used in lockless
algorithms, even when not strictly needed.
So why not just mark the field and then use it normally? I guess that an
explicit READ_ONCE/WRITE_ONCE is much more readable/visible than a volatile in some header file.
Anyway this isn't something I am going to argue about or push to be changed,
just something I thought about.
Best regards,
Maxim Levitsky
>
> E.g. this compiles and does not mess up the expected size.
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index 7f7b2326caf5..149df3b18789 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -11,9 +11,9 @@ struct pi_desc {
> union {
> struct {
> /* bit 256 - Outstanding Notification */
> - u16 on : 1,
> + volatile u16 on : 1;
> /* bit 257 - Suppress Notification */
> - sn : 1,
> + u16 sn : 1,
> /* bit 271:258 - Reserved */
> rsvd_1 : 14;
> /* bit 279:272 - Notification Vector */
> @@ -23,7 +23,7 @@ struct pi_desc {
> /* bit 319:288 - Notification Destination */
> u32 ndst;
> };
> - u64 control;
> + volatile u64 control;
> };
> u32 rsvd[6];
> } __aligned(64);
>
On Thu, 2021-10-28 at 16:12 +0000, Sean Christopherson wrote:
> On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> > On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > > Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor
> > > out of the loop to write the descriptor, preemption is disabled so the
> > > CPU won't change, and if the APIC ID changes KVM has bigger problems.
> > >
> > > No functional change intended.
> >
> > Is preemption always disabled in vmx_vcpu_pi_load? vmx_vcpu_pi_load is called
> > from vmx_vcpu_load, which is called indirectly from vcpu_load which is called
> > from many ioctls, which userspace does. In these places I don't think that
> > preemption is disabled.
>
> Preemption is disabled in vcpu_load() by the get_cpu(). The "cpu" param that's
> passed around the vcpu_load() stack is also why I think it's ok to _not_ assert
> that preemption is disabled in vmx_vcpu_pi_load(); if preemption is enabled,
> "cpu" is unstable and thus the entire "load" operation is busted.
Yes, I even knew about the get_cpu() behavior, which indeed has to disable preemption.
But I didn't notice the call to it when I wrote this mail! Later I did notice it, but it was
too late. Sometimes sending all the review mails at once at the end does make sense after all,
I guess.
Best regards,
Maxim Levitsky
>
>
> #define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
> #define put_cpu() preempt_enable()
>
>
> void vcpu_load(struct kvm_vcpu *vcpu)
> {
> int cpu = get_cpu();
>
> __this_cpu_write(kvm_running_vcpu, vcpu);
> preempt_notifier_register(&vcpu->preempt_notifier);
> kvm_arch_vcpu_load(vcpu, cpu);
> put_cpu();
> }
> EXPORT_SYMBOL_GPL(vcpu_load);
>
On Thu, 2021-10-28 at 17:19 +0000, Sean Christopherson wrote:
> On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> > On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > > Remove the vCPU from the wakeup list before updating the notification
> > > vector in the posted interrupt post-block helper. There is no need to
> > > wake the current vCPU as it is by definition not blocking. Practically
> > > speaking this is a nop as it only shaves a few meager cycles in the
> > > unlikely case that the vCPU was migrated and the previous pCPU gets a
> > > wakeup IRQ right before PID.NV is updated. The real motivation is to
> > > allow for more readable code in the future, when post-block is merged
> > > with vmx_vcpu_pi_load(), at which point removal from the list will be
> > > conditional on the old notification vector.
> > >
> > > Opportunistically add comments to document why KVM has a per-CPU spinlock
> > > that, at first glance, appears to be taken only on the owning CPU.
> > > Explicitly call out that the spinlock must be taken with IRQs disabled, a
> > > detail that was "lost" when KVM switched from spin_lock_irqsave() to
> > > spin_lock(), with IRQs disabled for the entirety of the relevant path.
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/posted_intr.c | 49 +++++++++++++++++++++++-----------
> > > 1 file changed, 33 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > > index 2b2206339174..901b7a5f7777 100644
> > > --- a/arch/x86/kvm/vmx/posted_intr.c
> > > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > > @@ -10,10 +10,22 @@
> > > #include "vmx.h"
> > >
> > > /*
> > > - * We maintain a per-CPU linked-list of vCPU, so in wakeup_handler() we
> > > - * can find which vCPU should be waken up.
> > > + * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
> > Nit: While at it, it would be nice to rename this to pi_wakeup_handler() so
> > that it can be more easily found.
>
> Ah, good catch.
>
> > > + * when a WAKEUP_VECTOR interrupted is posted. vCPUs are added to the list when
> > > + * the vCPU is scheduled out and is blocking (e.g. in HLT) with IRQs enabled.
> > s/interrupted/interrupt ?
> >
> > Isn't that comment incorrect? As I see it, the PI hardware is set up to use the WAKEUP_VECTOR
> > when the vCPU blocks (in pi_pre_block) and then that vCPU is added to the list.
> > The pi_wakeup_handler() just goes over the list and wakes up all vCPUs on the list.
>
> Doh, yes. This patch is predicting the future. The comment becomes correct as of
>
> KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
>
> but as of this patch the "scheduled out" piece doesn't hold true.
>
> > > + * The vCPUs posted interrupt descriptor is updated at the same time to set its
> > > + * notification vector to WAKEUP_VECTOR, so that posted interrupt from devices
> > > + * wake the target vCPUs. vCPUs are removed from the list and the notification
> > > + * vector is reset when the vCPU is scheduled in.
> > > */
> > > static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> > Also while at it, why not to rename this to 'blocked_vcpu_list'?
> > to explain that this is list of blocked vcpus. Its a per-cpu variable
> > so 'on_cpu' suffix isn't needed IMHO.
>
> As you noted, addressed in a future patch.
>
> > > +/*
> > > + * Protect the per-CPU list with a per-CPU spinlock to handle task migration.
> > > + * When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
> > > + * ->sched_in() path will need to take the vCPU off the list of the _previous_
> > > + * CPU. IRQs must be disabled when taking this lock, otherwise deadlock will
> > > + * occur if a wakeup IRQ arrives and attempts to acquire the lock.
> > > + */
> > > static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> > >
> > > static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> > > @@ -101,23 +113,28 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
> > > WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
> > > "Wakeup handler not enabled while the vCPU was blocking");
> > >
> > > - dest = cpu_physical_id(vcpu->cpu);
> > > - if (!x2apic_mode)
> > > - dest = (dest << 8) & 0xFF00;
> > > -
> > > - do {
> > > - old.control = new.control = READ_ONCE(pi_desc->control);
> > > -
> > > - new.ndst = dest;
> > > -
> > > - /* set 'NV' to 'notification vector' */
> > > - new.nv = POSTED_INTR_VECTOR;
> > > - } while (cmpxchg64(&pi_desc->control, old.control,
> > > - new.control) != old.control);
> > > -
> > > + /*
> > > + * Remove the vCPU from the wakeup list of the _previous_ pCPU, which
> > > + * will not be the same as the current pCPU if the task was migrated.
> > > + */
> > > spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> > > list_del(&vcpu->blocked_vcpu_list);
> > > spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> > > +
> > > + dest = cpu_physical_id(vcpu->cpu);
> > > + if (!x2apic_mode)
> > > + dest = (dest << 8) & 0xFF00;
> > It would be nice to have a function for this, this appears in this file twice.
> > Maybe there is a function already somewhere?
>
> The second instance does go away by the aforementioned:
Then no need for a helper.
>
> KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
>
> I'm inclined to say we don't want a helper because there should only ever be one
> path that changes PI.ndst. But a comment would definitely help to explain the
> difference between xAPIC and x2APIC IDs.
>
Makes sense!
Best regards,
Maxim Levitsky
On Mon, Nov 01, 2021, Maxim Levitsky wrote:
> On Thu, 2021-10-28 at 15:55 +0000, Sean Christopherson wrote:
> > On Thu, Oct 28, 2021, Maxim Levitsky wrote:
> > > On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > > I wish there was a way to mark fields in a struct, as requiring 'READ_ONCE' on them
> > > so that compiler would complain if this isn't done, or automatically use 'READ_ONCE'
> > > logic.
> >
> > Hmm, I think you could make an argument that ON and thus the whole "control"
> > word should be volatile. AFAICT, tagging just "on" as volatile actually works.
> > There's even a clause in Documentation/process/volatile-considered-harmful.rst
> > that calls this out as a (potentially) legitimate use case.
> >
> > - Pointers to data structures in coherent memory which might be modified
> > by I/O devices can, sometimes, legitimately be volatile.
> >
> > That said, I think I actually prefer forcing the use of READ_ONCE. The descriptor
> > requires more protections than what volatile provides, namely that all writes need
> > to be atomic. So given that volatile alone isn't sufficient, I'd prefer to have
> > the code itself be more self-documenting.
>
> I took a look at how READ_ONCE/WRITE_ONCE is implemented and indeed they use volatile
> (the comment above __READ_ONCE is worth gold...), so there is a bit of a contradiction:
>
> volatile-considered-harmful.rst states not to mark struct members volatile since
> you usually need more than that (very true, often), and yet I have also heard that
> READ_ONCE/WRITE_ONCE is strongly encouraged for fields that are used in lockless
> algorithms, even when not strictly needed.
> So why not just mark the field and then use it normally? I guess that an
> explicit READ_ONCE/WRITE_ONCE is much more readable/visible than a volatile
> in some header file.
Are you asking about this PI field in particular, or for any field in general?
In this particular case, visibility and documentation are really the only difference;
functionally the result is the same. But that's also very much related to why this
case gets the exception listed above. The "use it normally" part is also why I
don't want to tag the field volatile since writing the field absolutely cannot be
done "normally", it must be done atomically, and volatile doesn't capture that
detail.
If you're asking about fields in general, the "volatile is harmful" guideline is
to deter usage of volatile for cases where the field/variable in question is not
intrinsically volatile. As the docs call out, using volatile in those cases often
leads to worse code generation because the compiler is disallowed from optimizing
accesses that are protected through other mechanisms.
A good example in x86 KVM is the READ_ONCE(sp->unsync) in mmu_try_to_unsync_pages() to
force the compiler to emit a load of sp->unsync after acquiring mmu_unsync_pages_lock.
Tagging "unsync" as volatile is unnecessary since the vast majority of its usage is
protected by holding a spinlock for write, and would prevent optimizing references in
kvm_mmu_get_page() and other flows that are protected by mmu_lock in the legacy MMU.
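To illustrate the pattern with a simplified sketch (not the exact KVM MMU code,
just the shape of it):

	spin_lock(&kvm->arch.mmu_unsync_pages_lock);
	/*
	 * The READ_ONCE() forces the compiler to emit a fresh load of
	 * sp->unsync now that the lock is held, rather than reusing a value
	 * loaded earlier, without tagging the field volatile for all of its
	 * other (lock-protected) accesses.
	 */
	unsync = READ_ONCE(sp->unsync);
	...
	spin_unlock(&kvm->arch.mmu_unsync_pages_lock);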
On Wed, 2021-10-27 at 16:40 +0300, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Invoke the arch hooks for block+unblock if and only if KVM actually
> > attempts to block the vCPU. The only non-nop implementation is on x86,
> > specifically SVM's AVIC, and there is no need to put the AVIC prior to
> > halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR
> > to find pending IRQs regardless of whether the AVIC is loaded/"running".
> >
> > The primary motivation is to allow future cleanup to split out "block"
> > from "halt", but this is also likely a small performance boost on x86 SVM
> > when halt-polling is successful.
> >
> > Adjust the post-block path to update "cur" after unblocking, i.e. include
> > AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior
> > is consistent. Moving just the pre-block arch hook would result in only
> > the AVIC put latency being included in the halt_wait stats. There is no
> > obvious evidence that one way or the other is correct, so just ensure KVM
> > is consistent.
> >
> > Note, x86 has two separate paths for handling APICv with respect to vCPU
> > blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
> > hooks in kvm_vcpu_block(). Prior to this path, the two paths were more
> > or less functionally identical. That is very much not the case after
> > this patch, as the hooks used by VMX _must_ fire before halt-polling.
> > x86's entire mess will be cleaned up in future patches.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > virt/kvm/kvm_main.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index f90b3ed05628..227f6bbe0716 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> > bool waited = false;
> > u64 block_ns;
> >
> > - kvm_arch_vcpu_blocking(vcpu);
> > -
> > start = cur = poll_end = ktime_get();
> > if (do_halt_poll) {
> > ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> > @@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> > } while (kvm_vcpu_can_poll(cur, stop));
> > }
> >
> > + kvm_arch_vcpu_blocking(vcpu);
> >
> > prepare_to_rcuwait(wait);
> > for (;;) {
> > @@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> > schedule();
> > }
> > finish_rcuwait(wait);
> > +
> > + kvm_arch_vcpu_unblocking(vcpu);
> > +
> > cur = ktime_get();
> > if (waited) {
> > vcpu->stat.generic.halt_wait_ns +=
> > @@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> > ktime_to_ns(cur) - ktime_to_ns(poll_end));
> > }
> > out:
> > - kvm_arch_vcpu_unblocking(vcpu);
> > block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
> >
> > /*
>
> Makes sense.
>
> Reviewed-by: Maxim Levitsky <[email protected]>
>
> Best regards,
> Maxim Levitsky
So...
Last week I decided to study a bit how AVIC behaves when vCPUs are not 100% running
(aka no cpu_pm=on), mostly to understand their so-called 'GA log' thing.
(The idea is that when you tell the IOMMU that a vCPU is not running,
the IOMMU starts logging all incoming passed-through interrupts to a ring buffer,
and raises its own interrupt, whose handler is supposed to wake up the VM's vCPU.)
That led me to discover that AMD's IOMMU is totally busted after a suspend/resume cycle.
Fixing that took me a few days (and most of the time I worried that it was some sort of BIOS bug
which nobody would fix, as IOMMU interrupt delivery was totally busted after resume, and sometimes
even a power cycle didn't help to revive it - phew...).
Luckily I did fix it, and the patches are waiting for review upstream.
(https://www.spinics.net/lists/kernel/msg4153488.html)
Another thing I discovered is that this patch series totally breaks my VMs without cpu_pm=on.
The whole series (I haven't bisected it yet) makes even my fedora32 VM very laggy, almost unusable,
and it only has one passed-through device (a NIC).
If I apply only the series up to this patch, though, my fedora VM seems to work fine, but
my Windows VM still locks up hard when I run 'LatencyTop' in it, which doesn't happen without this patch.
So far the symptoms I see are that on vCPU 0, ISR has a quite high interrupt set (0xe1 the last time I saw it),
TPR and PPR are 0xe0 (although I have seen TPR with different values), and IRR has plenty of interrupts
with lower priority. The VM seems to be stuck in this state, as if its EOI got lost or something is preventing
the IRQ handler from issuing an EOI.
LatencyTop does install some form of kernel driver which likely does meddle with interrupts (maybe it sends lots of self IPIs?).
This is 100% reproducible as soon as I start monitoring with LatencyTop.
Without this patch it works (or if halt polling is disabled),
but I still did manage to lock up the VM a few times, after a lot of random clicking/starting up various apps while LatencyTop was running,
etc. In that case, though, when I dump the local APIC via qemu's hmp interface the VM instantly revives, which might be either the same bug
amplified by this patch, or something else.
That was tested on the pure 5.15.0 kernel without any patches.
It is possible that this is a bug in LatencyTop that just got exposed by different timing.
The Windows VM does have a GPU and a few USB controllers passed through to it, and without them, in pure VM mode, as I call it,
LatencyTop seems to work.
Tomorrow I'll give it a more formal investigation.
Best regards,
Maxim Levitsky
On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> (This thing is that when you tell the IOMMU that a vCPU is not running,
> Another thing I discovered that this patch series totally breaks my VMs,
> without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> fedora32 VM be very laggy, almost unusable, and it only has one
> passed-through device, a nic).
Grrrr, the complete lack of comments in the KVM code and the separate paths for
VMX vs SVM when handling HLT with APICv make this all way more difficult to
understand than it should be.
The hangs are likely due to:
KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)
If a posted interrupt arrives after KVM has done its final search through the vIRR,
but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
notification after switching to the wakeup vector.
For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
would have to rewalk the vIRR), no easy way to signal the current CPU to do the wakeup (I
don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
no simplification of the load/put code.
If the scheduler were changed to support waking in the sched_out path, then I'd be
more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
time, but for now it's not worth it.
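For reference, the shape of the missing check — conceptually what VMX gets by looking for an
outstanding notification after switching to the wakeup vector — might look something like the
sketch below. All names here are illustrative stand-ins, not actual KVM code; the point is only
the ordering: switch the notification mode first, then re-scan for anything posted before the
switch took effect.

/*
 * Illustrative sketch only; these are not the real KVM/AVIC structures.
 */
#include <stdbool.h>

struct toy_vcpu {
	bool is_running;	/* IsRunning bit / notification mode */
	bool irr_pending;	/* "some bit is set in the vIRR" */
};

#define smp_mb()	__atomic_thread_fence(__ATOMIC_SEQ_CST)

static void wake_vcpu(struct toy_vcpu *v)
{
	(void)v;	/* kick/wake the vCPU task */
}

static void toy_prepare_to_block(struct toy_vcpu *v)
{
	/* 1. Route new interrupts to the GA-log/wakeup path first. */
	__atomic_store_n(&v->is_running, false, __ATOMIC_RELAXED);

	smp_mb();	/* pairs with the barrier on the injection side */

	/*
	 * 2. An interrupt posted before step 1 set a vIRR bit while
	 *    IsRunning was still true, so it will never produce a wakeup
	 *    event on its own; catch it here.
	 */
	if (__atomic_load_n(&v->irr_pending, __ATOMIC_RELAXED))
		wake_vcpu(v);
}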
> If I apply though only the patch series up to this patch, my fedora VM seems
> to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> in it, which doesn't happen without this patch.
Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest?
The only search results I can find for LatencyTop are Linux specific.
> So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt
> (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to
> have different values), and IRR has plenty of interrupts with lower priority.
> The VM seems to be stuck in this case. As if its EOI got lost or something is
> preventing the IRQ handler from issuing EOI.
>
> LatencyTop does install some form of a kernel driver which likely does meddle
> with interrupts (maybe it sends lots of self IPIs?).
>
> 100% reproducible as soon as I start monitoring with LatencyTop.
>
> Without this patch it works (or if disabling halt polling),
Huh. I assume everything works if you disable halt polling _without_ this patch
applied?
If so, that implies that successful halt polling without mucking with vCPU IOMMU
affinity is somehow problematic. I can't think of any relevant side effects other
than timing.
On 11/29/21 18:25, Sean Christopherson wrote:
>> If I apply though only the patch series up to this patch, my fedora VM seems
>> to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
>> in it, which doesn't happen without this patch.
>
> Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> The only search results I can find for LatencyTop are Linux specific.
I think it's LatencyMon, https://www.resplendence.com/latencymon.
Paolo
On 11/29/21 19:55, Sean Christopherson wrote:
>> Still it does seem to be a race that happens when IS_RUNNING=true but
>> vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to
>> trigger because it moves IS_RUNNING=false later.
>
> Oh! Any chance the bug only repros with preemption enabled? That would explain
> why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.
Me too.
> svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU,
> avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
That would make a lot of sense. avic_vcpu_load() can handle
svm->avic_is_running = false, but avic_set_running still needs its body
wrapped by preempt_disable/preempt_enable.
Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his
own build so it would not surprise me if he used CONFIG_PREEMPT=y.
Paolo
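A rough sketch of what Paolo is suggesting, i.e. keeping the task from migrating between reading
vcpu->cpu and updating the physical ID table entry. This is a simplification of the
avic_set_running() that existed at the time, not necessarily the change that was eventually
committed:

static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/*
	 * Disable preemption so that vcpu->cpu cannot change between the
	 * read below and the write to the physical ID table entry.
	 */
	preempt_disable();

	svm->avic_is_running = is_run;
	if (is_run)
		avic_vcpu_load(vcpu, vcpu->cpu);
	else
		avic_vcpu_put(vcpu);

	preempt_enable();
}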
On Mon, Nov 29, 2021, Paolo Bonzini wrote:
> On 11/29/21 18:25, Sean Christopherson wrote:
> > If a posted interrupt arrives after KVM has done its final search through the vIRR,
> > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> >
> > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> > notification after switching to the wakeup vector.
>
> BTW Maxim reported that it can break even without assigned devices.
>
> > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
>
> I agree that the hooks cannot be dropped but the bug is reproducible with
> this patch, where the hooks are still there.
...
> Still it does seem to be a race that happens when IS_RUNNING=true but
> vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to
> trigger because it moves IS_RUNNING=false later.
Oh! Any chance the bug only repros with preemption enabled? That would explain
why I don't see problems; I'm pretty sure I've only run AVIC with a PREEMPT=n kernel.
svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU,
avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> On 11/29/21 19:55, Sean Christopherson wrote:
> > > Still it does seem to be a race that happens when IS_RUNNING=true but
> > > vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to
> > > trigger because it moves IS_RUNNING=false later.
> >
> > Oh! Any chance the bug only repros with preemption enabled? That would explain
> > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.
>
> Me too.
>
> > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> > passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU,
> > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
>
> That would make a lot of sense. avic_vcpu_load() can handle
> svm->avic_is_running = false, but avic_set_running still needs its body
> wrapped by preempt_disable/preempt_enable.
>
> Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his
> own build so it would not surprise me if he used CONFIG_PREEMPT=y.
>
> Paolo
>
I will write all the details tomorrow, but I strongly suspect CPU erratum #1235:
https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
Basically what I see is that
1. vCPU2 disables is_running in avic physical id cache
2. vCPU2 checks that IRR is empty and it is
3. vCPU2 does schedule();
and it keeps on sleeping forever. If I kick it via a signal
(like just doing the 'info registers' qemu hmp command,
or just stop/cont on the same hmp interface), the
vCPU wakes up and notices that IRR suddenly is not empty,
and the VM comes back to life (and then hangs again after a while
with the same problem...).
As far as I can see in the traces, the bit in IRR came from
another vCPU which didn't respect the is_running bit and didn't get an
AVIC_INCOMPLETE_IPI VMexit.
I can't 100% prove it yet, but everything in the trace shows this.
About the rest of the environment: currently I reproduce this in
a VM which has no PCI passed-through devices at all, just AVIC.
(I wasn't able to reproduce it before just because I forgot to
enable AVIC in this configuration).
So I also agree that Sean's patch is not to blame here;
it just made the window between clearing is_running and going to sleep
shorter, and made it less likely that other vCPUs will pick up the is_running change.
(I suspect that they pick it up on the next vmrun, and that otherwise the value is somehow
wrongly cached in them.)
A very performance-killing workaround of kicking all vCPUs when one of them enters vcpu_block
does seem to work for me, but it throws all the timing off so I can't prove anything with it.
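The kick-everyone workaround would look roughly like the sketch below. This is not the actual
change Maxim tested; kvm_for_each_vcpu() and kvm_vcpu_kick() are the existing KVM helpers, while
the function itself and its call site are made up for illustration:

/* Sketch only: kick every other vCPU when one vCPU starts blocking. */
static void kick_all_other_vcpus(struct kvm_vcpu *blocking_vcpu)
{
	struct kvm *kvm = blocking_vcpu->kvm;
	struct kvm_vcpu *vcpu;
	unsigned long i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == blocking_vcpu)
			continue;
		/* Force an exit so the vCPU re-observes the physical ID table. */
		kvm_vcpu_kick(vcpu);
	}
}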
That is all, I will write more detailed info, including some traces I have.
I do use Windows 10 with the so-called LatencyMon in it, which shows overall how
much latency hardware interrupts have. That used to be useful for me to
ensure that my VMs are suitable for RT-like latency (even before I joined Red Hat,
I tuned my VMs as much as I could to make my Rift CV1 VR headset, which needs
RT-like latencies, work well.
These days VR works fine in my VMs anyway, but I still keep this tool around to keep an eye on it).
I really need to write a kvm-unit-test to stress test IPIs, especially this case;
I will do this very soon.
Wei Huang, any info on this would be very helpful.
Maybe putting the AVIC physical table in UC memory would help?
Maybe ringing the doorbells of all other vCPUs would help them notice the change?
Best regards,
Maxim Levitsky
On Mon, 2021-11-29 at 18:55 +0100, Paolo Bonzini wrote:
> On 11/29/21 18:25, Sean Christopherson wrote:
> > > If I apply though only the patch series up to this patch, my fedora VM seems
> > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > > in it, which doesn't happen without this patch.
> >
> > Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> > The only search results I can find for LatencyTop are Linux specific.
>
> I think it's LatencyMon, https://www.resplendence.com/latencymon.
>
> Paolo
>
Yes.
Best regards,
Maxim Levitsky
On 11/29/21 18:25, Sean Christopherson wrote:
> If a posted interrupt arrives after KVM has done its final search through the vIRR,
> but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
>
> I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> notification after switching to the wakeup vector.
BTW Maxim reported that it can break even without assigned devices.
> For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
I agree that the hooks cannot be dropped but the bug is reproducible
with this patch, where the hooks are still there.
With the hooks in place, you have:
kvm_vcpu_blocking(vcpu)
avic_set_running(vcpu, false)
avic_vcpu_put(vcpu)
avic_update_iommu_vcpu_affinity()
WRITE_ONCE(...) // clear IS_RUNNING bit
set_current_state()
smp_mb()
kvm_vcpu_check_block()
return kvm_arch_vcpu_runnable() || ...
return kvm_vcpu_has_events() || ...
return kvm_cpu_has_interrupt() || ...
return kvm_apic_has_interrupt() || ...
return apic_has_interrupt_for_ppr()
apic_find_highest_irr()
scan vIRR
This covers the barrier between the write of is_running and the read of
vIRR, and the other side should be correct as well. In particular,
reads of is_running always come after an atomic write to vIRR, and hence
after an implicit full memory barrier. svm_deliver_avic_intr() has an
smp_mb__after_atomic() after writing IRR; avic_kick_target_vcpus() even
has an explicit barrier in srcu_read_lock(), between the microcode's
write to vIRR and its own call to avic_vcpu_is_running().
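Stripped of the KVM specifics, the two sides pair up as a classic store-buffering pattern: each
side stores its flag, executes a full barrier, then loads the other side's flag, so at least one
of them must observe the other's store. A minimal standalone sketch (illustrative names only, not
the real code):

#include <stdbool.h>

static bool is_running = true;	/* the IsRunning bit */
static bool irr_bit;		/* an interrupt request bit in the vIRR */

#define smp_mb()	__atomic_thread_fence(__ATOMIC_SEQ_CST)

/* Blocking vCPU: clear IsRunning, then scan the vIRR. */
static bool blocker_sees_pending_irq(void)
{
	__atomic_store_n(&is_running, false, __ATOMIC_RELAXED);
	smp_mb();
	return __atomic_load_n(&irr_bit, __ATOMIC_RELAXED);
}

/* Sender (other vCPU / device): set the vIRR bit, then check IsRunning. */
static bool sender_sees_vcpu_running(void)
{
	__atomic_store_n(&irr_bit, true, __ATOMIC_RELAXED);
	smp_mb();	/* smp_mb__after_atomic() in the real code */
	return __atomic_load_n(&is_running, __ATOMIC_RELAXED);
}

/*
 * With both barriers in place it is impossible for the blocker to miss the
 * vIRR bit *and* for the sender to still see IsRunning set: at least one of
 * the two loads must observe the other side's store.
 */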
Still it does seem to be a race that happens when IS_RUNNING=true but
vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to
trigger because it moves IS_RUNNING=false later.
Paolo
On 10/26/21 18:12, Paolo Bonzini wrote:
> On 26/10/21 17:41, Marc Zyngier wrote:
>>> This needs a word on why kvm_psci_vcpu_suspend does not need the
>>> hooks. Or it needs to be changed to also use kvm_vcpu_wfi in the PSCI
>>> code, I don't know.
>>>
>>> Marc, can you review and/or advise?
>> I was looking at that over the weekend, and that's a pre-existing
>> bug. I would have addressed it independently, but it looks like you
>> already have queued the patch.
>
> I have "queued" it, but that's just my queue - it's not on kernel.org
> and it's not going to be in 5.16, at least not in the first batch.
>
> There's plenty of time for me to rebase on top of a fix, if you want to
> send the fix through your kvm-arm pull request. Just Cc me so that I
> understand what's going on.
Since a month has passed and I didn't see anything related in the
KVM-ARM pull requests, I am going to queue this patch. Any conflicts
can be resolved through a kvmarm->kvm merge of either a topic branch or
a tag that is destined to 5.16.
Paolo
On 2021-11-30 11:39, Paolo Bonzini wrote:
> On 10/26/21 18:12, Paolo Bonzini wrote:
>> On 26/10/21 17:41, Marc Zyngier wrote:
>>>> This needs a word on why kvm_psci_vcpu_suspend does not need the
>>>> hooks. Or it needs to be changed to also use kvm_vcpu_wfi in the
>>>> PSCI
>>>> code, I don't know.
>>>>
>>>> Marc, can you review and/or advise?
>>> I was looking at that over the weekend, and that's a pre-existing
>>> bug. I would have addressed it independently, but it looks like you
>>> already have queued the patch.
>>
>> I have "queued" it, but that's just my queue - it's not on kernel.org
>> and it's not going to be in 5.16, at least not in the first batch.
>>
>> There's plenty of time for me to rebase on top of a fix, if you want
>> to send the fix through your kvm-arm pull request. Just Cc me so that
>> I understand what's going on.
>
> Since a month has passed and I didn't see anything related in the
> KVM-ARM pull requests, I am going to queue this patch. Any conflicts
> can be resolved through a kvmarm->kvm merge of either a topic branch
> or a tag that is destined to 5.16.
Can you at least spell out *when* this will land?
There is, in general, a certain lack of clarity about what you are queuing,
where you are queuing it, and what release it targets.
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On 11/30/21 13:04, Marc Zyngier wrote:
>>>
>>> I have "queued" it, but that's just my queue - it's not on kernel.org
>>> and it's not going to be in 5.16, at least not in the first batch.
>>>
>>> There's plenty of time for me to rebase on top of a fix, if you want
>>> to send the fix through your kvm-arm pull request. Just Cc me so
>>> that I understand what's going on.
>>
>> Since a month has passed and I didn't see anything related in the
>> KVM-ARM pull requests, I am going to queue this patch. Any conflicts
>> can be resolved through a kvmarm->kvm merge of either a topic branch
>> or a tag that is destined to 5.16.
>
> Can you at least spell out *when* this will land?
It will be in kvm/next as soon as I finish running tests on it, which
may take a couple more days because I'm updating my machines to newer
operating systems.
> There is, in general, a certain lack of clarity about what you are queuing,
> where you are queuing it, and what release it targets.
Ok, thanks for the suggestion. Generally speaking:
- kvm/master is stuff that is merged and will be in the next -rc, right
now 5.16-rc4. It shouldn't ever rewind (though it may happen, it is rare)
- kvm/next is stuff that is merged and will be in the next merge window,
right now 5.17. It also shouldn't rewind.
- kvm/queue is stuff that the submitter shouldn't care about, and that
other people should only care about to check for conflicts. When I say
I "queued" a patch it goes in kvm/queue, and there's time to remove it
if something breaks.
Regarding this series:
- I am queuing it up to this patch
- I am queuing it to kvm/next, meaning it targets 5.17
- it looks like the next one (11/43) triggers a known AMD erratum, so I'm
holding off on the rest until we understand if it actually does, and if so
whether AMD AVIC is doomed. For the time being, it will stay in kvm/queue.
Paolo
On Tue, 2021-11-30 at 00:53 +0200, Maxim Levitsky wrote:
> On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> > On 11/29/21 19:55, Sean Christopherson wrote:
> > > > Still it does seem to be a race that happens when IS_RUNNING=true but
> > > > vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to
> > > > trigger because it moves IS_RUNNING=false later.
> > >
> > > Oh! Any chance the bug only repros with preemption enabled? That would explain
> > > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.
> >
> > Me too.
> >
> > > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> > > passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU,
> > > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
> >
> > That would make a lot of sense. avic_vcpu_load() can handle
> > svm->avic_is_running = false, but avic_set_running still needs its body
> > wrapped by preempt_disable/preempt_enable.
> >
> > Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his
> > own build so it would not surprise me if he used CONFIG_PREEMPT=y.
> >
> > Paolo
> >
>
> I will write ll the details tomorrow but I strongly suspect the CPU errata
> https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> #1235
>
> Basically what I see that
>
> 1. vCPU2 disables is_running in avic physical id cache
> 2. vCPU2 checks that IRR is empty and it is
> 3. vCPU2 does schedule();
>
> and it keeps on sleeping forever. If I kick it via signal
> (like just doing 'info registers' qemu hmp command
> or just stop/cont on the same hmp interface, the
> vCPU wakes up and notices that IRR suddenly is not empty,
> and the VM comes back to life (and then hangs after a while again
> with the same problem....).
>
> As far as I see in the traces, the bit in IRR came from
> another VCPU who didn't respect the ir_running bit and didn't get
> AVIC_INCOMPLETE_IPI VMexit.
> I can't 100% prove it yet, but everything in the trace shows this.
>
> About the rest of the environment, currently I reproduce this in
> a VM which has no pci passed through devices at all, just AVIC.
> (I wasn't able to reproduce it before just because I forgot to
> enable AVIC in this configuration).
>
> So I also agree that Sean's patch is not to blame here,
> it just made the window between setting is_running and getting to sleep
> shorter and made it less likely that other vCPUs will pick up the is_running change.
> (I suspect that they pick it up on next vmrun, and otherwise the value is somehow
> cached wrongfully in them).
>
> A very performance killing workaround of kicking all vCPUs when one of them enters vcpu_block
> does seem to work for me but it skews all the timing off so I can't prove it.
>
> That is all, I will write more detailed info, including some traces I have.
>
> I do use windows 10 with so called LatencyMon in it, which shows overall how
> much latency hardware interrupts have, which used to be useful for me to
> ensure that my VMs are suitable for RT like latency (even before I joined RedHat,
> I tuned my VMs as much as I could to make my Rift CV1 VR headset work well which
> needs RT like latencies.
>
> These days VR works fine in my VMs anyway, but I still kept this tool to keep an eye on it).
>
> I really need to write a kvm unit test to stress test IPIs, especially this case,
> I will do this very soon.
>
>
> Wei Huang, any info on this would be very helpful.
>
> Maybe putting the avic physical table in UC memory would help?
> Maybe ringing doorbells of all other vcpus will help them notice the change?
>
> Best regards,
> Maxim Levitsky
Hi!
I am now almost sure that this is erratum #1235.
I have attached a kvm-unit-test I wrote (a patch against master of https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git/)
which is able to reproduce the issue on a stock 5.15.0 kernel (*no patches applied at all*) after just a few seconds,
if kvm is loaded without halt polling (that is, with halt_poll_ns=0).
Halt polling and/or Sean's patch are not to blame; they just change the timing.
With Sean's patch I don't need to disable halt polling.
I did find a few AVIC inhibition bugs that this test also hits, and to make it work before I fix them,
I added a workaround so the test doesn't trip over them.
I'll send patches to fix those very soon.
Note that in the Windows VM there were no AVIC inhibitions, so those bugs are not relevant.
Wei Huang, do you know if this issue is fixed on Zen3, and if it is fixed on some Zen2 machines?
Any workarounds other than 'don't use AVIC'?
Best regards,
Maxim Levitsky
On Thu, Dec 02, 2021, Maxim Levitsky wrote:
> On Tue, 2021-11-30 at 00:53 +0200, Maxim Levitsky wrote:
> > On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> > Basically what I see that
> >
> > 1. vCPU2 disables is_running in avic physical id cache
> > 2. vCPU2 checks that IRR is empty and it is
> > 3. vCPU2 does schedule();
> >
> > and it keeps on sleeping forever. If I kick it via signal
> > (like just doing 'info registers' qemu hmp command
> > or just stop/cont on the same hmp interface, the
> > vCPU wakes up and notices that IRR suddenly is not empty,
> > and the VM comes back to life (and then hangs after a while again
> > with the same problem....).
> >
> > As far as I see in the traces, the bit in IRR came from
> > another VCPU who didn't respect the ir_running bit and didn't get
> > AVIC_INCOMPLETE_IPI VMexit.
> > I can't 100% prove it yet, but everything in the trace shows this.
...
> I am now almost sure that this is errata #1235.
>
> I had attached a kvm-unit-test I wrote (patch against master of
> https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git/) which is able to
> reproduce the issue on stock 5.15.0 kernel (*no patches applied at all*)
> after just few seconds. If kvm is loaded without halt-polling (that is
> halt_poll_ns=0 is used).
>
> Halt polling and/or Sean's patch are not to blame, it just changes timeing.
> With Sean's patch I don't need to disable half polling.
Hmm, that suggests the bug/erratum is due to the CPU consuming stale data from #4
for the IsRunning check in #5, or retiring uops for the IsRunning check before
retiring the vIRR update. It would be helpful if the erratum actually provided
info on the "highly specific and detailed set of internal timing conditions". :-/
4. Lookup the vAPIC backing page address in the Physical APIC table using the
guest physical APIC ID as an index into the table.
5. For every valid destination:
- Atomically set the appropriate IRR bit in each of the destinations’ vAPIC
backing page.
- Check the IsRunning status of each destination.
On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote:
> On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> > (This thing is that when you tell the IOMMU that a vCPU is not running,
> > Another thing I discovered that this patch series totally breaks my VMs,
> > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> > fedora32 VM be very laggy, almost unusable, and it only has one
> > passed-through device, a nic).
>
> Grrrr, the complete lack of comments in the KVM code and the separate paths for
> VMX vs SVM when handling HLT with APICv make this all way for difficult to
> understand than it should be.
>
> The hangs are likely due to:
>
> KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)
>
> If a posted interrupt arrives after KVM has done its final search through the vIRR,
> but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
>
> I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> notification after switching to the wakeup vector.
>
> For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
> would have to rewalk the vIRR), no easy to signal the current CPU to do wakeup (I
> don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
> no simplification of load/put code.
I have an idea.
Why do we even use/need the GA log?
Why not just disable 'guest mode' in the IOMMU and let it send a good old normal interrupt
when a vCPU is not running, just like we do when we inhibit AVIC?
The GA log makes all devices that share an IOMMU (there are 4 IOMMUs per package these days,
some without useful devices) go through a single (!) MSI-like interrupt,
which for some reason is even implemented by a threaded IRQ in the Linux kernel.
Best regards,
Maxim Levitsky
>
> If the scheduler were changed to support waking in the sched_out path, then I'd be
> more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
> time, but for now it's not worth it.
>
> > If I apply though only the patch series up to this patch, my fedora VM seems
> > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > in it, which doesn't happen without this patch.
>
> Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> The only search results I can find for LatencyTop are Linux specific.
>
> > So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt
> > (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to
> > have different values), and IRR has plenty of interrupts with lower priority.
> > The VM seems to be stuck in this case. As if its EOI got lost or something is
> > preventing the IRQ handler from issuing EOI.
> >
> > LatencyTop does install some form of a kernel driver which likely does meddle
> > with interrupts (maybe it sends lots of self IPIs?).
> >
> > 100% reproducible as soon as I start monitoring with LatencyTop.
> >
> > Without this patch it works (or if disabling halt polling),
>
> Huh. I assume everything works if you disable halt polling _without_ this patch
> applied?
>
> If so, that implies that successful halt polling without mucking with vCPU IOMMU
> affinity is somehow problematic. I can't think of any relevant side effects other
> than timing.
>
On 12/2/21 03:00, Sean Christopherson wrote:
> Hmm, that suggests the bug/erratum is due to the CPU consuming stale data from #4
> for the IsRunning check in #5, or retiring uops for the IsRunning check before
> retiring the vIRR update.
Yes, this seems to be an error in the implementation of step 5. In
assembly, atomic operations have implicit memory barriers, but who knows
what's going on in microcode. So either it's the former, or something
is going on that's specific to the microcode sequencer, or it's a more
mundane implementation bug.
In any case, AVIC is disabled for now and will need a list of models
where it works, so I'll go on and queue the first part of this series.
Paolo
> It would be helpful if the erratum actually provided
> info on the "highly specific and detailed set of internal timing conditions". :-/
>
> 4. Lookup the vAPIC backing page address in the Physical APIC table using the
> guest physical APIC ID as an index into the table.
> 5. For every valid destination:
> - Atomically set the appropriate IRR bit in each of the destinations’ vAPIC
> backing page.
> - Check the IsRunning status of each destination.
On Thu, 2021-12-02 at 12:20 +0200, Maxim Levitsky wrote:
> On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote:
> > On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> > > (This thing is that when you tell the IOMMU that a vCPU is not running,
> > > Another thing I discovered that this patch series totally breaks my VMs,
> > > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> > > fedora32 VM be very laggy, almost unusable, and it only has one
> > > passed-through device, a nic).
> >
> > Grrrr, the complete lack of comments in the KVM code and the separate paths for
> > VMX vs SVM when handling HLT with APICv make this all way for difficult to
> > understand than it should be.
> >
> > The hangs are likely due to:
> >
> > KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)
> >
> > If a posted interrupt arrives after KVM has done its final search through the vIRR,
> > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> >
> > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> > notification after switching to the wakeup vector.
> >
> > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> > Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
> > would have to rewalk the vIRR), no easy to signal the current CPU to do wakeup (I
> > don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
> > no simplification of load/put code.
>
> I have an idea.
>
> Why do we even use/need the GA log?
> Why not, just disable the 'guest mode' in the iommu and let it sent good old normal interrupt
> when a vCPU is not running, just like we do when we inhibit the AVIC?
>
> GA log makes all devices that share an iommu (there are 4 iommus per package these days,
> some without useful devices) go through a single (!) msi like interrupt,
> which is even for some reason implemented by a threaded IRQ in the linux kernel.
Yep, this gross hack works!
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 958966276d00b8..6136b94f6b5f5e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -987,8 +987,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
- avic_update_iommu_vcpu_affinity(vcpu, h_physical_id,
- svm->avic_is_running);
+
+ svm_set_pi_irte_mode(vcpu, svm->avic_is_running);
+ avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
}
void avic_vcpu_put(struct kvm_vcpu *vcpu)
@@ -997,8 +998,9 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
entry = READ_ONCE(*(svm->avic_physical_id_cache));
- if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
- avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+ if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) {
+ svm_set_pi_irte_mode(vcpu, false);
+ }
entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
>
GA log interrupts are almost gone (there are still a few because svm_set_pi_irte_mode() sets is_running to false);
devices work as expected, sending normal interrupts unless the guest is loaded, at which point normal interrupts disappear,
as expected.
Best regards,
Maxim Levitsky
>
> Best regards,
> Maxim Levitsky
>
> > If the scheduler were changed to support waking in the sched_out path, then I'd be
> > more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
> > time, but for now it's not worth it.
> >
> > > If I apply though only the patch series up to this patch, my fedora VM seems
> > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > > in it, which doesn't happen without this patch.
> >
> > Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> > The only search results I can find for LatencyTop are Linux specific.
> >
> > > So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt
> > > (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to
> > > have different values), and IRR has plenty of interrupts with lower priority.
> > > The VM seems to be stuck in this case. As if its EOI got lost or something is
> > > preventing the IRQ handler from issuing EOI.
> > >
> > > LatencyTop does install some form of a kernel driver which likely does meddle
> > > with interrupts (maybe it sends lots of self IPIs?).
> > >
> > > 100% reproducible as soon as I start monitoring with LatencyTop.
> > >
> > > Without this patch it works (or if disabling halt polling),
> >
> > Huh. I assume everything works if you disable halt polling _without_ this patch
> > applied?
> >
> > If so, that implies that successful halt polling without mucking with vCPU IOMMU
> > affinity is somehow problematic. I can't think of any relevant side effects other
> > than timing.
> >
On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote:
> On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> > (This thing is that when you tell the IOMMU that a vCPU is not running,
> > Another thing I discovered that this patch series totally breaks my VMs,
> > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> > fedora32 VM be very laggy, almost unusable, and it only has one
> > passed-through device, a nic).
>
> Grrrr, the complete lack of comments in the KVM code and the separate paths for
> VMX vs SVM when handling HLT with APICv make this all way for difficult to
> understand than it should be.
>
> The hangs are likely due to:
>
> KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)
Yes, the other hang I told you about, which makes all my VMs very laggy and almost impossible
to use, is because of the above patch, but since I have now reproduced it again without
any passed-through devices, I also blame the CPU erratum for this.
Best regards,
Maxim Levitsky
>
> If a posted interrupt arrives after KVM has done its final search through the vIRR,
> but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
>
> I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> notification after switching to the wakeup vector.
>
> For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
> would have to rewalk the vIRR), no easy to signal the current CPU to do wakeup (I
> don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
> no simplification of load/put code.
>
> If the scheduler were changed to support waking in the sched_out path, then I'd be
> more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
> time, but for now it's not worth it.
>
> > If I apply though only the patch series up to this patch, my fedora VM seems
> > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > in it, which doesn't happen without this patch.
>
> Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> The only search results I can find for LatencyTop are Linux specific.
>
> > So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt
> > (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to
> > have different values), and IRR has plenty of interrupts with lower priority.
> > The VM seems to be stuck in this case. As if its EOI got lost or something is
> > preventing the IRQ handler from issuing EOI.
> >
> > LatencyTop does install some form of a kernel driver which likely does meddle
> > with interrupts (maybe it sends lots of self IPIs?).
> >
> > 100% reproducible as soon as I start monitoring with LatencyTop.
> >
> > Without this patch it works (or if disabling halt polling),
>
> Huh. I assume everything works if you disable halt polling _without_ this patch
> applied?
>
> If so, that implies that successful halt polling without mucking with vCPU IOMMU
> affinity is somehow problematic. I can't think of any relevant side effects other
> than timing.
>