When vCPU enters block phase, pi_pre_block() inserts vCPU to a per pCPU
linked list of all vCPUs that are blocked on this pCPU. Afterwards, it
changes PID.NV to POSTED_INTR_WAKEUP_VECTOR which its handler
(wakeup_handler()) is responsible to kick (unblock) any vCPU on that
linked list that now has pending posted interrupts.
While vCPU is blocked (in kvm_vcpu_block()), it may be preempted which
will cause vmx_vcpu_pi_put() to set PID.SN. If later the vCPU will be
scheduled to run on a different pCPU, vmx_vcpu_pi_load() will clear
PID.SN but will also *overwrite PID.NDST to this different pCPU*.
Instead of keeping it with original pCPU which vCPU had entered block
phase on.
This results in an issue because when a posted interrupt is delivered, as
the wakeup_handler() will be executed and fail to find blocked vCPU on
its per pCPU linked list of all vCPUs that are blocked on this pCPU.
Which is due to the vCPU being placed on a *different* per pCPU
linked list i.e. the original pCPU in which it entered block phase.
The regression is introduced by commit c112b5f50232 ("KVM: x86:
Recompute PID.ON when clearing PID.SN"). Therefore, partially revert
it and reintroduce the condition in vmx_vcpu_pi_load() responsible for
avoiding changing PID.NDST when loading a blocked vCPU.
Fixes: c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN")
Tested-by: Nathan Ni <[email protected]>
Co-developed-by: Liran Alon <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
v2:
* Fixed wrong Sob chain;
* Add missing Tb;
---
arch/x86/kvm/vmx/vmx.c | 14 ++++++++++++++
arch/x86/kvm/vmx/vmx.h | 6 ++++++
2 files changed, 20 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4c7d2935f7ec..ccd06fdfbb76 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1274,6 +1274,18 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
if (!pi_test_sn(pi_desc) && vcpu->cpu == cpu)
return;
+ /*
+ * If the 'nv' field is POSTED_INTR_WAKEUP_VECTOR, do not change
+ * PI.NDST: pi_post_block is the one expected to change PID.NDST and the
+ * wakeup handler expects the vCPU to be on the blocked_vcpu_list that
+ * matches PI.NDST. Otherwise, a vcpu may not be able to be woken up
+ * correctly.
+ */
+ if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR || vcpu->cpu == cpu) {
+ pi_clear_sn(pi_desc);
+ goto after_clear_sn;
+ }
+
/* The full case. */
do {
old.control = new.control = pi_desc->control;
@@ -1289,6 +1301,8 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
} while (cmpxchg64(&pi_desc->control, old.control,
new.control) != old.control);
+after_clear_sn:
+
/*
* Clear SN before reading the bitmap. The VT-d firmware
* writes the bitmap and reads SN atomically (5.2.3 in the
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bee16687dc0b..1e32ab54fc2d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -373,6 +373,12 @@ static inline void pi_clear_on(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
}
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+ clear_bit(POSTED_INTR_SN,
+ (unsigned long *)&pi_desc->control);
+}
+
static inline int pi_test_on(struct pi_desc *pi_desc)
{
return test_bit(POSTED_INTR_ON,
--
2.11.0
On Tue, 12 Nov 2019 at 01:23, Joao Martins <[email protected]> wrote:
>
> While vCPU is blocked (in kvm_vcpu_block()), it may be preempted which
> will cause vmx_vcpu_pi_put() to set PID.SN. If later the vCPU will be
How can this happen? See the prepare_to_swait_exlusive() in
kvm_vcpu_block(), the task will be set in TASK_INTERRUPTIBLE state,
kvm_sched_out will set vcpu->preempted to true iff current->state ==
TASK_RUNNING.
Wanpeng
On 11/19/19 11:36 AM, Wanpeng Li wrote:
> On Tue, 12 Nov 2019 at 01:23, Joao Martins <[email protected]> wrote:
>>
>> While vCPU is blocked (in kvm_vcpu_block()), it may be preempted which
>> will cause vmx_vcpu_pi_put() to set PID.SN. If later the vCPU will be
>
> How can this happen? See the prepare_to_swait_exlusive() in
> kvm_vcpu_block(), the task will be set in TASK_INTERRUPTIBLE state,
> kvm_sched_out will set vcpu->preempted to true iff current->state ==
> TASK_RUNNING.
You're right.
But preemption (thus setting PID.SN) can still happen before vcpu_block(), or
otherwise scheduled out if we are already in vcpu_block() (with task in
TASK_INTERRUPTIBLE). The right term we should have written in that sentence
above would have been 'scheduled out' and drop the vcpu_block mention and it
would encompass both cases. A better sentence would perhaps be:
"While vCPU is blocked (or about to block) it may be scheduled out which will
cause vmx_vcpu_pi_put() to be called."
But setting or not preempted/PID.SN doesn't change the rest and was mentioned
for completeness.
Joao