PIN (Posted interrupt notification) is useless to the host as KVM always syncs
pending guest interrupts in PID to guest's vAPIC before each VM entry. In
fact, sending PINs to a CPU running in host will lead to additional
overhead due to interrupt handling.
Currently, software path, vmx_deliver_posted_interrupt(), is optimized to
issue PINs only if target vCPU is in IN_GUEST_MODE. But hardware paths
(VT-d and Intel IPI virtualization) aren't optimized.
Set PID.SN right after VM exits and clear it before VM entry to minimize
the chance of hardware issuing PINs to a CPU when it's in host.
Also honour PID.SN bit in vmx_deliver_posted_interrupt().
When IPI virtualization is enabled, this patch improves "perf bench" [*]
by 4%, from 8.12 us/ops to 7.80 us/ops.
[*] test cmd: perf bench sched pipe -T. Note that we change the source
code to pin two threads to two different vCPUs so that it can reproduce
stable results.
Signed-off-by: Chao Gao <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 28 ++--------------------------
arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
2 files changed, 25 insertions(+), 27 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 237a1f40f939..a0458f72df99 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -70,12 +70,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
* needs to be changed.
*/
if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
- /*
- * Clear SN if it was set due to being preempted. Again, do
- * this even if there is no assigned device for simplicity.
- */
- if (pi_test_and_clear_sn(pi_desc))
- goto after_clear_sn;
return;
}
@@ -99,12 +93,8 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
do {
old.control = new.control = READ_ONCE(pi_desc->control);
- /*
- * Clear SN (as above) and refresh the destination APIC ID to
- * handle task migration (@cpu != vcpu->cpu).
- */
new.ndst = dest;
- new.sn = 0;
+ new.sn = 1;
/*
* Restore the notification vector; in the blocking case, the
@@ -114,19 +104,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
} while (pi_try_set_control(pi_desc, old.control, new.control));
local_irq_restore(flags);
-
-after_clear_sn:
-
- /*
- * Clear SN before reading the bitmap. The VT-d firmware
- * writes the bitmap and reads SN atomically (5.2.3 in the
- * spec), so it doesn't really have a memory barrier that
- * pairs with this, but we cannot do that and we need one.
- */
- smp_mb__after_atomic();
-
- if (!pi_is_pir_empty(pi_desc))
- pi_set_on(pi_desc);
}
static bool vmx_can_use_vtd_pi(struct kvm *kvm)
@@ -154,13 +131,12 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
&per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
-
do {
old.control = new.control = READ_ONCE(pi_desc->control);
/* set 'NV' to 'wakeup vector' */
new.nv = POSTED_INTR_WAKEUP_VECTOR;
+ new.sn = 0;
} while (pi_try_set_control(pi_desc, old.control, new.control));
/*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a3c5504601a8..fa915b1680eb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4036,6 +4036,9 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
if (pi_test_and_set_pir(vector, &vmx->pi_desc))
return 0;
+ if (pi_test_sn(&vmx->pi_desc))
+ return 0;
+
/* If a previous notification has sent the IPI, nothing to do. */
if (pi_test_and_set_on(&vmx->pi_desc))
return 0;
@@ -6520,8 +6523,17 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
return -EIO;
- if (pi_test_on(&vmx->pi_desc)) {
+ if (pi_test_on(&vmx->pi_desc) || pi_test_sn(&vmx->pi_desc)) {
pi_clear_on(&vmx->pi_desc);
+
+ /*
+ * IN_GUEST_MODE means we are about to enter vCPU. Allow
+ * PIN (posted interrupt notification) to deliver is key
+ * to interrupt posting. Clear PID.SN.
+ */
+ if (vcpu->mode == IN_GUEST_MODE)
+ pi_clear_sn(&vmx->pi_desc);
+
/*
* IOMMU can write to PID.ON, so the barrier matters even on UP.
* But on x86 this is just a compiler barrier anyway.
@@ -6976,6 +6988,16 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
/* The actual VMENTER/EXIT is in the .noinstr.text section. */
vmx_vcpu_enter_exit(vcpu, vmx);
+ /*
+ * Suppress notification right after VM exits to minimize the
+ * window where VT-d or remote CPU may send a useless notification
+ * when posting interrupts to a VM. Note that the notification is
+ * useless because KVM syncs pending interrupts in PID.IRR to vAPIC
+ * IRR before VM entry.
+ */
+ if (kvm_vcpu_apicv_active(vcpu))
+ pi_set_sn(&vmx->pi_desc);
+
/*
* We do not use IBRS in the kernel. If this vCPU has used the
* SPEC_CTRL MSR it may have left it on; save the value and
--
2.25.1
> From: Chao Gao <[email protected]>
> Sent: Friday, June 17, 2022 7:47 PM
>
> PIN (Posted interrupt notification) is useless to host as KVM always syncs
> pending guest interrupts in PID to guest's vAPIC before each VM entry. In
> fact, Sending PINs to a CPU running in host will lead to additional
> overhead due to interrupt handling.
>
> Currently, software path, vmx_deliver_posted_interrupt(), is optimized to
> issue PINs only if target vCPU is in IN_GUEST_MODE. But hardware paths
> (VT-d and Intel IPI virtualization) aren't optimized.
>
> Set PID.SN right after VM exits and clear it before VM entry to minimize
> the chance of hardware issuing PINs to a CPU when it's in host.
>
> Also honour PID.SN bit in vmx_deliver_posted_interrupt().
>
> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> by 4% from 8.12 us/ops to 7.80 us/ops.
>
> [*] test cmd: perf bench sched pipe -T. Note that we change the source
> code to pin two threads to two different vCPUs so that it can reproduce
> stable results.
>
> Signed-off-by: Chao Gao <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 28 ++--------------------------
> arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
> 2 files changed, 25 insertions(+), 27 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c
> b/arch/x86/kvm/vmx/posted_intr.c
> index 237a1f40f939..a0458f72df99 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -70,12 +70,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int
> cpu)
> * needs to be changed.
> */
> if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu ==
> cpu) {
> - /*
> - * Clear SN if it was set due to being preempted. Again, do
> - * this even if there is no assigned device for simplicity.
> - */
> - if (pi_test_and_clear_sn(pi_desc))
> - goto after_clear_sn;
> return;
> }
>
> @@ -99,12 +93,8 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int
> cpu)
> do {
> old.control = new.control = READ_ONCE(pi_desc->control);
>
> - /*
> - * Clear SN (as above) and refresh the destination APIC ID to
> - * handle task migration (@cpu != vcpu->cpu).
> - */
> new.ndst = dest;
> - new.sn = 0;
> + new.sn = 1;
A comment is appreciated.
>
> /*
> * Restore the notification vector; in the blocking case, the
> @@ -114,19 +104,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int
> cpu)
> } while (pi_try_set_control(pi_desc, old.control, new.control));
>
> local_irq_restore(flags);
> -
> -after_clear_sn:
> -
> - /*
> - * Clear SN before reading the bitmap. The VT-d firmware
> - * writes the bitmap and reads SN atomically (5.2.3 in the
> - * spec), so it doesn't really have a memory barrier that
> - * pairs with this, but we cannot do that and we need one.
> - */
> - smp_mb__after_atomic();
> -
> - if (!pi_is_pir_empty(pi_desc))
> - pi_set_on(pi_desc);
> }
>
> static bool vmx_can_use_vtd_pi(struct kvm *kvm)
> @@ -154,13 +131,12 @@ static void pi_enable_wakeup_handler(struct
> kvm_vcpu *vcpu)
> &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
> raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu-
> >cpu));
>
> - WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
> -
> do {
> old.control = new.control = READ_ONCE(pi_desc->control);
>
> /* set 'NV' to 'wakeup vector' */
> new.nv = POSTED_INTR_WAKEUP_VECTOR;
> + new.sn = 0;
> } while (pi_try_set_control(pi_desc, old.control, new.control));
>
there is a problem a few lines downwards:
/*
* Send a wakeup IPI to this CPU if an interrupt may have been posted
* before the notification vector was updated, in which case the IRQ
* will arrive on the non-wakeup vector. An IPI is needed as calling
* try_to_wake_up() from ->sched_out() isn't allowed (IRQs are not
* enabled until it is safe to call try_to_wake_up() on the task being
* scheduled out).
*/
if (pi_test_on(&new))
apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
'on' is not set when SN is set. This is different from the original assumption,
which has SN cleared in the above window. In this case pi_test_on()
should be replaced with !pi_is_pir_empty().
There is another simplification possible in vmx_vcpu_pi_put():
if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
pi_enable_wakeup_handler(vcpu);
/*
* Set SN when the vCPU is preempted. Note, the vCPU can both be seen
* as blocking and preempted, e.g. if it's preempted between setting
* its wait state and manually scheduling out.
*/
if (vcpu->preempted)
pi_set_sn(pi_desc);
With this patch 'sn' is already set when a runnable vcpu is preempted
hence above is required only for a blocking vcpu. And in the
blocking case if the notification is anyway suppressed it's pointless to
further change the notification vector. Then it could be simplified as:
if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) &&
!vmx_interrupt_blocked(vcpu))
pi_enable_wakeup_handler(vcpu);
> /*
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a3c5504601a8..fa915b1680eb 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4036,6 +4036,9 @@ static int vmx_deliver_posted_interrupt(struct
> kvm_vcpu *vcpu, int vector)
> if (pi_test_and_set_pir(vector, &vmx->pi_desc))
> return 0;
>
> + if (pi_test_sn(&vmx->pi_desc))
> + return 0;
> +
> /* If a previous notification has sent the IPI, nothing to do. */
> if (pi_test_and_set_on(&vmx->pi_desc))
> return 0;
> @@ -6520,8 +6523,17 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu
> *vcpu)
> if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
> return -EIO;
>
> - if (pi_test_on(&vmx->pi_desc)) {
> + if (pi_test_on(&vmx->pi_desc) || pi_test_sn(&vmx->pi_desc)) {
this has a potential side-effect in the vmexit/vmentry path where pi_desc is
always scanned now. While reducing interrupts helps the IPC test case,
do you observe any impact on other scenarios where interrupts are not
the main cause of vmexits?
> pi_clear_on(&vmx->pi_desc);
> +
> + /*
> + * IN_GUEST_MODE means we are about to enter vCPU.
> Allow
> + * PIN (posted interrupt notification) to deliver is key
> + * to interrupt posting. Clear PID.SN.
> + */
> + if (vcpu->mode == IN_GUEST_MODE)
> + pi_clear_sn(&vmx->pi_desc);
I wonder whether it makes more sense to have 'sn' closely sync-ed
with vcpu->mode, e.g. having a kvm_x86_set_vcpu_mode() ops
to translate vcpu->mode into vmx/svm specific hardware bits like
'sn' here. Then call it in common place when vcpu->mode is changed.
> +
> /*
> * IOMMU can write to PID.ON, so the barrier matters even
> on UP.
> * But on x86 this is just a compiler barrier anyway.
> @@ -6976,6 +6988,16 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu
> *vcpu)
> /* The actual VMENTER/EXIT is in the .noinstr.text section. */
> vmx_vcpu_enter_exit(vcpu, vmx);
>
> + /*
> + * Suppress notification right after VM exits to minimize the
> + * window where VT-d or remote CPU may send a useless notification
> + * when posting interrupts to a VM. Note that the notification is
> + * useless because KVM syncs pending interrupts in PID.IRR to vAPIC
> + * IRR before VM entry.
> + */
> + if (kvm_vcpu_apicv_active(vcpu))
> + pi_set_sn(&vmx->pi_desc);
> +
> /*
> * We do not use IBRS in the kernel. If this vCPU has used the
> * SPEC_CTRL MSR it may have left it on; save the value and
> --
> 2.25.1
On Tue, Jun 21, 2022 at 02:59:10PM +0800, Tian, Kevin wrote:
>> From: Chao Gao <[email protected]>
>> Sent: Friday, June 17, 2022 7:47 PM
>>
>> PIN (Posted interrupt notification) is useless to host as KVM always syncs
>> pending guest interrupts in PID to guest's vAPIC before each VM entry. In
>> fact, Sending PINs to a CPU running in host will lead to additional
>> overhead due to interrupt handling.
>>
>> Currently, software path, vmx_deliver_posted_interrupt(), is optimized to
>> issue PINs only if target vCPU is in IN_GUEST_MODE. But hardware paths
>> (VT-d and Intel IPI virtualization) aren't optimized.
>>
>> Set PID.SN right after VM exits and clear it before VM entry to minimize
>> the chance of hardware issuing PINs to a CPU when it's in host.
>>
>> Also honour PID.SN bit in vmx_deliver_posted_interrupt().
>>
>> When IPI virtualization is enabled, this patch increases "perf bench" [*]
>> by 4% from 8.12 us/ops to 7.80 us/ops.
>>
>> [*] test cmd: perf bench sched pipe -T. Note that we change the source
>> code to pin two threads to two different vCPUs so that it can reproduce
>> stable results.
>>
>> Signed-off-by: Chao Gao <[email protected]>
>> ---
>> arch/x86/kvm/vmx/posted_intr.c | 28 ++--------------------------
>> arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
>> 2 files changed, 25 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/posted_intr.c
>> b/arch/x86/kvm/vmx/posted_intr.c
>> index 237a1f40f939..a0458f72df99 100644
>> --- a/arch/x86/kvm/vmx/posted_intr.c
>> +++ b/arch/x86/kvm/vmx/posted_intr.c
>> @@ -70,12 +70,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int
>> cpu)
>> * needs to be changed.
>> */
>> if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu ==
>> cpu) {
>> - /*
>> - * Clear SN if it was set due to being preempted. Again, do
>> - * this even if there is no assigned device for simplicity.
>> - */
>> - if (pi_test_and_clear_sn(pi_desc))
>> - goto after_clear_sn;
>> return;
>> }
>>
>> @@ -99,12 +93,8 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int
>> cpu)
>> do {
>> old.control = new.control = READ_ONCE(pi_desc->control);
>>
>> - /*
>> - * Clear SN (as above) and refresh the destination APIC ID to
>> - * handle task migration (@cpu != vcpu->cpu).
>> - */
>> new.ndst = dest;
>> - new.sn = 0;
>> + new.sn = 1;
>
>A comment is appreciated.
Will add a comment like:
/*
* Set SN and refresh the destination APIC ID to handle task migration
* (@cpu != vcpu->cpu).
*
 * SN was cleared when a vCPU goes to the blocked state so that KVM can wake
 * up the blocked vCPU on receiving a notification. For a running/runnable
 * vCPU, such notifications are useless. Set SN bit to suppress them.
*/
>there is a problem a few lines downwards:
>
> /*
> * Send a wakeup IPI to this CPU if an interrupt may have been posted
> * before the notification vector was updated, in which case the IRQ
> * will arrive on the non-wakeup vector. An IPI is needed as calling
> * try_to_wake_up() from ->sched_out() isn't allowed (IRQs are not
> * enabled until it is safe to call try_to_wake_up() on the task being
> * scheduled out).
> */
> if (pi_test_on(&new))
> apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
>
>'on' is not set when SN is set. This is different from original assumption
>which has SN cleared in above window. In this case pi_test_on()
>should be replaced with pi_is_pir_empty().
Nice catch. Will fix it.
>
>There is another simplification possible in vmx_vcpu_pi_put():
>
> if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> pi_enable_wakeup_handler(vcpu);
>
> /*
> * Set SN when the vCPU is preempted. Note, the vCPU can both be seen
> * as blocking and preempted, e.g. if it's preempted between setting
> * its wait state and manually scheduling out.
> */
> if (vcpu->preempted)
> pi_set_sn(pi_desc);
>
>With this patch 'sn' is already set when a runnable vcpu is preempted
>hence above is required only for a blocking vcpu. And in the
There are two corner cases to be fixed before we can assume this.
Currently, this patch clears SN bit in vmx_deliver_posted_interrupt().
If kvm decides to cancel vCPU entry (i.e., bail out) after one of two call sites
of it in vcpu_enter_guest(), SN is cleared for a running vCPU.
I was thinking a running vCPU with SN cleared won't have any functional
issue. So I didn't add a new static_call for SN bit and manipulate it
when KVM decides to cancel vCPU entry.
>blocking case if the notification is anyway suppressed it's pointless to
>further change the notification vector. Then it could be simplified as:
>
> if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) &&
> !vmx_interrupt_blocked(vcpu))
> pi_enable_wakeup_handler(vcpu);
>
>> /*
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index a3c5504601a8..fa915b1680eb 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -4036,6 +4036,9 @@ static int vmx_deliver_posted_interrupt(struct
>> kvm_vcpu *vcpu, int vector)
>> if (pi_test_and_set_pir(vector, &vmx->pi_desc))
>> return 0;
>>
>> + if (pi_test_sn(&vmx->pi_desc))
>> + return 0;
>> +
>> /* If a previous notification has sent the IPI, nothing to do. */
>> if (pi_test_and_set_on(&vmx->pi_desc))
>> return 0;
>> @@ -6520,8 +6523,17 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu
>> *vcpu)
>> if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
>> return -EIO;
>>
>> - if (pi_test_on(&vmx->pi_desc)) {
>> + if (pi_test_on(&vmx->pi_desc) || pi_test_sn(&vmx->pi_desc)) {
>
>this has potential side-effect in vmexit/vmentry path where pi_desc is
>always scanned now. While reducing interrupts help the IPC test case,
>do you observe any impact on other scenarios where interrupts are not
>the main cause of vmexits?
Good question. I will run cpuid loop tests to measure the impact.
SN/ON/IRR are in the same cacheline, so I expect the impact would be
negligible.
>
>> pi_clear_on(&vmx->pi_desc);
>> +
>> + /*
>> + * IN_GUEST_MODE means we are about to enter vCPU.
>> Allow
>> + * PIN (posted interrupt notification) to deliver is key
>> + * to interrupt posting. Clear PID.SN.
>> + */
>> + if (vcpu->mode == IN_GUEST_MODE)
>> + pi_clear_sn(&vmx->pi_desc);
>
>I wonder whether it makes more sense to have 'sn' closely sync-ed
>with vcpu->mode, e.g. having a kvm_x86_set_vcpu_mode() ops
>to translate vcpu->mode into vmx/svm specific hardware bits like
>'sn' here. Then call it in common place when vcpu->mode is changed.
It makes sense to me because it can fix the two corner cases described above.