Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
updating IRTE routing. Not setting the pCPU means the IOMMU will signal
the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
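The gist of the fix (patch 2) is to consult the vCPU's Physical ID entry when
an IRTE is updated and, if the vCPU happens to be running, point the IOMMU at
its current pCPU right away instead of waiting for the next load. Very
roughly, the idea is something like the below (an untested sketch of the
concept only, not the literal diff; names are the existing AVIC masks and the
per-IRTE data handed to the update path, see patch 2 for the real code):

	/* In the IRTE update path, with ir_list_lock held (see patch 1). */
	entry = READ_ONCE(*(svm->avic_physical_id_cache));
	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
				    true, pi->ir_data);
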
I waffled for far too long between making this one patch or two. Moving
the lock doesn't make all that much sense as a standalone patch, but in the
end, I decided that isolating the locking change would be useful in the
unlikely event that it breaks something. If anyone feels strongly about
making this a single patch, I have no objection to squashing these together.
Sean Christopherson (2):
KVM: SVM: Take and hold ir_list_lock when updating vCPU's Physical ID
entry
KVM: SVM: Set target pCPU during IRTE update if target vCPU is running
arch/x86/kvm/svm/avic.c | 59 +++++++++++++++++++++++++++++++++++------
1 file changed, 51 insertions(+), 8 deletions(-)
base-commit: 240f736891887939571854bd6d734b6c9291f22e
--
2.41.0.640.ga95def55d0-goog
Hoist the acquisition of ir_list_lock from avic_update_iommu_vcpu_affinity()
to its two callers, avic_vcpu_load() and avic_vcpu_put(), specifically to
encapsulate the write to the vCPU's entry in the AVIC Physical ID table.
This will allow a future fix to pull information from the Physical ID entry
when updating the IRTE, without potentially consuming stale information,
i.e. without racing with the vCPU being (un)loaded.
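For example, without the lock spanning both the Physical ID entry access and
the ir_list walk, one possible (purely illustrative) interleaving is:

	IRTE update path                      avic_vcpu_load()
	----------------                      ----------------
	read Physical ID entry
	  (IsRunning not yet set)
	                                      set pCPU + IsRunning in the entry
	                                      walk ir_list
	                                        (new IRTE not yet on the list)
	skip the pCPU update
	add the new IRTE to ir_list

where the IRTE never learns the target pCPU until the next put+load cycle.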
Add a comment to call out that ir_list_lock does NOT protect against
multiple writers, specifically that reading the Physical ID entry in
avic_vcpu_put() outside of the lock is safe.
To preserve some semblance of independence from ir_list_lock, keep the
READ_ONCE() in avic_vcpu_load() even though acquiring the spinlock
effectively ensures the load(s) will be generated after acquiring the
lock.
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/avic.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cfc8ab773025..8e041b215ddb 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -986,10 +986,11 @@ static inline int
avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
{
int ret = 0;
- unsigned long flags;
struct amd_svm_iommu_ir *ir;
struct vcpu_svm *svm = to_svm(vcpu);
+ lockdep_assert_held(&svm->ir_list_lock);
+
if (!kvm_arch_has_assigned_device(vcpu->kvm))
return 0;
@@ -997,19 +998,15 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
* Here, we go through the per-vcpu ir_list to update all existing
* interrupt remapping table entry targeting this vcpu.
*/
- spin_lock_irqsave(&svm->ir_list_lock, flags);
-
if (list_empty(&svm->ir_list))
- goto out;
+ return 0;
list_for_each_entry(ir, &svm->ir_list, node) {
ret = amd_iommu_update_ga(cpu, r, ir->data);
if (ret)
- break;
+ return ret;
}
-out:
- spin_unlock_irqrestore(&svm->ir_list_lock, flags);
- return ret;
+ return 0;
}
void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
@@ -1017,6 +1014,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
u64 entry;
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long flags;
lockdep_assert_preemption_disabled();
@@ -1033,6 +1031,8 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (kvm_vcpu_is_blocking(vcpu))
return;
+ spin_lock_irqsave(&svm->ir_list_lock, flags);
+
entry = READ_ONCE(*(svm->avic_physical_id_cache));
WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
@@ -1042,25 +1042,40 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
+
+ spin_unlock_irqrestore(&svm->ir_list_lock, flags);
}
void avic_vcpu_put(struct kvm_vcpu *vcpu)
{
u64 entry;
struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long flags;
lockdep_assert_preemption_disabled();
+ /*
+ * Note, reading the Physical ID entry outside of ir_list_lock is safe
+ * as only the pCPU that has loaded (or is loading) the vCPU is allowed
+ * to modify the entry, and preemption is disabled. I.e. the vCPU
+ * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
+ * recursively.
+ */
entry = READ_ONCE(*(svm->avic_physical_id_cache));
/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
return;
+ spin_lock_irqsave(&svm->ir_list_lock, flags);
+
avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+
+ spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+
}
void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
--
2.41.0.640.ga95def55d0-goog
On 09/08/2023 00:31, Sean Christopherson wrote:
> Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
> updating IRTE routing. Not setting the pCPU means the IOMMU will signal
> the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
>
It could also be framed as an inefficiency: we depend on the GALog (for a
running vCPU) for interrupt delivery until the put+load cycle happens. I don't
think I ever reproduced the missed-interrupt case in our stress testing.
> I waffled for far too long between making this one patch or two. Moving
> the lock doesn't make all that much sense as a standalone patch, but in the
> end, I decided that isolating the locking change would be useful in the
> unlikely event that it breaks something. If anyone feels strongly about
> making this a single patch, I have no objection to squashing these together.
>
IMHO, it looks better as two patches.
For what it's worth:
Reviewed-by: Joao Martins <[email protected]>
I think Alejandro had reported his testing as successful here:
https://lore.kernel.org/kvm/[email protected]/
OTOH, he didn't give the Tested-by explicitly
On Wed, Aug 09, 2023, Joao Martins wrote:
> On 09/08/2023 00:31, Sean Christopherson wrote:
> > Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
> > updating IRTE routing. Not setting the pCPU means the IOMMU will signal
> > the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
> >
>
> It could also be framed as an inefficiency: we depend on the GALog (for a
> running vCPU) for interrupt delivery until the put+load cycle happens. I don't
> think I ever reproduced the missed-interrupt case in our stress testing.
Ah, I'll reword the changelog in patch 2 if this only delays the interrupt instead
of dropping it entirely.
> > I waffled for far too long between making this one patch or two. Moving
> > the lock doesn't make all that much sense as a standalone patch, but in the
> > end, I decided that isolating the locking change would be useful in the
> > unlikely event that it breaks something. If anyone feels strongly about
> > making this a single patch, I have no objection to squashing these together.
> >
> IMHO, it looks better as two patches.
>
> For what it's worth:
>
> Reviewed-by: Joao Martins <[email protected]>
>
> I think Alejandro had reported his testing as successful here:
>
> https://lore.kernel.org/kvm/[email protected]/
>
> OTOH, he didn't give the Tested-by explicitly
Yeah, I almost asked for a Tested-by, but figured it would be just as easy to
post the patches.
On 8/9/23 10:23, Sean Christopherson wrote:
> On Wed, Aug 09, 2023, Joao Martins wrote:
>> On 09/08/2023 00:31, Sean Christopherson wrote:
>>> Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
>>> updating IRTE routing. Not setting the pCPU means the IOMMU will signal
>>> the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
>>>
>>
>> It could also be framed as an inefficiency: we depend on the GALog (for a
>> running vCPU) for interrupt delivery until the put+load cycle happens. I don't
>> think I ever reproduced the missed-interrupt case in our stress testing.
Right, I was never able to see any dropped interrupts when testing the baseline host kernel with "idle=poll" on my guest.
Though I didn't reproduce Dengqiao's setup exactly, e.g. they imply using isolcpus in the host kernel params.
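For reference, the relevant knobs were roughly the following; the CPU list and the rest of the original host config are placeholders, since I don't know Dengqiao's exact parameters:

	# guest kernel command line (keeps vCPUs busy so they stay "running")
	... idle=poll

	# host kernel command line (implied by the original report; CPU list is a placeholder)
	... isolcpus=<cpus dedicated to the vCPU threads>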
>
> Ah, I'll reword the changelog in patch 2 if this only delays the interrupt instead
> of dropping it entirely.
>
>>> I waffled for far too long between making this one patch or two. Moving
>>> the lock doesn't make all that much sense as a standalone patch, but in the
>>> end, I decided that isolating the locking change would be useful in the
>>> unlikely event that it breaks something. If anyone feels strongly about
>>> making this a single patch, I have no objection to squashing these together.
>>>
>> IMHO, it looks better as two patches.
>>
>> For what it's worth:
>>
>> Reviewed-by: Joao Martins <[email protected]>
>>
>> I think Alejandro had reported his testing as successful here:
>>
>> https://lore.kernel.org/kvm/[email protected]/
>>
>> OTOH, he didn't give the Tested-by explicitly
>
> Yeah, I almost asked for a Tested-by, but figured it would be just as easy to
> post the patches.
I was hoping to find more time to test with other configs (i.e. more closely matching the original environment).
That being said, besides the positive results from the validation script mentioned earlier, I have been using the
patched kernel to launch guests in my setup for quite some time now without encountering any issues. From my side:
Tested-by: Alejandro Jimenez <[email protected]>