2023-05-16 12:36:46

by dengqiao.joey

Subject: [PATCH] KVM: SVM: Update destination when updating pi irte

The destination of the IRTE will be cleared by the IOMMU driver when
updating the IRTE. It will only be set correctly again in vcpu_load.
Before vcpu_load is executed, the IOMMU will deliver the doorbell
message to the wrong physical cpu, which means the vcpu cannot
recognize interrupt delivery while running in non-root mode.

Signed-off-by: dengqiao.joey <[email protected]>
---
arch/x86/kvm/svm/avic.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cfc8ab773025..16fe41429123 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -933,8 +933,11 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
* we can reference to them directly when we update vcpu
* scheduling information in IOMMU irte.
*/
- if (!ret && pi.is_guest_mode)
+ if (!ret && pi.is_guest_mode) {
+ amd_iommu_update_ga(kvm_cpu_get_apicid(svm->vcpu.cpu),
+ true, pi.ir_data);
svm_ir_list_add(svm, &pi);
+ }
} else {
/* Use legacy mode in IRTE */
struct amd_iommu_pi_data pi;
--
2.11.0



2023-05-16 16:08:00

by Sean Christopherson

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

On Tue, May 16, 2023, dengqiao.joey wrote:
> The destination of the IRTE will be cleared by the IOMMU driver when
> updating the IRTE. It will only be set correctly again in vcpu_load.
> Before vcpu_load is executed, the IOMMU will deliver the doorbell
> message to the wrong physical cpu, which means the vcpu cannot
> recognize interrupt delivery while running in non-root mode.

I suspect this is actually a bug in amd_iommu_activate_guest_mode(). Does the
below fix the issue? If so, can you give a Tested-by and/or Reviewed-by on the
posted patch[*]?

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 5a505ba5467e..fbe77ee2d26c 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3484,8 +3484,7 @@ int amd_iommu_activate_guest_mode(void *data)
struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
u64 valid;

- if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
- !entry || entry->lo.fields_vapic.guest_mode)
+ if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) || !entry)
return 0;

valid = entry->lo.fields_vapic.valid;

[*] https://lore.kernel.org/all/[email protected]

2023-05-17 12:12:08

by dengqiao.joey

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

This seems to be a different issue. Joao's patch tries to fix the issue
that IRTE is not changed. The problem I encountered is that IRTE does
get changed, thus the destination field is also cleared by
amd_iommu_activate_guest_mode().
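
For reference, the body of amd_iommu_activate_guest_mode() goes roughly like
this (hand-trimmed from drivers/iommu/amd/iommu.c, guard and flush elided, so
treat it as a sketch rather than the verbatim source):

	valid = entry->lo.fields_vapic.valid;

	/* Zeroing both halves of the 128-bit IRTE also wipes the
	 * destination (and is_run) that vcpu_load had programmed. */
	entry->lo.val = 0;
	entry->hi.val = 0;

	entry->lo.fields_vapic.valid       = valid;
	entry->lo.fields_vapic.guest_mode  = 1;
	entry->lo.fields_vapic.ga_log_intr = 1;
	entry->hi.fields.ga_root_ptr       = ir_data->ga_root_ptr;
	entry->hi.fields.vector            = ir_data->ga_vector;
	entry->lo.fields_vapic.ga_tag      = ir_data->ga_tag;
	/* Destination stays 0 until the next avic_vcpu_load(). */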

2023-05-17 15:36:53

by Joao Martins

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

On 17/05/2023 13:04, dengqiao.joey wrote:
> This seems to be a different issue. Joao's patch tries to fix the issue
> that IRTE is not changed. The problem I encountered is that IRTE does
> get changed, thus the destination field is also cleared by
> amd_iommu_activate_guest_mode().

Whether amd_iommu_activate_guest_mode() clears the destination field or not
doesn't change the situation, I think. So I don't think that is the problem
you're having. amd_iommu_activate_guest_mode() will always make IRTEs with
isRun=0, which means your VF interrupts get injected via GALog events and the
DestID is meaningless there :/ Even if you leave the old affinity there with
the isRun=1 bit, it's still wrong, as by the time you run the vcpu, the wrong
affinity will be used by the VF. More fundamentally, I am not sure that on IRTE
routing updates in the VM you can look at vcpu->cpu the way you are using it;
it could potentially be invalid when vcpus are blocked/preempted. You're
changing IRTEs for the whole VM in pi_update_irte().

My changes essentially ensure that IRTEs are always updated with the right
GATag set in the IRTE ... and then when the vCPU migrates, wakes or
blocks+unblocks, it will change IRTE[DestID] to the new pcpu (which the vcpu is
running on) via amd_iommu_update_ga(). So there's a good chance my changes fix
your affinity issue.

Joao

2023-05-18 04:08:35

by dengqiao.joey

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

On Wed, May 17, 2023 at 11:32 PM Joao Martins <[email protected]> wrote:
>
> On 17/05/2023 13:04, dengqiao.joey wrote:
> > This seems to be a different issue. Joao's patch tries to fix the issue
> > that IRTE is not changed. The problem I encountered is that IRTE does
> > get changed, thus the destination field is also cleared by
> > amd_iommu_activate_guest_mode().
>
> Whether amd_iommu_activate_guest_mode() clears the destination field or not
> doesn't change the situation, I think. So I don't think that is the problem
> you're having. amd_iommu_activate_guest_mode() will always make IRTEs with
> isRun=0, which means your VF interrupts get injected via GALog events and the
> DestID is meaningless there :/ Even if you leave the old affinity there with
> the isRun=1 bit, it's still wrong, as by the time you run the vcpu, the wrong
> affinity will be used by the VF. More fundamentally, I am not sure that on IRTE
> routing updates in the VM you can look at vcpu->cpu the way you are using it;
> it could potentially be invalid when vcpus are blocked/preempted. You're
> changing IRTEs for the whole VM in pi_update_irte().
>
> My changes essentially ensure that IRTEs are always updated with the right
> GATag set in the IRTE ... and then when the vCPU migrates, wakes or
> blocks+unblocks, it will change IRTE[DestID] to the new pcpu (which the vcpu is
> running on) via amd_iommu_update_ga(). So there's a good chance my changes fix
> your affinity issue.
>
> Joao

Sorry, I forgot to mention before that in my application scenario, the vcpu runs on
a dedicated isolated cpu and uses "mwait" instead of "halt". So it will not be migrated
or blocked/woken, and rarely gets preempted by other threads. I think your patch cannot fix
my issue because the vcpu rarely gets the opportunity to execute amd_iommu_update_ga()
from vcpu_load().

So each time the interrupt affinity is updated, I can observe the loss of VF interrupts.
After applying my patch, the problem is solved.

2023-05-18 08:30:21

by Maxim Levitsky

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

On Thu, 2023-05-18 at 11:58 +0800, dengqiao.joey wrote:
> On Wed, May 17, 2023 at 11:32 PM Joao Martins <[email protected]> wrote:
> > On 17/05/2023 13:04, dengqiao.joey wrote:
> > > This seems to be a different issue. Joao's patch tries to fix the issue
> > > that IRTE is not changed. The problem I encountered is that IRTE does
> > > get changed, thus the destination field is also cleared by
> > > amd_iommu_activate_guest_mode().
> >
> > Whether amd_iommu_activate_guest_mode() clears the destination field or not
> > doesn't change the situation, I think. So I don't think that is the problem
> > you're having. amd_iommu_activate_guest_mode() will always make IRTEs with
> > isRun=0, which means your VF interrupts get injected via GALog events and the
> > DestID is meaningless there :/ Even if you leave the old affinity there with
> > the isRun=1 bit, it's still wrong, as by the time you run the vcpu, the wrong
> > affinity will be used by the VF. More fundamentally, I am not sure that on IRTE
> > routing updates in the VM you can look at vcpu->cpu the way you are using it;
> > it could potentially be invalid when vcpus are blocked/preempted. You're
> > changing IRTEs for the whole VM in pi_update_irte().
> >
> > My changes essentially ensure that IRTEs are always updated with the right
> > GATag set in the IRTE ... and then when the vCPU migrates, wakes or
> > blocks+unblocks, it will change IRTE[DestID] to the new pcpu (which the vcpu is
> > running on) via amd_iommu_update_ga(). So there's a good chance my changes fix
> > your affinity issue.
> >
> > Joao
>
> Sorry, I forgot to mention before that in my application scenario, the vcpu runs on
> a dedicated isolated cpu and uses "mwait" instead of "halt". So it will not be migrated
> or blocked/woken, and rarely gets preempted by other threads. I think your patch cannot fix
> my issue because the vcpu rarely gets the opportunity to execute amd_iommu_update_ga()
> from vcpu_load().
>
> So each time the interrupt affinity is updated, I can observe the loss of VF interrupts.
> After applying my patch, the problem is solved.
>

Just to see if I understand the issue correctly:

vcpu 'affinity' of a hardware interrupt to a guest vCPU
(which in the AMD case is pretty much the address of the APIC backing page + vector)

changes in these cases:

1. when a new VFIO device is attached to the VM and the guest enables MSI,
which triggers attachment of an irqfd that is matched with the eventfd from the
hardware device, and an irq bypass is created.

2. when the guest changes the MSI/MSI-X settings - this is similar to 1,
and can happen at any time

3. When the vCPU is migrated.

As I understand it, in cases 1 and 2 avic_pi_update_irte is called and indeed,
if the vCPU is already running, and especially if it is in MWAIT/HLT state, then
it will no longer receive doorbell kicks, and the GA log handler won't do anything
for this vCPU either since it is running (it just does kvm_vcpu_wake_up)

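For context, the GA log notifier really is just a wakeup. Roughly (trimmed from
avic_ga_log_notifier() in arch/x86/kvm/svm/avic.c, with the hash-lock details
elided, so a sketch rather than the exact code):

static int avic_ga_log_notifier(u32 ga_tag)
{
	u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
	u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
	struct kvm_vcpu *vcpu = NULL;
	struct kvm_svm *kvm_svm;

	/* Find the VM matching the GATag, then its vCPU. */
	hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
		if (kvm_svm->avic_vm_id != vm_id)
			continue;
		vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
		break;
	}

	/* Only a wakeup: a vCPU that is already running is left alone. */
	if (vcpu)
		kvm_vcpu_wake_up(vcpu);

	return 0;
}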

In case 3, the vCPU is eventually loaded again, and that calls
avic_update_iommu_vcpu_affinity which calls into the IOMMU's 'amd_iommu_update_ga'
to update the target pCPU in the IOMMU remap entry.


Now, it is possible that the MSI/MSI-X config space updates usually don't happen often
(e.g. if the guest doesn't migrate interrupts often, or the device has per-CPU interrupts (e.g. nvme)),
and also vcpu loads/puts mask this, especially since the MSI config space update itself is done
via a userspace vmexit (e.g. the qemu machine model) which also triggers vcpu load/puts.

I think that we do need a flag indicating if the vCPU is currently running and if yes,
then use svm->vcpu.cpu (or put -1 in it when it is not running, or something)
(currently the vcpu->cpu remains set when the vCPU is put)

In other words, if a vCPU is running, then avic_pi_update_irte should put the correct pCPU number,
and if it raced with vCPU put/load, then the later one should win and put the correct value.
This can be done either with a lock or barriers.
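
(A sketch of what I mean, with made-up names - pi_vcpu_state and
pi_write_irte_dest() are illustrative only, not existing kernel code; the point
is that both paths serialize on one lock, so the last writer publishes the
final pCPU:)

/* Illustrative only; not an existing kernel type. */
struct pi_vcpu_state {
	spinlock_t lock;
	int loaded_cpu;		/* pCPU while loaded, -1 otherwise */
};

/* Hypothetical helper that programs destination + isRun into the IRTE. */
static void pi_write_irte_dest(struct pi_vcpu_state *s, int cpu, bool run)
{
	/* write IRTE[DestID] = cpu, IRTE[isRun] = run; elided */
}

static void pi_vcpu_load(struct pi_vcpu_state *s, int cpu)
{
	spin_lock(&s->lock);
	s->loaded_cpu = cpu;
	pi_write_irte_dest(s, cpu, true);
	spin_unlock(&s->lock);
}

static void pi_route_update(struct pi_vcpu_state *s)
{
	spin_lock(&s->lock);
	/* Only a loaded vCPU gets isRun=1 and a pCPU in the IRTE. */
	if (s->loaded_cpu >= 0)
		pi_write_irte_dest(s, s->loaded_cpu, true);
	spin_unlock(&s->lock);
}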

In the current form, this patch is wrong since if the target is sleeping, then it will tell the iommu
to send doorbells to the last pCPU the target was running on.

FYI, I had the same issue with my nested AVIC implementation - when we have a nested vCPU run,
I need to translate the shadow physid table with the pCPU numbers, and at the same time,
vCPU migration also triggers updates to the same table. I used a lock in the first implementation
but it can be improved.

As for the variable, I used a private svm variable, 'nested_avic_active', which was true only when
the vCPU is loaded and its nested AVIC activated.

Best regards,
Maxim Levitsky


2023-05-18 09:29:15

by Joao Martins

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte



On 18/05/2023 09:19, Maxim Levitsky wrote:
> On Thu, 2023-05-18 at 11:58 +0800, dengqiao.joey wrote:
>> On Wed, May 17, 2023 at 11:32 PM Joao Martins <[email protected]> wrote:
>>> On 17/05/2023 13:04, dengqiao.joey wrote:
>>>> This seems to be a different issue. Joao's patch tries to fix the issue
>>>> that IRTE is not changed. The problem I encountered is that IRTE does
>>>> get changed, thus the destination field is also cleared by
>>>> amd_iommu_activate_guest_mode().
>>>
>>> Whether amd_iommu_activate_guest_mode() clears the destination field or not
>>> doesn't change the situation, I think. So I don't think that is the problem
>>> you're having. amd_iommu_activate_guest_mode() will always make IRTEs with
>>> isRun=0, which means your VF interrupts get injected via GALog events and the
>>> DestID is meaningless there :/ Even if you leave the old affinity there with
>>> the isRun=1 bit, it's still wrong, as by the time you run the vcpu, the wrong
>>> affinity will be used by the VF. More fundamentally, I am not sure that on IRTE
>>> routing updates in the VM you can look at vcpu->cpu the way you are using it;
>>> it could potentially be invalid when vcpus are blocked/preempted. You're
>>> changing IRTEs for the whole VM in pi_update_irte().
>>>
>>> My changes essentially ensure that IRTEs are always updated with the right
>>> GATag set in the IRTE ... and then when the vCPU migrates, wakes or
>>> blocks+unblocks, it will change IRTE[DestID] to the new pcpu (which the vcpu is
>>> running on) via amd_iommu_update_ga(). So there's a good chance my changes fix
>>> your affinity issue.
>>>
>>> Joao
>>
>> Sorry, I forgot to mention before that in my application scenario, the vcpu runs on
>> a dedicated isolated cpu and uses "mwait" instead of "halt". So it will not be migrated
>> or blocked/woken, and rarely gets preempted by other threads. I think your patch cannot fix
>> my issue because the vcpu rarely gets the opportunity to execute amd_iommu_update_ga()
>> from vcpu_load().
>>
>> So each time the interrupt affinity is updated, I can observe the loss of VF interrupts.
>> After applying my patch, the problem is solved.
>>
>
> Just to see if I understand the issue correctly:
>
> vcpu 'affinity' of a hardware interrupt to a guest vCPU
> (which in the AMD case is pretty much the address of the APIC backing page + vector)
>
> changes in these cases:
>
> 1. when a new VFIO device is attached to the VM and the guest enables MSI,
> which triggers attachment of an irqfd that is matched with the eventfd from the
> hardware device, and an irq bypass is created.
>
> 2. when the guest changes the MSI/MSI-X settings - this is similar to 1,
> and can happen at any time
>
> 3. When the vCPU is migrated.
>
> As I understand it, in cases 1 and 2 avic_pi_update_irte is called and indeed,
> if the vCPU is already running, and especially if it is in MWAIT/HLT state, then
> it will no longer receive doorbell kicks, and the GA log handler won't do anything
> for this vCPU either since it is running (it just does kvm_vcpu_wake_up)
>
That's by design, I think, as the IOMMU will still update the APIC table at the
specified vector of the table. The problem is that it is more inefficient, but
it will still inject the interrupt correctly, as activate_guest_mode will also
update the vector and the new descriptor data representing the new vCPU APIC.
The only thing that gets invalidated by a vcpu affinity change is the doorbell:
it isn't right as the vCPU affinity has changed, and thus we need to update it
on the next vmentry.

>
> In case 3, the vCPU is eventually loaded again, and that calls
> avic_update_iommu_vcpu_affinity which calls into the IOMMU's 'amd_iommu_update_ga'
> to update the target pCPU in the IOMMU remap entry.
>
Correct.

I think the issue here is that affinity updates invalidate the current affinity,
and that makes it depend on vcpu wakeups; but if the vCPU never exits, the IRTE
is never marked as Running until the next vcpu_load().

>
> Now, it is possible that the MSI/MSI-X config space updates usually don't happen often
> (e.g. if the guest doesn't migrate interrupts often, or the device has per-CPU interrupts (e.g. nvme)),
> and also vcpu loads/puts mask this, especially since the MSI config space update itself is done
> via a userspace vmexit (e.g. the qemu machine model) which also triggers vcpu load/puts.
>
Yeap.

> I think that we do need a flag indicating if the vCPU is currently running and if yes,
> then use svm->vcpu.cpu (or put -1 in it when it is not running, or something)
> (currently the vcpu->cpu remains set when the vCPU is put)
>
> In other words, if a vCPU is running, then avic_pi_update_irte should put the correct pCPU number,
> and if it raced with vCPU put/load, then the later one should win and put the correct value.
> This can be done either with a lock or barriers.
>
If this could be done, it could remove cost from other places and avoid this
little dance of the galog (and avoid its usage as it's not the greatest design
aspect of the IOMMU). We anyways already need to do IRT flushes in all these
things with regards to updating any piece of the IRTE, but we need some care
there too to avoid invalidating too much (which is just as expensive and per-VCPU).

> In the current form, this patch is wrong since if the target is sleeping, then it will tell the iommu
> to send doorbells to the last pCPU the target was running on.
>
Exactly my point.

> FYI, I had the same issue with my nested AVIC implementation - when we have a nested vCPU run,
> I need to translate the shadow physid table with the pCPU numbers, and at the same time,
> vCPU migration also triggers updates to the same table. I used a lock in the first implementation
> but it can be improved.
>

Interesting.

> As for the variable, I used a private svm variable, 'nested_avic_active', which was true only when
> the vCPU is loaded and its nested AVIC activated.
>
I had a different take on it, largely based on how we do other things (e.g.
pfncache updates). By the time we update vCPU affinity we will inevitably
enable GALog with today's semantics, until IRTEs/vCPUs are marked as isRunning
again. So what I was doing was asking the vCPUs to go out of guest mode (but not
those that are blocked, as that is covered) and requesting that the individual
vCPUs update their affinity. This would be another way.

But still quite expensive (as many IPIs as vCPUs updated), but it works as
intended and guest will immediately see the right vcpu affinity. But I honestly
prefer going towards your suggestion (via vcpu.pcpu) if we can have some
insurance that vcpu.cpu is safe to use in pi_update_irte if protected against
preemption/blocking of the VCPU. I would suggest we get some smarts into
changing affinity, such that when the update of the GATag/GAVector/GARootPtr
happens, we also put a pcpu in there. Perhaps we could even remove
activate_guest_mode() and just batch all these updates. If you have some
changes for vcpu::cpu, I am happy to try them out on the AMD IOMMU side on top
of it.

The snippet below isn't exactly a formal RFC patch (it's split up better in my
local branch), but I am attaching the diff anyhow for discussion purposes.

------------------------------->8----------------------------------
Subject: KVM: SVM: Avoid GALog when updating IRTE affinity

Signed-off-by: Joao Martins <[email protected]>
---
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..2dd41dbb51cb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -113,6 +113,8 @@
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_APICV_AFFINITY_UPDATE \
+ KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT)

#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1731,6 +1733,8 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+ void (*update_apicv_affinity)(struct kvm_vcpu *vcpu);
};
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cfc8ab773025..535619a94770 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -39,6 +39,7 @@
* as far as hardware is concerned.
*/
#define AVIC_VCPU_ID_MASK AVIC_PHYSICAL_MAX_INDEX_MASK
+#define AVIC_VCPU_ID_NR (AVIC_PHYSICAL_MAX_INDEX_MASK + 1)

#define AVIC_VM_ID_SHIFT HWEIGHT32(AVIC_PHYSICAL_MAX_INDEX_MASK)
#define AVIC_VM_ID_MASK (GENMASK(31, AVIC_VM_ID_SHIFT) >> AVIC_VM_ID_SHIFT)
@@ -878,8 +879,10 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
uint32_t guest_irq, bool set)
{
+ DECLARE_BITMAP(vcpu_bitmap, AVIC_VCPU_ID_NR);
struct kvm_kernel_irq_routing_entry *e;
struct kvm_irq_routing_table *irq_rt;
+ bool update_affinity = false;
int idx, ret = 0;

if (!kvm_arch_has_assigned_device(kvm) ||
@@ -933,8 +936,16 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
* we can reference to them directly when we update vcpu
* scheduling information in IOMMU irte.
*/
- if (!ret && pi.is_guest_mode)
+ if (!ret && pi.is_guest_mode) {
svm_ir_list_add(svm, &pi);
+ if (!update_affinity) {
+ bitmap_zero(vcpu_bitmap,
+ AVIC_VCPU_ID_NR);
+ update_affinity = true;
+ }
+ __set_bit(svm->vcpu.vcpu_id,
+ vcpu_bitmap);
+ }
} else {
/* Use legacy mode in IRTE */
struct amd_iommu_pi_data pi;
@@ -979,6 +990,14 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
ret = 0;
out:
srcu_read_unlock(&kvm->irq_srcu, idx);
+ if (update_affinity) {
+ preempt_disable();
+ local_irq_enable();
+ kvm_make_vcpus_request_mask(kvm, KVM_REQ_APICV_AFFINITY_UPDATE,
+ vcpu_bitmap);
+ local_irq_disable();
+ preempt_enable();
+ }
return ret;
}

@@ -1012,6 +1031,21 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
return ret;
}

+void avic_update_affinity(struct kvm_vcpu *vcpu)
+{
+ int h_physical_id = kvm_cpu_get_apicid(vcpu->cpu);
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 entry;
+
+ entry = READ_ONCE(*(svm->avic_physical_id_cache));
+
+ /* Nothing to do if IsRunning == '0' due to vCPU blocking. */
+ if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+ return;
+
+ avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
+}
+
void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
u64 entry;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ca32389f3c36..2f3eb13ac248 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4935,6 +4935,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {

.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+ .update_apicv_affinity = avic_update_affinity,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f44751dd8d5d..c6c47c5c7ad8 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -706,6 +706,7 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
void avic_ring_doorbell(struct kvm_vcpu *vcpu);
unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void avic_update_affinity(struct kvm_vcpu *vcpu);


/* sev.c */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ceb7c5e9cf9e..62fe938ea1c2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10458,6 +10458,18 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
}

+void kvm_vcpu_update_apicv_affinity(struct kvm_vcpu *vcpu)
+{
+ if (!lapic_in_kernel(vcpu))
+ return;
+
+ if (!kvm_x86_ops.update_apicv_affinity)
+ return;
+
+ kvm_x86_ops.update_apicv_affinity(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv_affinity);
+
void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
{
smp_send_reschedule(vcpu->cpu);
@@ -10586,6 +10598,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu_load_eoi_exitmap(vcpu);
if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
kvm_vcpu_reload_apic_access_page(vcpu);
+ if (kvm_check_request(KVM_REQ_APICV_AFFINITY_UPDATE, vcpu))
+ kvm_vcpu_update_apicv_affinity(vcpu);
if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb5c13eee193..cd28aabd28f3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -317,6 +317,7 @@ bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,

return called;
}
+EXPORT_SYMBOL_GPL(kvm_make_vcpus_request_mask);

bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
struct kvm_vcpu *except)

2023-06-29 22:38:34

by Sean Christopherson

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

On Thu, May 18, 2023, Joao Martins wrote:
>
> On 18/05/2023 09:19, Maxim Levitsky wrote:
> > I think that we do need a flag indicating if the vCPU is currently
> > running and if yes, then use svm->vcpu.cpu (or put -1 in it when it is not
> > running, or something) (currently the vcpu->cpu remains set when the vCPU
> > is put)
> >
> > In other words, if a vCPU is running, then avic_pi_update_irte should put
> > the correct pCPU number, and if it raced with vCPU put/load, then the later
> > one should win and put the correct value. This can be done either with a
> > lock or barriers.
> >
> If this could be done, it could remove cost from other places and avoid this
> little dance of the galog (and avoid its usage as it's not the greatest design
> aspect of the IOMMU). We anyways already need to do IRT flushes in all these
> things with regards to updating any piece of the IRTE, but we need some care
> there too to avoid invalidating too much (which is just as expensive and per-VCPU).

...

> But still quite expensive (as many IPIs as vCPUs updated), but it works as
> intended and guest will immediately see the right vcpu affinity. But I honestly
> prefer going towards your suggestion (via vcpu.pcpu) if we can have some
> insurance that vcpu.cpu is safe to use in pi_update_irte if protected against
> preemption/blocking of the VCPU.

I think we have all the necessary info, and even a handy dandy spinlock to ensure
ordering. Disclaimers: compile tested only, I know almost nothing about the IOMMU
side of things, and I don't know if I understood the needs for the !IsRunning cases.

Disclaimers aside, this should point the IOMMU at the right pCPU when the target
vCPU changes and the new vCPU is actively running.

---
arch/x86/kvm/svm/avic.c | 44 +++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cfc8ab773025..703ad9af73eb 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -791,6 +791,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
int ret = 0;
unsigned long flags;
struct amd_svm_iommu_ir *ir;
+ u64 entry;

/**
* In some cases, the existing irte is updated and re-set,
@@ -824,6 +825,18 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
ir->data = pi->ir_data;

spin_lock_irqsave(&svm->ir_list_lock, flags);
+
+ /*
+ * Update the target pCPU for IOMMU doorbells if the vCPU is running.
+ * If the vCPU is NOT running, i.e. is blocking or scheduled out, KVM
+ * will update the pCPU info when the vCPU is awakened and/or scheduled in.
+ * See also avic_vcpu_load().
+ */
+ entry = READ_ONCE(*(svm->avic_physical_id_cache));
+ if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
+ amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
+ true, pi->ir_data);
+
list_add(&ir->node, &svm->ir_list);
spin_unlock_irqrestore(&svm->ir_list_lock, flags);
out:
@@ -986,10 +999,11 @@ static inline int
avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
{
int ret = 0;
- unsigned long flags;
struct amd_svm_iommu_ir *ir;
struct vcpu_svm *svm = to_svm(vcpu);

+ lockdep_assert_held(&svm->ir_list_lock);
+
if (!kvm_arch_has_assigned_device(vcpu->kvm))
return 0;

@@ -997,19 +1011,15 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
* Here, we go through the per-vcpu ir_list to update all existing
* interrupt remapping table entry targeting this vcpu.
*/
- spin_lock_irqsave(&svm->ir_list_lock, flags);
-
if (list_empty(&svm->ir_list))
- goto out;
+ return 0;

list_for_each_entry(ir, &svm->ir_list, node) {
ret = amd_iommu_update_ga(cpu, r, ir->data);
if (ret)
- break;
+ return ret;
}
-out:
- spin_unlock_irqrestore(&svm->ir_list_lock, flags);
- return ret;
+ return 0;
}

void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
@@ -1017,6 +1027,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
u64 entry;
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long flags;

lockdep_assert_preemption_disabled();

@@ -1033,6 +1044,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (kvm_vcpu_is_blocking(vcpu))
return;

+ /*
+ * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
+ * _currently_ have assigned devices, as that can change. Holding
+ * ir_list_lock ensures that either svm_ir_list_add() will consume
+ * up-to-date entry information, or that this task will wait until
+ * svm_ir_list_add() completes to set the new target pCPU.
+ */
+ spin_lock_irqsave(&svm->ir_list_lock, flags);
+
entry = READ_ONCE(*(svm->avic_physical_id_cache));
WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);

@@ -1042,12 +1062,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
+
+ spin_unlock_irqrestore(&svm->ir_list_lock, flags);
}

void avic_vcpu_put(struct kvm_vcpu *vcpu)
{
u64 entry;
struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long flags;

lockdep_assert_preemption_disabled();

@@ -1057,10 +1080,15 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
return;

+ spin_lock_irqsave(&svm->ir_list_lock, flags);
+
avic_update_iommu_vcpu_affinity(vcpu, -1, 0);

entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+
+ spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+
}

void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)

base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60
--


2023-07-14 13:27:28

by Joao Martins

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

+Suravee, +Alejandro

On 29/06/2023 23:35, Sean Christopherson wrote:
> On Thu, May 18, 2023, Joao Martins wrote:
>> On 18/05/2023 09:19, Maxim Levitsky wrote:
>>> I think that we do need a flag indicating if the vCPU is currently
>>> running and if yes, then use svm->vcpu.cpu (or put -1 in it when it is not
>>> running, or something) (currently the vcpu->cpu remains set when the vCPU
>>> is put)
>>>
>>> In other words, if a vCPU is running, then avic_pi_update_irte should put
>>> the correct pCPU number, and if it raced with vCPU put/load, then the later
>>> one should win and put the correct value. This can be done either with a
>>> lock or barriers.
>>>
>> If this could be done, it could remove cost from other places and avoid this
>> little dance of the galog (and avoid its usage as it's not the greatest design
>> aspect of the IOMMU). We anyways already need to do IRT flushes in all these
>> things with regards to updating any piece of the IRTE, but we need some care
>> there too to avoid invalidating too much (which is just as expensive and per-VCPU).
>
> ...
>
>> But still quite expensive (as many IPIs as vCPUs updated), but it works as
>> intended and guest will immediately see the right vcpu affinity. But I honestly
>> prefer going towards your suggestion (via vcpu.pcpu) if we can have some
>> insurance that vcpu.cpu is safe to use in pi_update_irte if protected against
>> preemption/blocking of the VCPU.
>
> I think we have all the necessary info, and even a handy dandy spinlock to ensure
> ordering. Disclaimers: compile tested only, I know almost nothing about the IOMMU
> side of things, and I don't know if I understood the needs for the !IsRunning cases.
>
I was avoiding grabbing that lock, but now that I think about it, it shouldn't do
much harm.

My only concern has mostly been whether we mark the IRQ isRunning=1 on a vcpu
that is about to block, as then the doorbell rung by the IOMMU won't do anything
to the guest. But IIUC the physical ID cache read-once should cover that.

> Disclaimers aside, this should point the IOMMU at the right pCPU when the target
> vCPU changes and the new vCPU is actively running.
>
Yeap, it should. I am gonna see if we can test this next week (even if not me,
but someone from my team).

> ---
> arch/x86/kvm/svm/avic.c | 44 +++++++++++++++++++++++++++++++++--------
> 1 file changed, 36 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index cfc8ab773025..703ad9af73eb 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -791,6 +791,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
> int ret = 0;
> unsigned long flags;
> struct amd_svm_iommu_ir *ir;
> + u64 entry;
>
> /**
> * In some cases, the existing irte is updated and re-set,
> @@ -824,6 +825,18 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
> ir->data = pi->ir_data;
>
> spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
> + /*
> + * Update the target pCPU for IOMMU doorbells if the vCPU is running.
> + * If the vCPU is NOT running, i.e. is blocking or scheduled out, KVM
> + * will update the pCPU info when the vCPU is awakened and/or scheduled in.
> + * See also avic_vcpu_load().
> + */
> + entry = READ_ONCE(*(svm->avic_physical_id_cache));
> + if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
> + amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
> + true, pi->ir_data);
> +

Ah! Totally forgot about the ID cache from AVIC. And it's already paired with
barriers Maxim was alluding to. Much better than trying to get a safe read of
vcpu::cpu.

> list_add(&ir->node, &svm->ir_list);
> spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> out:
> @@ -986,10 +999,11 @@ static inline int
> avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
> {
> int ret = 0;
> - unsigned long flags;
> struct amd_svm_iommu_ir *ir;
> struct vcpu_svm *svm = to_svm(vcpu);
>
> + lockdep_assert_held(&svm->ir_list_lock);
> +
> if (!kvm_arch_has_assigned_device(vcpu->kvm))
> return 0;
>
> @@ -997,19 +1011,15 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
> * Here, we go through the per-vcpu ir_list to update all existing
> * interrupt remapping table entry targeting this vcpu.
> */
> - spin_lock_irqsave(&svm->ir_list_lock, flags);
> -
> if (list_empty(&svm->ir_list))
> - goto out;
> + return 0;
>
> list_for_each_entry(ir, &svm->ir_list, node) {
> ret = amd_iommu_update_ga(cpu, r, ir->data);
> if (ret)
> - break;
> + return ret;
> }
> -out:
> - spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> - return ret;
> + return 0;
> }
>
> void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> @@ -1017,6 +1027,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> u64 entry;
> int h_physical_id = kvm_cpu_get_apicid(cpu);
> struct vcpu_svm *svm = to_svm(vcpu);
> + unsigned long flags;
>
> lockdep_assert_preemption_disabled();
>
> @@ -1033,6 +1044,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (kvm_vcpu_is_blocking(vcpu))
> return;
>
> + /*
> + * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
> + * _currently_ have assigned devices, as that can change. Holding
> + * ir_list_lock ensures that either svm_ir_list_add() will consume
> + * up-to-date entry information, or that this task will wait until
> + * svm_ir_list_add() completes to set the new target pCPU.
> + */
> + spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
> entry = READ_ONCE(*(svm->avic_physical_id_cache));
> WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
>
> @@ -1042,12 +1062,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>
> WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
> avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
> +
> + spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> }
>
> void avic_vcpu_put(struct kvm_vcpu *vcpu)
> {
> u64 entry;
> struct vcpu_svm *svm = to_svm(vcpu);
> + unsigned long flags;
>
> lockdep_assert_preemption_disabled();
>
> @@ -1057,10 +1080,15 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
> if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
> return;
>
> + spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
> avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
>
> entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
> +
> + spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> +
> }

Other than that, it looks good. To some extent we are doing an update_ga()
just like Dengqiao was doing, but with the right protection, both from the IR
perspective and via a safe read of the vCPU's cpu.

2023-07-19 16:25:15

by Sean Christopherson

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

On Fri, Jul 14, 2023, Joao Martins wrote:
> +Suravee, +Alejandro
>
> On 29/06/2023 23:35, Sean Christopherson wrote:
> > On Thu, May 18, 2023, Joao Martins wrote:
> >> On 18/05/2023 09:19, Maxim Levitsky wrote:
> >>> I think that we do need a flag indicating if the vCPU is currently
> >>> running and if yes, then use svm->vcpu.cpu (or put -1 in it when it is not
> >>> running, or something) (currently the vcpu->cpu remains set when the vCPU
> >>> is put)
> >>>
> >>> In other words, if a vCPU is running, then avic_pi_update_irte should put
> >>> the correct pCPU number, and if it raced with vCPU put/load, then the later
> >>> one should win and put the correct value. This can be done either with a
> >>> lock or barriers.
> >>>
> >> If this could be done, it could remove cost from other places and avoid this
> >> little dance of the galog (and avoid its usage as it's not the greatest design
> >> aspect of the IOMMU). We anyways already need to do IRT flushes in all these
> >> things with regards to updating any piece of the IRTE, but we need some care
> >> there too to avoid invalidating too much (which is just as expensive and per-VCPU).
> >
> > ...
> >
> >> But still quite expensive (as many IPIs as vCPUs updated), but it works as
> >> intended and guest will immediately see the right vcpu affinity. But I honestly
> >> prefer going towards your suggestion (via vcpu.pcpu) if we can have some
> >> insurance that vcpu.cpu is safe to use in pi_update_irte if protected against
> >> preemption/blocking of the VCPU.
> >
> > I think we have all the necessary info, and even a handy dandy spinlock to ensure
> > ordering. Disclaimers: compile tested only, I know almost nothing about the IOMMU
> > side of things, and I don't know if I understood the needs for the !IsRunning cases.
> >
> I was avoiding grabbing that lock, but now that I think about it, it shouldn't do
> much harm.
>
> My only concern has mostly been whether we mark the IRQ isRunning=1 on a vcpu
> that is about to block, as then the doorbell rung by the IOMMU won't do anything
> to the guest. But IIUC the physical ID cache read-once should cover that.

Acquiring ir_list_lock in avic_vcpu_{load,put}() when modifying
AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK is the key to avoiding ordering issues.
E.g. without the spinlock, READ_ONCE() wouldn't prevent svm_ir_list_add() from
racing with avic_vcpu_{load,put}() and ultimately shoving stale data into the IRTE.

It *should* actually be safe to drop the READ_ONCE() since acquiring/releasing
the spinlock will prevent multiple loads from observing different values. I kept
them mostly to keep the diff small, and to be conservative.

The WRITE_ONCE() needs to stay to ensure that hardware doesn't see inconsistent
information due to store tearing.

If this patch works, I think it makes sense to follow-up with a cleanup patch to
drop the READ_ONCE() and add comments explaining why KVM uses WRITE_ONCE() but
not READ_ONCE().
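
To illustrate with the avic_vcpu_load() update (condensed from the existing
code, not new logic): hardware walks this u64 concurrently, so the final store
must be a single access.

	u64 entry = READ_ONCE(*(svm->avic_physical_id_cache));

	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;

	/* A plain store could be torn into smaller stores by the compiler,
	 * briefly exposing a half-updated entry to hardware; WRITE_ONCE()
	 * forbids that. */
	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);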

2023-07-21 18:54:58

by Alejandro Jimenez

Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte


On 7/19/23 11:57, Sean Christopherson wrote:
> On Fri, Jul 14, 2023, Joao Martins wrote:
>> +Suravee, +Alejandro
>>
>> On 29/06/2023 23:35, Sean Christopherson wrote:
>>> On Thu, May 18, 2023, Joao Martins wrote:
>>>> On 18/05/2023 09:19, Maxim Levitsky wrote:
>>>>> I think that we do need a flag indicating if the vCPU is currently
>>>>> running and if yes, then use svm->vcpu.cpu (or put -1 in it when it is not
>>>>> running, or something) (currently the vcpu->cpu remains set when the vCPU
>>>>> is put)
>>>>>
>>>>> In other words, if a vCPU is running, then avic_pi_update_irte should put
>>>>> the correct pCPU number, and if it raced with vCPU put/load, then the later
>>>>> one should win and put the correct value. This can be done either with a
>>>>> lock or barriers.
>>>>>
>>>> If this could be done, it could remove cost from other places and avoid this
>>>> little dance of the galog (and avoid its usage as it's not the greatest design
>>>> aspect of the IOMMU). We anyways already need to do IRT flushes in all these
>>>> things with regards to updating any piece of the IRTE, but we need some care
>>>> there too to avoid invalidating too much (which is just as expensive and per-VCPU).
>>> ...
>>>
>>>> But still quite expensive (as many IPIs as vCPUs updated), but it works as
>>>> intended and guest will immediately see the right vcpu affinity. But I honestly
>>>> prefer going towards your suggestion (via vcpu.pcpu) if we can have some
>>>> insurance that vcpu.cpu is safe to use in pi_update_irte if protected against
>>>> preemption/blocking of the VCPU.
>>> I think we have all the necessary info, and even a handy dandy spinlock to ensure
>>> ordering. Disclaimers: compile tested only, I know almost nothing about the IOMMU
>>> side of things, and I don't know if I understood the needs for the !IsRunning cases.
>>>
>> I was avoiding grabbing that lock, but now that I think about it, it shouldn't do
>> much harm.
>>
>> My only concern has mostly been whether we mark the IRQ isRunning=1 on a vcpu
>> that is about to block, as then the doorbell rung by the IOMMU won't do anything
>> to the guest. But IIUC the physical ID cache read-once should cover that.
> Acquiring ir_list_lock in avic_vcpu_{load,put}() when modifying
> AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK is the key to avoiding ordering issues.
> E.g. without the spinlock, READ_ONCE() wouldn't prevent svm_ir_list_add() from
> racing with avic_vcpu_{load,put}() and ultimately shoving stale data into the IRTE.
>
> It *should* actually be safe to drop the READ_ONCE() since acquiring/releasing
> the spinlock will prevent multiple loads from observing different values. I kept
> them mostly to keep the diff small, and to be conservative.
>
> The WRITE_ONCE() needs to stay to ensure that hardware doesn't see inconsistent
> information due to store tearing.
>
> If this patch works, I think it makes sense to follow-up with a cleanup patch to
> drop the READ_ONCE() and add comments explaining why KVM uses WRITE_ONCE() but
> not READ_ONCE().
I tested the patch on top of v6.5-rc1, so as to also use the host kernel param
"amd_iommu=irtcachedis" from:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66419036f68a

and found no issues running a 380 vCPU guest (using idle=poll), with an
mlx VF, on a Genoa host.

I didn't run an interrupt-intensive workload, but I stressed the affinity
changes in a tight loop on the guest, running:

--
rcpu=$(($RANDOM % $(nproc)))

# the mlx5_comp* IRQs are in the 35-49 range
rirq=$((35 + $RANDOM % (50-35)))

echo $rcpu > /proc/irq/$rirq/smp_affinity_list

--

As suggested by Joao, I checked to see if there were any 'spurious' GA
Log events that are received while the target vCPU is running. The
window for this to happen is quite tight with the new changes, so after
100k affinity changes there are only 2 reported GA Log events on the guest:

--
@galog_events: 2
@galog_on_running[303]: 1
@galog_on_running[222]: 1

@vcpuHist:
[128, 256)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

--

whereas when running with an unpatched host kernel a
significant number is detected:

--
@galog_events: 2476

[...]

@vcpuHist:
[0]                    2 |                                                    |
[1]                    1 |                                                    |
[2, 4)                11 |                                                    |
[4, 8)                13 |                                                    |
[8, 16)               51 |@@@                                                 |
[16, 32)              99 |@@@@@                                               |
[32, 64)             213 |@@@@@@@@@@@@                                        |
[64, 128)            381 |@@@@@@@@@@@@@@@@@@@@@@                              |
[128, 256)           869 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)           834 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |

--

The script I used makes assumptions about the strict ordering in which the
probes will be registered, but given that the observed behavior is as
predicted, the assumptions seem sound. It is pasted below, in case there
are concerns about the logic.

Thank you,

Alejandro

---

#!/usr/bin/bpftrace

BEGIN
{
    printf("Tracing GALog events between affinity updates... Hit Ctrl-C
to end.\n");
    zero(@irteModified);
    zero(@galog_events);
    zero(@galog_on_running);
}

/*
 * amd_iommu_activate_guest_mode() is mostly called from
 * amd_ir_set_vcpu_affinity() to process vCPU affinity updates, but it also
 * gets called by avic_set_pi_irte_mode() for re-enabling AVIC after
inhibition
 * so this data is unreliable during boot time where most of the inhibition
 * events take place.
 */
kprobe:amd_iommu_activate_guest_mode
{
    /*
     * $vcpu_gatag = (struct amd_ir_data *)arg0->gatag;
     * pahole -C amd_ir_data --hex drivers/iommu/amd/iommu.o
     * shows offset of u32 ga_tag field is 0x40
     * AVIC GATAG encodes vCPU ID in LSB 9 bits
     */
    $vcpu_gatag = (*(uint32 *)(arg0 + 0x40)) & 0x1FF;

    @irteModified[$vcpu_gatag] = 1;
}

tracepoint:kvm:kvm_avic_ga_log
{
    $vcpuid = args->vcpuid;

    @galog_events = count();

    /*
     * GALog reports an entry, and it's expected that the IRTE.isRunning bit
     * is 0. The question becomes if it was cleared by an affinity update
     * and has not been restored by a subsequent call to amd_iommu_update_ga.
     */
    if (@irteModified[args->vcpuid] != 0) {
        @galog_on_running[args->vcpuid] = count();

        @vcpuHist = hist(args->vcpuid);
        //@vcpuHist = lhist(args->vcpuid, 0, 380, 1);
    }
}


kprobe:amd_iommu_update_ga
{
    /*
     * $vcpu_gatag = (struct amd_ir_data *)arg0->gatag;
     * pahole -C amd_ir_data --hex drivers/iommu/amd/iommu.o
     * shows offset of u32 ga_tag field is 0x40
     * AVIC GATAG encodes vCPU ID in LSB 9 bits
     */
    $vcpu_gatag = (*(uint32 *)(arg0 + 0x40)) & 0x1FF;

    /*
     * With the guest running with idle=poll, avic_vcpu_put() should not
     * be called, and any GA Log events detected must be spurious i.e.
     * targeting a vCPU that is currently running. Only clear the flag
     * when setting IsRun to 1 (as in via avic_vcpu_load() or
     * svm_ir_list_add()), to capture the spurious GA Log events.
     * arg1 ==> isRun
     */
    if (arg1 != 0) {
        @irteModified[$vcpu_gatag] = 0;
    }
}

END
{
    clear(@irteModified);
    print(@galog_events);
    print(@galog_on_running);
    print(@vcpuHist);
}