2023-09-28 21:34:11

by Maxim Levitsky

[permalink] [raw]
Subject: [PATCH 0/5] AVIC bugfixes and workarounds

Hi!



This patch series includes several fixes to AVIC I found while working

on a new version of nested AVIC code.



Also while developing it I realized that a very simple workaround for

AVIC's errata #1235 exists and included it in this patch series as well.



Best regards,

Maxim Levitsky



Maxim Levitsky (5):

x86: KVM: SVM: fix for x2avic CVE-2023-5090

x86: KVM: SVM: add support for Invalid IPI Vector interception

x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()

iommu/amd: skip updating the IRTE entry when is_run is already false

x86: KVM: SVM: workaround for AVIC's errata #1235



arch/x86/include/asm/svm.h | 1 +

arch/x86/kvm/svm/avic.c | 55 +++++++++++++++++++++++++++-----------

arch/x86/kvm/svm/nested.c | 3 +++

arch/x86/kvm/svm/svm.c | 3 +--

drivers/iommu/amd/iommu.c | 9 +++++++

5 files changed, 54 insertions(+), 17 deletions(-)



--

2.26.3





2023-09-28 23:36:57

by Maxim Levitsky

[permalink] [raw]
Subject: [PATCH 5/5] x86: KVM: SVM: workaround for AVIC's errata #1235

On Zen2 (and likely on Zen1 as well), AVIC doesn't reliably detect a change
in the 'is_running' bit during ICR write emulation and might skip a
VM exit, if that bit was recently cleared.

The absence of the VM exit, leads to the KVM not waking up / triggering
nested vm exit on the target(s) of the IPI which can, in some cases,
lead to an unbounded delays in the guest execution.

As I recently discovered, a reasonable workaround exists: make the KVM
never set the is_running bit.

This workaround ensures that (*) all ICR writes always cause a VM exit
and therefore correctly emulated, in expense of never enjoying VM exit-less
ICR emulation.

This workaround does carry a performance penalty but according to my
benchmarks is still much better than not using AVIC at all,
because AVIC is still used for the receiving end of the IPIs, and for the
posted interrupts.

If the user is aware of the errata and it doesn't affect his workload,
the user can disable the workaround with 'avic_zen2_errata_workaround=0'

(*) More correctly all ICR writes except when 'Self' shorthand is used:

In this case AVIC skips reading physid table and just sets bits in IRR
of local APIC. Thankfully in this case, the errata is not possible,
therefore an extra workaround for this case is not needed.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/svm/avic.c | 50 +++++++++++++++++++++++++++++------------
1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c44b65af494e3ff..df9efa428f86aa9 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -62,6 +62,9 @@ static_assert(__AVIC_GATAG(AVIC_VM_ID_MASK, AVIC_VCPU_ID_MASK) == -1u);
static bool force_avic;
module_param_unsafe(force_avic, bool, 0444);

+static int avic_zen2_errata_workaround = -1;
+module_param(avic_zen2_errata_workaround, int, 0444);
+
/* Note:
* This hash table is used to map VM_ID to a struct kvm_svm,
* when handling AMD IOMMU GALOG notification to schedule in
@@ -1027,7 +1030,6 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)

void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
- u64 entry;
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
unsigned long flags;
@@ -1056,14 +1058,18 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
*/
spin_lock_irqsave(&svm->ir_list_lock, flags);

- entry = READ_ONCE(*(svm->avic_physical_id_cache));
- WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+ if (!avic_zen2_errata_workaround) {
+ u64 entry = READ_ONCE(*(svm->avic_physical_id_cache));

- entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
- entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
- entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+ WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+
+ entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+ entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
+ entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+ WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+ }

- WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);

spin_unlock_irqrestore(&svm->ir_list_lock, flags);
@@ -1071,7 +1077,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

void avic_vcpu_put(struct kvm_vcpu *vcpu)
{
- u64 entry;
+ u64 entry = 0;
struct vcpu_svm *svm = to_svm(vcpu);
unsigned long flags;

@@ -1084,11 +1090,13 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
* can't be scheduled out and thus avic_vcpu_{put,load}() can't run
* recursively.
*/
- entry = READ_ONCE(*(svm->avic_physical_id_cache));

- /* Nothing to do if IsRunning == '0' due to vCPU blocking. */
- if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
- return;
+ if (!avic_zen2_errata_workaround) {
+ /* Nothing to do if IsRunning == '0' due to vCPU blocking. */
+ entry = READ_ONCE(*(svm->avic_physical_id_cache));
+ if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+ return;
+ }

/*
* Take and hold the per-vCPU interrupt remapping lock while updating
@@ -1102,8 +1110,10 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)

avic_update_iommu_vcpu_affinity(vcpu, -1, 0);

- entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
- WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+ if (!avic_zen2_errata_workaround) {
+ entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+ WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+ }

spin_unlock_irqrestore(&svm->ir_list_lock, flags);

@@ -1217,5 +1227,17 @@ bool avic_hardware_setup(void)

amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);

+ if (avic_zen2_errata_workaround == -1) {
+
+ /* Assume that Zen1 and Zen2 have errata #1235 */
+ if (boot_cpu_data.x86 == 0x17)
+ avic_zen2_errata_workaround = 1;
+ else
+ avic_zen2_errata_workaround = 0;
+ }
+
+ if (avic_zen2_errata_workaround)
+ pr_info("Workaround for AVIC errata #1235 is enabled\n");
+
return true;
}
--
2.26.3

2023-09-29 00:21:14

by Maxim Levitsky

[permalink] [raw]
Subject: [PATCH 4/5] iommu/amd: skip updating the IRTE entry when is_run is already false

When vCPU affinity of an IRTE which already has
is_run == false, is updated and the update also sets is_run to false,
there is nothing to do.

The goal of this patch is to make a call to 'amd_iommu_update_ga()'
to be relatively cheap if there is nothing to do.

Signed-off-by: Maxim Levitsky <[email protected]>
---
drivers/iommu/amd/iommu.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 95bd7c25ba6f366..10bcd436e984672 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3774,6 +3774,15 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
entry->hi.fields.destination =
APICID_TO_IRTE_DEST_HI(cpu);
}
+
+ if (!is_run && !entry->lo.fields_vapic.is_run) {
+ /*
+ * No need to notify the IOMMU about an entry which
+ * already has is_run == False
+ */
+ return 0;
+ }
+
entry->lo.fields_vapic.is_run = is_run;

return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
--
2.26.3

2023-09-29 01:54:30

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH 4/5] iommu/amd: skip updating the IRTE entry when is_run is already false

On 28/09/2023 16:04, Maxim Levitsky wrote:
> When vCPU affinity of an IRTE which already has
> is_run == false, is updated and the update also sets is_run to false,
> there is nothing to do.
>
> The goal of this patch is to make a call to 'amd_iommu_update_ga()'
> to be relatively cheap if there is nothing to do.
>
> Signed-off-by: Maxim Levitsky <[email protected]>

Reviewed-by: Joao Martins <[email protected]>

> ---
> drivers/iommu/amd/iommu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 95bd7c25ba6f366..10bcd436e984672 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3774,6 +3774,15 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
> entry->hi.fields.destination =
> APICID_TO_IRTE_DEST_HI(cpu);
> }
> +
> + if (!is_run && !entry->lo.fields_vapic.is_run) {
> + /*
> + * No need to notify the IOMMU about an entry which
> + * already has is_run == False
> + */
> + return 0;
> + }
> +
> entry->lo.fields_vapic.is_run = is_run;
>
> return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,

2024-03-26 03:15:35

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH 0/5] AVIC bugfixes and workarounds

On Thu, Sep 28, 2023 at 8:05 AM Maxim Levitsky <[email protected]> wrote:
>
> Hi!
>
> This patch series includes several fixes to AVIC I found while working
> on a new version of nested AVIC code.
>
> Also while developing it I realized that a very simple workaround for
> AVIC's errata #1235 exists and included it in this patch series as well.
>
> Best regards,
> Maxim Levitsky

Can someone explain why we're still unwilling to enable AVIC by
default? Have the performance issues that plagued the Rome
implementation been fixed? What is AMD's guidance?

2024-03-26 16:00:26

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 0/5] AVIC bugfixes and workarounds

On Mon, 2024-03-25 at 20:15 -0700, Jim Mattson wrote:
> > On Thu, Sep 28, 2023 at 8:05 AM Maxim Levitsky <mlevitsk@redhatcom> wrote:
> > > >
> > > > Hi!
> > > >
> > > > This patch series includes several fixes to AVIC I found while working
> > > > on a new version of nested AVIC code.
> > > >
> > > > Also while developing it I realized that a very simple workaround for
> > > > AVIC's errata #1235 exists and included it in this patch series as well.
> > > >
> > > > Best regards,
> > > >         Maxim Levitsky
> >
> > Can someone explain why we're still unwilling to enable AVIC by
> > default? Have the performance issues that plagued the Rome
> > implementation been fixed? What is AMD's guidance?
> >
Hi

This is what I know:

Zen1:
I never tested it, so I don't know how well AVIC works there and if it has any erratas.

Zen2:
Has CPU errata in regard to IPI virtualization that makes it unusable in production,
but if AVIC's IPI virtualization (borrowing the Intel term here) is disabled,
then it works just fine and 1:1 equivalent to APICv without IPI.

I posted patches for this several times, latest version is here, it still applies I think:
https://lkml.iu.edu/hypermail/linux/kernel/2310.0/00790.html

Zen3:
For some reason AVIC got disabled by AMD in CPUID. It is still there though and force_avic=1 kvm_amd option
can make KVM use it and AFAIK it works just fine.

It is possible that it got disabled due to Zen2 errata that is fixed on Zen3,
but maybe AMD wasn't sure back then that it will be fixed or it might be due to performance issues with broadcast
IPIs which I think ended up being a software issue and was fixed a long time ago.

Zen4+
I haven't tested it much, but AFAIK it should work out of the box. It also got x2avic mode which allows
to use AVIC with VMs that have more that 254 vCPUs.

IMHO if we merge the workaround I have for IPI virtualization and make IPI virtualization off for Zen2
(and maybe Zen1 as well), then I don't see why we can't make AVIC be the default on.

Best regards,
Maxim Levitsky



2024-03-26 17:19:16

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH 0/5] AVIC bugfixes and workarounds

On 26/03/2024 15:59, [email protected] wrote:
> On Mon, 2024-03-25 at 20:15 -0700, Jim Mattson wrote:
>>> On Thu, Sep 28, 2023 at 8:05 AM Maxim Levitsky <[email protected]> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> This patch series includes several fixes to AVIC I found while working
>>>>> on a new version of nested AVIC code.
>>>>>
>>>>> Also while developing it I realized that a very simple workaround for
>>>>> AVIC's errata #1235 exists and included it in this patch series as well.
>>>>>
>>>>> Best regards,
>>>>>         Maxim Levitsky
>>>
>>> Can someone explain why we're still unwilling to enable AVIC by
>>> default? Have the performance issues that plagued the Rome
>>> implementation been fixed? What is AMD's guidance?
>>>
> Hi
>
> This is what I know:
>
> Zen1:
> I never tested it, so I don't know how well AVIC works there and if it has any erratas.
>
> Zen2:
> Has CPU errata in regard to IPI virtualization that makes it unusable in production,
> but if AVIC's IPI virtualization (borrowing the Intel term here) is disabled,
> then it works just fine and 1:1 equivalent to APICv without IPI.
>
> I posted patches for this several times, latest version is here, it still applies I think:
> https://lkml.iu.edu/hypermail/linux/kernel/2310.0/00790.html
>
> Zen3:
> For some reason AVIC got disabled by AMD in CPUID. It is still there though and force_avic=1 kvm_amd option
> can make KVM use it and AFAIK it works just fine.
>
> It is possible that it got disabled due to Zen2 errata that is fixed on Zen3,
> but maybe AMD wasn't sure back then that it will be fixed or it might be due to performance issues with broadcast
> IPIs which I think ended up being a software issue and was fixed a long time ago.
>
> Zen4+
> I haven't tested it much, but AFAIK it should work out of the box. It also got x2avic mode which allows
> to use AVIC with VMs that have more that 254 vCPUs.
>
> IMHO if we merge the workaround I have for IPI virtualization and make IPI virtualization off for Zen2
> (and maybe Zen1 as well), then I don't see why we can't make AVIC be the default on.

Additionally, I think right now with avic=1 it fails vcpu creation when creating
more a vcpu with an id bigger than what's supported i.e. the MAX_VCPUS we
currently advertised in UAPI isn't quite honored. So that's the only wrinkle at
least I am aware. Sean had send this one series to inhibit AVIC when such config:

https://lore.kernel.org/linux-iommu/[email protected]/#r

But I am not sure if it was respinned.

Joao