From: Maxim Levitsky <[email protected]>
No normal guest has any reason to change physical APIC IDs, and
allowing this introduces bugs into APIC acceleration code.
And Intel recent hardware just ignores writes to APIC_ID in
xAPIC mode. More background can be found at:
https://lore.kernel.org/lkml/[email protected]/
Looks there is no much value to support writable xAPIC ID in
guest except supporting some old and crazy use cases which
probably would fail on real hardware. So, make xAPIC ID
read-only for KVM guests.
Signed-off-by: Maxim Levitsky <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e4bcdab1fac0..b38288c8a94f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2044,10 +2044,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
switch (reg) {
case APIC_ID: /* Local APIC ID */
- if (!apic_x2apic_mode(apic))
- kvm_apic_set_xapic_id(apic, val >> 24);
- else
+ if (apic_x2apic_mode(apic)) {
ret = 1;
+ break;
+ }
+ /* Don't allow changing APIC ID to avoid unexpected issues */
+ if ((val >> 24) != apic->vcpu->vcpu_id) {
+ kvm_vm_bugged(apic->vcpu->kvm);
+ break;
+ }
+
+ kvm_apic_set_xapic_id(apic, val >> 24);
break;
case APIC_TASKPRI:
@@ -2631,11 +2638,15 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s, bool set)
{
- if (apic_x2apic_mode(vcpu->arch.apic)) {
- u32 *id = (u32 *)(s->regs + APIC_ID);
- u32 *ldr = (u32 *)(s->regs + APIC_LDR);
- u64 icr;
+ u32 *id = (u32 *)(s->regs + APIC_ID);
+ u32 *ldr = (u32 *)(s->regs + APIC_LDR);
+ u64 icr;
+ if (!apic_x2apic_mode(vcpu->arch.apic)) {
+ /* Don't allow changing APIC ID to avoid unexpected issues */
+ if ((*id >> 24) != vcpu->vcpu_id)
+ return -EINVAL;
+ } else {
if (vcpu->kvm->arch.x2apic_format) {
if (*id != vcpu->vcpu_id)
return -EINVAL;
--
2.27.0
On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> > From: Maxim Levitsky <
> > [email protected]
> > >
> >
> > No normal guest has any reason to change physical APIC IDs, and
> > allowing this introduces bugs into APIC acceleration code.
> >
> > And Intel recent hardware just ignores writes to APIC_ID in
> > xAPIC mode. More background can be found at:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> >
> > Looks there is no much value to support writable xAPIC ID in
> > guest except supporting some old and crazy use cases which
> > probably would fail on real hardware. So, make xAPIC ID
> > read-only for KVM guests.
> >
> > Signed-off-by: Maxim Levitsky <[email protected]>
> > Signed-off-by: Zeng Guang <[email protected]>
>
> Assuming that this is approved and accepted upstream,
> that is even better that my proposal of doing this
> when APICv is enabled.
>
> Since now apic id is always read only, now we should not
> forget to clean up some parts of kvm like kvm_recalculate_apic_map,
> which are not needed anymore
Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
a linear search over all the vCPUs when there isn't a 1:1
correspondence with the vCPU index?
On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> From: Maxim Levitsky <[email protected]>
>
> No normal guest has any reason to change physical APIC IDs, and
> allowing this introduces bugs into APIC acceleration code.
>
> And Intel recent hardware just ignores writes to APIC_ID in
> xAPIC mode. More background can be found at:
> https://lore.kernel.org/lkml/[email protected]/
>
> Looks there is no much value to support writable xAPIC ID in
> guest except supporting some old and crazy use cases which
> probably would fail on real hardware. So, make xAPIC ID
> read-only for KVM guests.
>
> Signed-off-by: Maxim Levitsky <[email protected]>
> Signed-off-by: Zeng Guang <[email protected]>
Assuming that this is approved and accepted upstream,
that is even better that my proposal of doing this
when APICv is enabled.
Since now apic id is always read only, now we should not
forget to clean up some parts of kvm like kvm_recalculate_apic_map,
which are not needed anymore.
Best regards,
Maxim Levitsky
> ---
> arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
> 1 file changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e4bcdab1fac0..b38288c8a94f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2044,10 +2044,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>
> switch (reg) {
> case APIC_ID: /* Local APIC ID */
> - if (!apic_x2apic_mode(apic))
> - kvm_apic_set_xapic_id(apic, val >> 24);
> - else
> + if (apic_x2apic_mode(apic)) {
> ret = 1;
> + break;
> + }
> + /* Don't allow changing APIC ID to avoid unexpected issues */
> + if ((val >> 24) != apic->vcpu->vcpu_id) {
> + kvm_vm_bugged(apic->vcpu->kvm);
> + break;
> + }
> +
> + kvm_apic_set_xapic_id(apic, val >> 24);
> break;
>
> case APIC_TASKPRI:
> @@ -2631,11 +2638,15 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> struct kvm_lapic_state *s, bool set)
> {
> - if (apic_x2apic_mode(vcpu->arch.apic)) {
> - u32 *id = (u32 *)(s->regs + APIC_ID);
> - u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> - u64 icr;
> + u32 *id = (u32 *)(s->regs + APIC_ID);
> + u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> + u64 icr;
>
> + if (!apic_x2apic_mode(vcpu->arch.apic)) {
> + /* Don't allow changing APIC ID to avoid unexpected issues */
> + if ((*id >> 24) != vcpu->vcpu_id)
> + return -EINVAL;
> + } else {
> if (vcpu->kvm->arch.x2apic_format) {
> if (*id != vcpu->vcpu_id)
> return -EINVAL;
On Fri, 2022-02-25 at 14:56 +0000, David Woodhouse wrote:
> On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> > On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> > > From: Maxim Levitsky <
> > > [email protected]
> > >
> > > No normal guest has any reason to change physical APIC IDs, and
> > > allowing this introduces bugs into APIC acceleration code.
> > >
> > > And Intel recent hardware just ignores writes to APIC_ID in
> > > xAPIC mode. More background can be found at:
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > >
> > > Looks there is no much value to support writable xAPIC ID in
> > > guest except supporting some old and crazy use cases which
> > > probably would fail on real hardware. So, make xAPIC ID
> > > read-only for KVM guests.
> > >
> > > Signed-off-by: Maxim Levitsky <[email protected]>
> > > Signed-off-by: Zeng Guang <[email protected]>
> >
> > Assuming that this is approved and accepted upstream,
> > that is even better that my proposal of doing this
> > when APICv is enabled.
> >
> > Since now apic id is always read only, now we should not
> > forget to clean up some parts of kvm like kvm_recalculate_apic_map,
> > which are not needed anymore
>
> Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
> a linear search over all the vCPUs when there isn't a 1:1
> correspondence with the vCPU index?
I don't think so since vcpu id can still be set by userspace to anything,
and this is even used to encode topology in it.
However a hash table can still be used there to speed it up regardless of
read-only apic id IMHO.
Or, even better than a hash table, I see that KVM already
limits vcpu_id to KVM_MAX_VCPUS * 4 with a comment that only two extra
bits of topology are used:
"In the worst case, we'll need less than one extra bit for the
* Core ID, and less than one extra bit for the Package (Die) ID,
* so ratio of 4 should be enough"
Thus, we could in theory standardize location of those bits in apic_id
(even with a new KVM extension and do linear search for legacy userspace),
and then just mask/shift the topology bits.
The kvm extension would be defining how many low (or high?) bits of vcpu_id
are topology bits.
Best regards,
Maxim Levitsky
On Fri, 2022-02-25 at 15:42 +0000, David Woodhouse wrote:
> On Fri, 2022-02-25 at 17:11 +0200, Maxim Levitsky wrote:
> > On Fri, 2022-02-25 at 14:56 +0000, David Woodhouse wrote:
> > > On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> > > > Assuming that this is approved and accepted upstream,
> > > > that is even better that my proposal of doing this
> > > > when APICv is enabled.
> > > >
> > > > Since now apic id is always read only, now we should not
> > > > forget to clean up some parts of kvm like kvm_recalculate_apic_map,
> > > > which are not needed anymore
> > >
> > > Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
> > > a linear search over all the vCPUs when there isn't a 1:1
> > > correspondence with the vCPU index?
> >
> > I don't think so since vcpu id can still be set by userspace to anything,
> > and this is even used to encode topology in it.
>
> Yes, but it can only be set at vCPU creation time and it has to be
> unique.
>
> > However a hash table can still be used there to speed it up regardless of
> > read-only apic id IMHO.
> >
> > Or, even better than a hash table, I see that KVM already
> > limits vcpu_id to KVM_MAX_VCPUS * 4 with a comment that only two extra
> > bits of topology are used:
>
> We already have the kvm_apic_map which provides a fast lookup. The key
> point here is that the APIC ID can't *change* from vcpu->vcpu_id any
> more, so we can actually use the kvm_apic_map for kvm_get_vcpu_by_id()
> now, can't we?
>
Right! I wrote my response partially when I still assumed that vcpu_id
can be any 32 bit number (thus hash table),
and later checked that it is capped by KVM_MAX_VCPUS * 4 which isn't a big number,
plus as I now see in the kvm_recalculate_apic_map
the map is dynamically allocated up to the max apic id.
(technically speaking an array is a hash table).
Now the map would only be needed to be rebuit few times when new vCPUs are added,
and can be used to locate vcpu by its apic id.
I so hope that this patch is accepted so all of this could be done.
Best regards,
Maxim Levitsky
On Fri, 2022-02-25 at 17:11 +0200, Maxim Levitsky wrote:
> On Fri, 2022-02-25 at 14:56 +0000, David Woodhouse wrote:
> > On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> > > Assuming that this is approved and accepted upstream,
> > > that is even better that my proposal of doing this
> > > when APICv is enabled.
> > >
> > > Since now apic id is always read only, now we should not
> > > forget to clean up some parts of kvm like kvm_recalculate_apic_map,
> > > which are not needed anymore
> >
> > Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
> > a linear search over all the vCPUs when there isn't a 1:1
> > correspondence with the vCPU index?
>
> I don't think so since vcpu id can still be set by userspace to anything,
> and this is even used to encode topology in it.
Yes, but it can only be set at vCPU creation time and it has to be
unique.
> However a hash table can still be used there to speed it up regardless of
> read-only apic id IMHO.
>
> Or, even better than a hash table, I see that KVM already
> limits vcpu_id to KVM_MAX_VCPUS * 4 with a comment that only two extra
> bits of topology are used:
We already have the kvm_apic_map which provides a fast lookup. The key
point here is that the APIC ID can't *change* from vcpu->vcpu_id any
more, so we can actually use the kvm_apic_map for kvm_get_vcpu_by_id()
now, can't we?
On Fri, Feb 25, 2022 at 04:46:31PM +0200, Maxim Levitsky wrote:
>On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
>> From: Maxim Levitsky <[email protected]>
>>
>> No normal guest has any reason to change physical APIC IDs, and
>> allowing this introduces bugs into APIC acceleration code.
>>
>> And Intel recent hardware just ignores writes to APIC_ID in
>> xAPIC mode. More background can be found at:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Looks there is no much value to support writable xAPIC ID in
>> guest except supporting some old and crazy use cases which
>> probably would fail on real hardware. So, make xAPIC ID
>> read-only for KVM guests.
>>
>> Signed-off-by: Maxim Levitsky <[email protected]>
>> Signed-off-by: Zeng Guang <[email protected]>
>
>Assuming that this is approved and accepted upstream,
>that is even better that my proposal of doing this
>when APICv is enabled.
Sean & Paolo
what's your opinion? Shall we make xAPIC ID read-only unconditionally
or just when enable_apicv is enabled or use a parameter to control
it as Sean suggested at
https://lore.kernel.org/lkml/[email protected]/
Intel SDM Vol3 10.4.6 Local APID ID says:
In MP systems, the local APIC ID is also used as a processor ID by the
BIOS and the operating system. Some processors permit software to
modify the APIC ID. However, the ability of software to modify the APIC
ID is processor model specific. Because of this, operating system
software should avoid writing to the local APIC ID register.
So, we think it is fine to make xAPIC ID always read-only. Instead of
supporting writable xAPIC ID in KVM guest, it may be better to fix software
that modify local APIC ID.
Intel IPI virtualization and AVIC are two cases that requires special
handling if xAPIC ID is writable. But it doesn't worth the effort and
is error-prone (e.g., AVIC is broken if guest changed its APIC ID).
>
>Since now apic id is always read only, now we should not
>forget to clean up some parts of kvm like kvm_recalculate_apic_map,
>which are not needed anymore.
Do you mean marking apic_map as DIRTY isn't needed in some cases?
Or some cleanup should be done inside kvm_recalculate_apic_map()?
For the former, I think we can ...
>
>Best regards,
> Maxim Levitsky
>
>> ---
>> arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
>> 1 file changed, 18 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index e4bcdab1fac0..b38288c8a94f 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2044,10 +2044,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>>
>> switch (reg) {
>> case APIC_ID: /* Local APIC ID */
>> - if (!apic_x2apic_mode(apic))
>> - kvm_apic_set_xapic_id(apic, val >> 24);
>> - else
>> + if (apic_x2apic_mode(apic)) {
>> ret = 1;
>> + break;
>> + }
>> + /* Don't allow changing APIC ID to avoid unexpected issues */
>> + if ((val >> 24) != apic->vcpu->vcpu_id) {
>> + kvm_vm_bugged(apic->vcpu->kvm);
>> + break;
>> + }
>> +
>> + kvm_apic_set_xapic_id(apic, val >> 24);
... drop the kvm_apic_set_xapic_id().
On Fri, Feb 25, 2022, Zeng Guang wrote:
> From: Maxim Levitsky <[email protected]>
>
> No normal guest has any reason to change physical APIC IDs,
I don't think we can reasonably assume this, my analysis in the link (that I just
realized I deleted from context here) shows it's at least plausible that an existing
guest could rely on the APIC ID being writable. And that's just one kernel, who
know what else is out there, especially given that people use KVM to emulate really
old stuff, often on really old hardware.
Practically speaking, anyone that wants to deploy IPIv is going to have to make
the switch at some point, but that doesn't help people running legacy crud that
don't care about IPIv.
I was thinking a module param would be trivial, and it is (see below) if the
param is off by default. A module param will also provide a convenient opportunity
to resolve the loophole reported by Maxim[1][2], though it's a bit funky.
Anyways, with an off-by-default module param, we can just do:
if (!enable_apicv || !cpu_has_vmx_ipiv() || !xapic_id_readonly)
enable_ipiv = false;
Forcing userspace to take advantage of IPIv is rather annoying, but it's not the
end of world.
Having the param on by default is a mess. Either we break userspace (above), or
we only kinda break userspace by having it on iff IPIv is on, but then we end up
with cyclical dependency hell. E.g. userspace makes xAPIC ID writable and forces
on IPIv, which one "wins"? And if it's on by default, we can't fix the loophole
in KVM_SET_LAPIC.
If we really wanted to have it on by default, we could have a Kconfig and make
_that_ off by default, e.g.
static bool __read_mostly xapic_id_readonly = IS_ENABLED(CONFING_KVM_XAPIC_ID_RO);
but that seems like overkill. If a kernel owner knows they want the param on,
it should be easy enough to force it without a Kconfig.
So I think my vote would be for something like this? Compile tested only...
---
arch/x86/kvm/lapic.c | 14 +++++++++-----
arch/x86/kvm/svm/avic.c | 5 +++++
arch/x86/kvm/x86.c | 4 ++++
arch/x86/kvm/x86.h | 1 +
4 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c4c3155d98db..2c01cd45fb18 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2043,7 +2043,7 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
switch (reg) {
case APIC_ID: /* Local APIC ID */
- if (!apic_x2apic_mode(apic))
+ if (!apic_x2apic_mode(apic) && !xapic_id_readonly)
kvm_apic_set_xapic_id(apic, val >> 24);
else
ret = 1;
@@ -2634,10 +2634,7 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
u32 *ldr = (u32 *)(s->regs + APIC_LDR);
u64 icr;
- if (vcpu->kvm->arch.x2apic_format) {
- if (*id != vcpu->vcpu_id)
- return -EINVAL;
- } else {
+ if (!vcpu->kvm->arch.x2apic_format) {
if (set)
*id >>= 24;
else
@@ -2650,6 +2647,10 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
* split to ICR+ICR2 in userspace for backwards compatibility.
*/
if (set) {
+ if ((vcpu->kvm->arch.x2apic_format || xapic_id_readonly) &&
+ (*id != vcpu->vcpu_id))
+ return -EINVAL;
+
*ldr = kvm_apic_calc_x2apic_ldr(*id);
icr = __kvm_lapic_get_reg(s->regs, APIC_ICR) |
@@ -2659,6 +2660,9 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR);
__kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32);
}
+ } else if (set && xapic_id_readonly &&
+ (__kvm_lapic_get_reg(s->regs, APIC_ID) >> 24) != vcpu->vcpu_id) {
+ return -EINVAL;
}
return 0;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index b37b353ec086..4a031d9686c2 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -442,6 +442,11 @@ static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
u32 id = kvm_xapic_id(vcpu->arch.apic);
+ if (xapic_id_readonly && id != vcpu->vcpu_id) {
+ kvm_prepare_emulation_failure_exit(vcpu);
+ return 0;
+ }
+
if (vcpu->vcpu_id == id)
return 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4fa4d8269e5b..67706d468ed3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -177,6 +177,10 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
static int __read_mostly lapic_timer_advance_ns = -1;
module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);
+bool __read_mostly xapic_id_readonly;
+module_param(xapic_id_readonly, bool, 0444);
+EXPORT_SYMBOL_GPL(xapic_id_readonly);
+
static bool __read_mostly vector_hashing = true;
module_param(vector_hashing, bool, S_IRUGO);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index aa86abad914d..89f40c921c08 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -302,6 +302,7 @@ static inline bool kvm_mpx_supported(void)
extern unsigned int min_timer_period_us;
extern bool enable_vmware_backdoor;
+extern bool xapic_id_readonly;
extern int pi_inject_timer;
base-commit: 1e147f6f90668f2c2b57406d451f0cfcd2ba19d0
--
On Tue, Mar 08, 2022 at 11:04:01PM +0000, Sean Christopherson wrote:
>On Fri, Feb 25, 2022, Zeng Guang wrote:
>> From: Maxim Levitsky <[email protected]>
>>
>> No normal guest has any reason to change physical APIC IDs,
>
>I don't think we can reasonably assume this, my analysis in the link (that I just
>realized I deleted from context here) shows it's at least plausible that an existing
>guest could rely on the APIC ID being writable. And that's just one kernel, who
>know what else is out there, especially given that people use KVM to emulate really
>old stuff, often on really old hardware.
Making xAPIC ID readonly is not only based on your analysis, but also Intel SDM
clearly saying writable xAPIC ID is processor model specific and ***software should
avoid writing to xAPIC ID***.
If writable xAPIC ID support should be retained and is tied to a module param,
live migration would depend on KVM's module params: e.g., migrate a VM with
modified xAPIC ID (apic_id_readonly off on this system) to one with
xapic_id_readonly on would fail, right? Is this failure desired? if not, we need to
have a VM-scope control. e.g., add an inhibitor of APICv (XAPIC_ID_MODIFIED) and
disable APICv forever for this VM if its vCPUs or QEMU modifies xAPIC ID.
>
>Practically speaking, anyone that wants to deploy IPIv is going to have to make
>the switch at some point, but that doesn't help people running legacy crud that
>don't care about IPIv.
>
>I was thinking a module param would be trivial, and it is (see below) if the
>param is off by default. A module param will also provide a convenient opportunity
>to resolve the loophole reported by Maxim[1][2], though it's a bit funky.
Could you share the links?
>
>Anyways, with an off-by-default module param, we can just do:
>
> if (!enable_apicv || !cpu_has_vmx_ipiv() || !xapic_id_readonly)
> enable_ipiv = false;
>
>Forcing userspace to take advantage of IPIv is rather annoying, but it's not the
>end of world.
>
>Having the param on by default is a mess. Either we break userspace (above), or
>we only kinda break userspace by having it on iff IPIv is on, but then we end up
>with cyclical dependency hell. E.g. userspace makes xAPIC ID writable and forces
>on IPIv, which one "wins"? And if it's on by default, we can't fix the loophole
>in KVM_SET_LAPIC.
We are fine with having this param off by default.
TL;DR: Maxim, any objection to yet another inhibit? Any potential issues you can think of?
On Wed, Mar 09, 2022, Chao Gao wrote:
> On Tue, Mar 08, 2022 at 11:04:01PM +0000, Sean Christopherson wrote:
> >On Fri, Feb 25, 2022, Zeng Guang wrote:
> >> From: Maxim Levitsky <[email protected]>
> >>
> >> No normal guest has any reason to change physical APIC IDs,
> >
> >I don't think we can reasonably assume this, my analysis in the link (that I just
> >realized I deleted from context here) shows it's at least plausible that an existing
> >guest could rely on the APIC ID being writable. And that's just one kernel, who
> >know what else is out there, especially given that people use KVM to emulate really
> >old stuff, often on really old hardware.
>
> Making xAPIC ID readonly is not only based on your analysis, but also Intel SDM
> clearly saying writable xAPIC ID is processor model specific and ***software should
> avoid writing to xAPIC ID***.
Intel isn't the only vendor KVM supports, and xAPIC ID is fully writable according
to AMD's docs and AMD's hardware. x2APIC is even (indirectly) writable, but luckily
KVM has never modeled that...
Don't get me wrong, I would love to make xAPIC ID read-only, and I fully agree
that the probability of breaking someone's setup is very low, I just don't think
the benefits of forcing it are worth the risk of breaking userspace.
> If writable xAPIC ID support should be retained and is tied to a module param,
> live migration would depend on KVM's module params: e.g., migrate a VM with
> modified xAPIC ID (apic_id_readonly off on this system) to one with
> xapic_id_readonly on would fail, right? Is this failure desired?
Hrm, I was originally thinking it's not a terrible outcome, but I was assuming
that userspace would gracefully handle migration failure. That's a bad assumption.
> if not, we need to have a VM-scope control. e.g., add an inhibitor of APICv
> (XAPIC_ID_MODIFIED) and disable APICv forever for this VM if its vCPUs or
> QEMU modifies xAPIC ID.
Inhibiting APICv if IPIv is enabled (implied for AMD's AVIC) is probably a better
option than a module param. I was worried about ending up with silently degraded
VM performance, but that's easily solved by adding a stat to track APICv inhibitions,
which would be useful for other cases too (getting AMD's AVIC enabled is comically
difficult).
That would also let us drop the code buggy avic_handle_apic_id_update().
And it wouldn't necessarily have to be forever, though I agree that's a perfectly
fine approach until we have data that shows anything fancier is necessary.
> >Practically speaking, anyone that wants to deploy IPIv is going to have to make
> >the switch at some point, but that doesn't help people running legacy crud that
> >don't care about IPIv.
> >
> >I was thinking a module param would be trivial, and it is (see below) if the
> >param is off by default. A module param will also provide a convenient opportunity
> >to resolve the loophole reported by Maxim[1][2], though it's a bit funky.
>
> Could you share the links?
Doh, sorry (they're both in this one).
https://lore.kernel.org/all/[email protected]
On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> TL;DR: Maxim, any objection to yet another inhibit? Any potential issues you can think of?
>
> On Wed, Mar 09, 2022, Chao Gao wrote:
> > On Tue, Mar 08, 2022 at 11:04:01PM +0000, Sean Christopherson wrote:
> > > On Fri, Feb 25, 2022, Zeng Guang wrote:
> > > > From: Maxim Levitsky <[email protected]>
> > > >
> > > > No normal guest has any reason to change physical APIC IDs,
> > >
> > > I don't think we can reasonably assume this, my analysis in the link (that I just
> > > realized I deleted from context here) shows it's at least plausible that an existing
> > > guest could rely on the APIC ID being writable. And that's just one kernel, who
> > > know what else is out there, especially given that people use KVM to emulate really
> > > old stuff, often on really old hardware.
> >
> > Making xAPIC ID readonly is not only based on your analysis, but also Intel SDM
> > clearly saying writable xAPIC ID is processor model specific and ***software should
> > avoid writing to xAPIC ID***.
>
> Intel isn't the only vendor KVM supports, and xAPIC ID is fully writable according
> to AMD's docs and AMD's hardware. x2APIC is even (indirectly) writable, but luckily
> KVM has never modeled that...
>
> Don't get me wrong, I would love to make xAPIC ID read-only, and I fully agree
> that the probability of breaking someone's setup is very low, I just don't think
> the benefits of forcing it are worth the risk of breaking userspace.
>
> > If writable xAPIC ID support should be retained and is tied to a module param,
> > live migration would depend on KVM's module params: e.g., migrate a VM with
> > modified xAPIC ID (apic_id_readonly off on this system) to one with
> > xapic_id_readonly on would fail, right? Is this failure desired?
>
> Hrm, I was originally thinking it's not a terrible outcome, but I was assuming
> that userspace would gracefully handle migration failure. That's a bad assumption.
>
> > if not, we need to have a VM-scope control. e.g., add an inhibitor of APICv
> > (XAPIC_ID_MODIFIED) and disable APICv forever for this VM if its vCPUs or
> > QEMU modifies xAPIC ID.
>
> Inhibiting APICv if IPIv is enabled (implied for AMD's AVIC) is probably a better
> option than a module param. I was worried about ending up with silently degraded
> VM performance, but that's easily solved by adding a stat to track APICv inhibitions,
> which would be useful for other cases too (getting AMD's AVIC enabled is comically
> difficult).
>
> That would also let us drop the code buggy avic_handle_apic_id_update().
>
> And it wouldn't necessarily have to be forever, though I agree that's a perfectly
> fine approach until we have data that shows anything fancier is necessary.
>
> > > Practically speaking, anyone that wants to deploy IPIv is going to have to make
> > > the switch at some point, but that doesn't help people running legacy crud that
> > > don't care about IPIv.
> > >
> > > I was thinking a module param would be trivial, and it is (see below) if the
> > > param is off by default. A module param will also provide a convenient opportunity
> > > to resolve the loophole reported by Maxim[1][2], though it's a bit funky.
> >
> > Could you share the links?
>
> Doh, sorry (they're both in this one).
>
> https://lore.kernel.org/all/[email protected]
>
My opinion on this subject is very simple: we need to draw the line somewhere.
There is balance between supporting (poorly) unused hardware features and
not supporting them at all.
Writable APIC id is not just some legacy feature like task switch but a
feature that is frowned upon in both Intel and AMD manual:
Yes, look at AMD's SDM at:
"16.3.3 Local APIC ID
Unique local APIC IDs are assigned to each CPU core in the system. The value is determined by
hardware, based on the number of CPU cores on the processor and the node ID of the processor.
The APIC ID is located in the APIC ID register at APIC offset 20h. See Figure 16-3. It is model
dependent, whether software can modify the APIC ID Register. The initial value of the APIC ID (after
a reset) is the value returned in CPUID function 0000_0001h_EBX[31:24]."
Also in section '16.12 x2APIC_ID' of SDM it is mentioned:
RDMSR. An RDMSR of MSR 0802h returns the x2APIC_ID in EAX[31:0]. The x2APIC_ID is a
read-only register.
Attempting to write MSR 802h or attempting to read this MSR when not in x2APIC
mode causes a #GP(0) exception. See 16.11 “Accessing x2APIC Registers”.
CPUID. The x2APIC ID is reported by the following CPUID functions Fn0000_000B (Extended
Topology Enumeration) and CPUID Fn8000_001E (Extended APIC ID) as follows:
From this you can also infer that x2apic id is assigned once on boot, and same value is
reported in CPUID and in MSR 0x0802.
The fact that one can outsmart the microcode, change apic id and then switch to
x2apic mode, is more a like a microcode bug that a feature IMHO, that nobody
bothered to fix since nobody uses it.
Sean, please don't get me wrong, I used to think differently on this -
I implemented PDPTRS migration in kvm, although in retrospect I probably shouldn't.
I also understand your concerns - and I am not going to fight over this, a module
param for read only apic id, will work for me.
All I wanted to do is to make KVM better by simplifying it - KVM is already as complex
as it can get, anything to make it simpler is welcome IMHO.
Best regards,
Maxim Levitsky
On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > Could you share the links?
> >
> > Doh, sorry (they're both in this one).
> >
> > https://lore.kernel.org/all/[email protected]
> >
>
> My opinion on this subject is very simple: we need to draw the line somewhere.
...
> I also understand your concerns - and I am not going to fight over this, a module
> param for read only apic id, will work for me.
Sadly, I don't think a module param would actually help. I was thinking it would
avoid breakage by allowing for graceful fallback on migration failure, but that
was wishful thinking. An inhibit seems like the least awful idea if we don't end
up making it unconditionally readonly.
> All I wanted to do is to make KVM better by simplifying it - KVM is already
> as complex as it can get, anything to make it simpler is welcome IMHO.
I agree that simplifying KVM is a goal, and that we need to decide when enough is
enough. But we also can't break userspace or existing deployments, that's a very
clearly drawn line in Linux.
My biggest worry is that, unlike the KVM_SET_CPUID2 breakage, which was obvious
and came relatively quick, this could cause breakage at the worst possible time
(migration) months or years down the road.
Since the goal is to simplify KVM, can we try the inhibit route and see what the
code looks like before making a decision? I think it might actually yield a less
awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
we don't try to remove the inhibit on subsequent changes.
Killing the VM, as proposed, is very user unfriendly as the user will have no idea
why the VM was killed. WARN is out of the question because this is user triggerable.
Returning an emulation error would be ideal, but getting that result up through
apic_mmio_write() could be annoying and end up being more complex.
The touchpoints will all be the same, unless I'm missing something the difference
should only be a call to set an inhibit instead killing the VM.
On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
>
> On 3/11/2022 12:26 PM, Sean Christopherson wrote:
> > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > > > Could you share the links?
> > > >
> > > > Doh, sorry (they're both in this one).
> > > >
> > > > https://lore.kernel.org/all/[email protected]
> > > >
> > > >
> > >
> > > My opinion on this subject is very simple: we need to draw the line somewhere.
> >
> > ...
> >
> >
> > Since the goal is to simplify KVM, can we try the inhibit route and see what the
> > code looks like before making a decision? I think it might actually yield a less
> > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
> > we don't try to remove the inhibit on subsequent changes.
> >
> > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
> > why the VM was killed. WARN is out of the question because this is user triggerable.
> > Returning an emulation error would be ideal, but getting that result up through
> > apic_mmio_write() could be annoying and end up being more complex.
> >
> > The touchpoints will all be the same, unless I'm missing something the difference
> > should only be a call to set an inhibit instead killing the VM.
>
> Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate
> APICv once KVM guest would try to change APIC ID in xapic mode, and same
> sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep
> alive but obviously lose benefit from hardware acceleration in this way.
>
> So how do you think the proposal like this ?
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6dcccb304775..30d825c069be 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
> #define APICV_INHIBIT_REASON_X2APIC 5
> #define APICV_INHIBIT_REASON_BLOCKIRQ 6
> #define APICV_INHIBIT_REASON_ABSENT 7
> +#define APICV_INHIBIT_REASON_APICID_CHG 8
>
> struct kvm_arch {
> unsigned long n_used_mmu_pages;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 22929b5b3f9b..66cd54fa4515 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>
> switch (reg) {
> case APIC_ID: /* Local APIC ID */
> - if (!apic_x2apic_mode(apic))
> - kvm_apic_set_xapic_id(apic, val >> 24);
> - else
> + if (apic_x2apic_mode(apic)) {
> ret = 1;
> + break;
> + }
> + /*
> + * If changing APIC ID with any APIC acceleration enabled,
> + * deactivate APICv to avoid unexpected issues.
> + */
> + if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
> + kvm_request_apicv_update(apic->vcpu->kvm,
> + false, APICV_INHIBIT_REASON_APICID_CHG);
> +
> + kvm_apic_set_xapic_id(apic, val >> 24);
> break;
>
> case APIC_TASKPRI:
> @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> struct kvm_lapic_state *s, bool set)
> {
> - if (apic_x2apic_mode(vcpu->arch.apic)) {
> - u32 *id = (u32 *)(s->regs + APIC_ID);
> - u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> - u64 icr;
> + u32 *id = (u32 *)(s->regs + APIC_ID);
> + u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> + u64 icr;
> + if (!apic_x2apic_mode(vcpu->arch.apic)) {
> + /*
> + * If APIC ID changed with any APIC acceleration enabled,
> + * deactivate APICv to avoid unexpected issues.
> + */
> + if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
> + kvm_request_apicv_update(vcpu->kvm,
> + false, APICV_INHIBIT_REASON_APICID_CHG);
> + } else {
> if (vcpu->kvm->arch.x2apic_format) {
> if (*id != vcpu->vcpu_id)
> return -EINVAL;
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 82d56f8055de..f78754bdc1d0 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
> BIT(APICV_INHIBIT_REASON_IRQWIN) |
> BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
> BIT(APICV_INHIBIT_REASON_X2APIC) |
> - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> + BIT(APICV_INHIBIT_REASON_APICID_CHG);
>
> return supported & BIT(bit);
> }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7beba7a9f247..91265f0784bd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
> BIT(APICV_INHIBIT_REASON_ABSENT) |
> BIT(APICV_INHIBIT_REASON_HYPERV) |
> - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> + BIT(APICV_INHIBIT_REASON_APICID_CHG);
>
> return supported & BIT(bit);
> }
>
>
>
This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
because migration happens.
Best regards,
Maxim Levitsky
On Sun, Mar 13, 2022 at 12:59:36PM +0200, Maxim Levitsky wrote:
>On Sun, 2022-03-13 at 11:19 +0200, Maxim Levitsky wrote:
>> On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
>> > On 3/11/2022 12:26 PM, Sean Christopherson wrote:
>> > > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
>> > > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
>> > > > > > Could you share the links?
>> > > > >
>> > > > > Doh, sorry (they're both in this one).
>> > > > >
>> > > > > https://lore.kernel.org/all/[email protected]
>> > > > >
>> > > > >
>> > > >
>> > > > My opinion on this subject is very simple: we need to draw the line somewhere.
>> > >
>> > > ...
>> > >
>> > >
>> > > Since the goal is to simplify KVM, can we try the inhibit route and see what the
>> > > code looks like before making a decision? I think it might actually yield a less
>> > > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
>> > > we don't try to remove the inhibit on subsequent changes.
>> > >
>> > > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
>> > > why the VM was killed. WARN is out of the question because this is user triggerable.
>> > > Returning an emulation error would be ideal, but getting that result up through
>> > > apic_mmio_write() could be annoying and end up being more complex.
>> > >
>> > > The touchpoints will all be the same, unless I'm missing something the difference
>> > > should only be a call to set an inhibit instead killing the VM.
>> >
>> > Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate
>> > APICv once KVM guest would try to change APIC ID in xapic mode, and same
>> > sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep
>> > alive but obviously lose benefit from hardware acceleration in this way.
>> >
>> > So how do you think the proposal like this ?
>> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> > index 6dcccb304775..30d825c069be 100644
>> > --- a/arch/x86/include/asm/kvm_host.h
>> > +++ b/arch/x86/include/asm/kvm_host.h
>> > @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
>> > #define APICV_INHIBIT_REASON_X2APIC 5
>> > #define APICV_INHIBIT_REASON_BLOCKIRQ 6
>> > #define APICV_INHIBIT_REASON_ABSENT 7
>> > +#define APICV_INHIBIT_REASON_APICID_CHG 8
>> >
>> > struct kvm_arch {
>> > unsigned long n_used_mmu_pages;
>> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> > index 22929b5b3f9b..66cd54fa4515 100644
>> > --- a/arch/x86/kvm/lapic.c
>> > +++ b/arch/x86/kvm/lapic.c
>> > @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>> >
>> > switch (reg) {
>> > case APIC_ID: /* Local APIC ID */
>> > - if (!apic_x2apic_mode(apic))
>> > - kvm_apic_set_xapic_id(apic, val >> 24);
>> > - else
>> > + if (apic_x2apic_mode(apic)) {
>> > ret = 1;
>> > + break;
>> > + }
>> > + /*
>> > + * If changing APIC ID with any APIC acceleration enabled,
>> > + * deactivate APICv to avoid unexpected issues.
>> > + */
>> > + if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
>> > + kvm_request_apicv_update(apic->vcpu->kvm,
>> > + false, APICV_INHIBIT_REASON_APICID_CHG);
>> > +
>> > + kvm_apic_set_xapic_id(apic, val >> 24);
>> > break;
>> >
>> > case APIC_TASKPRI:
>> > @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>> > static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
>> > struct kvm_lapic_state *s, bool set)
>> > {
>> > - if (apic_x2apic_mode(vcpu->arch.apic)) {
>> > - u32 *id = (u32 *)(s->regs + APIC_ID);
>> > - u32 *ldr = (u32 *)(s->regs + APIC_LDR);
>> > - u64 icr;
>> > + u32 *id = (u32 *)(s->regs + APIC_ID);
>> > + u32 *ldr = (u32 *)(s->regs + APIC_LDR);
>> > + u64 icr;
>> > + if (!apic_x2apic_mode(vcpu->arch.apic)) {
>> > + /*
>> > + * If APIC ID changed with any APIC acceleration enabled,
>> > + * deactivate APICv to avoid unexpected issues.
>> > + */
>> > + if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
>> > + kvm_request_apicv_update(vcpu->kvm,
>> > + false, APICV_INHIBIT_REASON_APICID_CHG);
>> > + } else {
>> > if (vcpu->kvm->arch.x2apic_format) {
>> > if (*id != vcpu->vcpu_id)
>> > return -EINVAL;
>> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
>> > index 82d56f8055de..f78754bdc1d0 100644
>> > --- a/arch/x86/kvm/svm/avic.c
>> > +++ b/arch/x86/kvm/svm/avic.c
>> > @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
>> > BIT(APICV_INHIBIT_REASON_IRQWIN) |
>> > BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
>> > BIT(APICV_INHIBIT_REASON_X2APIC) |
>> > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
>> > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
>> > + BIT(APICV_INHIBIT_REASON_APICID_CHG);
>> >
>> > return supported & BIT(bit);
>> > }
>> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> > index 7beba7a9f247..91265f0784bd 100644
>> > --- a/arch/x86/kvm/vmx/vmx.c
>> > +++ b/arch/x86/kvm/vmx/vmx.c
>> > @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>> > ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
>> > BIT(APICV_INHIBIT_REASON_ABSENT) |
>> > BIT(APICV_INHIBIT_REASON_HYPERV) |
>> > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
>> > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
>> > + BIT(APICV_INHIBIT_REASON_APICID_CHG);
>> >
>> > return supported & BIT(bit);
>> > }
>> >
>> >
>> >
>>
>> This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
>> because migration happens.
>
>I mean because host decided to change its apic id, which it can in theory do any time,
>even after the nested guest has started. Seriously, the only reason guest has to change apic id,
>is to try to exploit some security hole.
Hi
Thanks for the information.
IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
It looks like a generic issue in dynamically toggling APICv scheme,
e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
On Sun, 2022-03-13 at 11:19 +0200, Maxim Levitsky wrote:
> On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
> > On 3/11/2022 12:26 PM, Sean Christopherson wrote:
> > > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> > > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > > > > Could you share the links?
> > > > >
> > > > > Doh, sorry (they're both in this one).
> > > > >
> > > > > https://lore.kernel.org/all/[email protected]
> > > > >
> > > > >
> > > >
> > > > My opinion on this subject is very simple: we need to draw the line somewhere.
> > >
> > > ...
> > >
> > >
> > > Since the goal is to simplify KVM, can we try the inhibit route and see what the
> > > code looks like before making a decision? I think it might actually yield a less
> > > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
> > > we don't try to remove the inhibit on subsequent changes.
> > >
> > > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
> > > why the VM was killed. WARN is out of the question because this is user triggerable.
> > > Returning an emulation error would be ideal, but getting that result up through
> > > apic_mmio_write() could be annoying and end up being more complex.
> > >
> > > The touchpoints will all be the same, unless I'm missing something the difference
> > > should only be a call to set an inhibit instead killing the VM.
> >
> > Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate
> > APICv once KVM guest would try to change APIC ID in xapic mode, and same
> > sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep
> > alive but obviously lose benefit from hardware acceleration in this way.
> >
> > So how do you think the proposal like this ?
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6dcccb304775..30d825c069be 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
> > #define APICV_INHIBIT_REASON_X2APIC 5
> > #define APICV_INHIBIT_REASON_BLOCKIRQ 6
> > #define APICV_INHIBIT_REASON_ABSENT 7
> > +#define APICV_INHIBIT_REASON_APICID_CHG 8
> >
> > struct kvm_arch {
> > unsigned long n_used_mmu_pages;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 22929b5b3f9b..66cd54fa4515 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >
> > switch (reg) {
> > case APIC_ID: /* Local APIC ID */
> > - if (!apic_x2apic_mode(apic))
> > - kvm_apic_set_xapic_id(apic, val >> 24);
> > - else
> > + if (apic_x2apic_mode(apic)) {
> > ret = 1;
> > + break;
> > + }
> > + /*
> > + * If changing APIC ID with any APIC acceleration enabled,
> > + * deactivate APICv to avoid unexpected issues.
> > + */
> > + if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
> > + kvm_request_apicv_update(apic->vcpu->kvm,
> > + false, APICV_INHIBIT_REASON_APICID_CHG);
> > +
> > + kvm_apic_set_xapic_id(apic, val >> 24);
> > break;
> >
> > case APIC_TASKPRI:
> > @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> > static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> > struct kvm_lapic_state *s, bool set)
> > {
> > - if (apic_x2apic_mode(vcpu->arch.apic)) {
> > - u32 *id = (u32 *)(s->regs + APIC_ID);
> > - u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > - u64 icr;
> > + u32 *id = (u32 *)(s->regs + APIC_ID);
> > + u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > + u64 icr;
> > + if (!apic_x2apic_mode(vcpu->arch.apic)) {
> > + /*
> > + * If APIC ID changed with any APIC acceleration enabled,
> > + * deactivate APICv to avoid unexpected issues.
> > + */
> > + if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
> > + kvm_request_apicv_update(vcpu->kvm,
> > + false, APICV_INHIBIT_REASON_APICID_CHG);
> > + } else {
> > if (vcpu->kvm->arch.x2apic_format) {
> > if (*id != vcpu->vcpu_id)
> > return -EINVAL;
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index 82d56f8055de..f78754bdc1d0 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
> > BIT(APICV_INHIBIT_REASON_IRQWIN) |
> > BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
> > BIT(APICV_INHIBIT_REASON_X2APIC) |
> > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > + BIT(APICV_INHIBIT_REASON_APICID_CHG);
> >
> > return supported & BIT(bit);
> > }
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 7beba7a9f247..91265f0784bd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> > ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
> > BIT(APICV_INHIBIT_REASON_ABSENT) |
> > BIT(APICV_INHIBIT_REASON_HYPERV) |
> > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > + BIT(APICV_INHIBIT_REASON_APICID_CHG);
> >
> > return supported & BIT(bit);
> > }
> >
> >
> >
>
> This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
> because migration happens.
I mean because host decided to change its apic id, which it can in theory do any time,
even after the nested guest has started. Seriously, the only reason guest has to change apic id,
is to try to exploit some security hole.
Best regards,
Maxim Levitsky
>
> Best regards,
> Maxim Levitsky
>> > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
>> > > because migration happens.
>> >
>> > I mean because host decided to change its apic id, which it can in theory do any time,
>> > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
>> > is to try to exploit some security hole.
>>
>> Hi
>>
>> Thanks for the information.
>>
>> IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
>> enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
>> It looks like a generic issue in dynamically toggling APICv scheme,
>> e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
>>
>
>That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
>Or in other words, when nested guest uses a hardware feature, you can't at some point say to it:
>sorry buddy - hardware feature disappeared.
Agreed. I missed this.
>
>It is *currently* not a problem for APICv because it doesn't do IPI virtualization,
>and even with these patches, it doesn't do this for nesting.
>It does become when you allow nested guest to use this which I did in the nested AVIC code.
>
>
>and writable apic ids do pose a large problem, since nested AVIC, will target L1's apic ids,
>and when they can change under you without any notice, and even worse be duplicate,
>it is just nightmare.
OK. So the problem of disabling APICv is if we choose to disable APICv instead
of making APIC ID read-only, although it can work perfectly for VMX IPIv, it
effectively makes future cleanup to AVIC difficult/impossible because nested
AVIC is practically to implement without assuming APIC IDs of L1 is immutable.
Sean & Maxim
How about go back to use a module parameter to opt in to read-only APIC ID.
Although migration in some cases may fail but it shouldn't be a big issue as
migration VMs from a KVM with nested=on to a KVM with nested=off may also fail.
On Sun, 2022-03-13 at 21:53 +0800, Chao Gao wrote:
> On Sun, Mar 13, 2022 at 12:59:36PM +0200, Maxim Levitsky wrote:
> > On Sun, 2022-03-13 at 11:19 +0200, Maxim Levitsky wrote:
> > > On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
> > > > On 3/11/2022 12:26 PM, Sean Christopherson wrote:
> > > > > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> > > > > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > > > > > > Could you share the links?
> > > > > > >
> > > > > > > Doh, sorry (they're both in this one).
> > > > > > >
> > > > > > > https://lore.kernel.org/all/[email protected]
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > My opinion on this subject is very simple: we need to draw the line somewhere.
> > > > >
> > > > > ...
> > > > >
> > > > >
> > > > > Since the goal is to simplify KVM, can we try the inhibit route and see what the
> > > > > code looks like before making a decision? I think it might actually yield a less
> > > > > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
> > > > > we don't try to remove the inhibit on subsequent changes.
> > > > >
> > > > > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
> > > > > why the VM was killed. WARN is out of the question because this is user triggerable.
> > > > > Returning an emulation error would be ideal, but getting that result up through
> > > > > apic_mmio_write() could be annoying and end up being more complex.
> > > > >
> > > > > The touchpoints will all be the same, unless I'm missing something the difference
> > > > > should only be a call to set an inhibit instead killing the VM.
> > > >
> > > > Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate
> > > > APICv once KVM guest would try to change APIC ID in xapic mode, and same
> > > > sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep
> > > > alive but obviously lose benefit from hardware acceleration in this way.
> > > >
> > > > So how do you think the proposal like this ?
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 6dcccb304775..30d825c069be 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
> > > > #define APICV_INHIBIT_REASON_X2APIC 5
> > > > #define APICV_INHIBIT_REASON_BLOCKIRQ 6
> > > > #define APICV_INHIBIT_REASON_ABSENT 7
> > > > +#define APICV_INHIBIT_REASON_APICID_CHG 8
> > > >
> > > > struct kvm_arch {
> > > > unsigned long n_used_mmu_pages;
> > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > > index 22929b5b3f9b..66cd54fa4515 100644
> > > > --- a/arch/x86/kvm/lapic.c
> > > > +++ b/arch/x86/kvm/lapic.c
> > > > @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > > >
> > > > switch (reg) {
> > > > case APIC_ID: /* Local APIC ID */
> > > > - if (!apic_x2apic_mode(apic))
> > > > - kvm_apic_set_xapic_id(apic, val >> 24);
> > > > - else
> > > > + if (apic_x2apic_mode(apic)) {
> > > > ret = 1;
> > > > + break;
> > > > + }
> > > > + /*
> > > > + * If changing APIC ID with any APIC acceleration enabled,
> > > > + * deactivate APICv to avoid unexpected issues.
> > > > + */
> > > > + if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
> > > > + kvm_request_apicv_update(apic->vcpu->kvm,
> > > > + false, APICV_INHIBIT_REASON_APICID_CHG);
> > > > +
> > > > + kvm_apic_set_xapic_id(apic, val >> 24);
> > > > break;
> > > >
> > > > case APIC_TASKPRI:
> > > > @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> > > > static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> > > > struct kvm_lapic_state *s, bool set)
> > > > {
> > > > - if (apic_x2apic_mode(vcpu->arch.apic)) {
> > > > - u32 *id = (u32 *)(s->regs + APIC_ID);
> > > > - u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > > > - u64 icr;
> > > > + u32 *id = (u32 *)(s->regs + APIC_ID);
> > > > + u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > > > + u64 icr;
> > > > + if (!apic_x2apic_mode(vcpu->arch.apic)) {
> > > > + /*
> > > > + * If APIC ID changed with any APIC acceleration enabled,
> > > > + * deactivate APICv to avoid unexpected issues.
> > > > + */
> > > > + if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
> > > > + kvm_request_apicv_update(vcpu->kvm,
> > > > + false, APICV_INHIBIT_REASON_APICID_CHG);
> > > > + } else {
> > > > if (vcpu->kvm->arch.x2apic_format) {
> > > > if (*id != vcpu->vcpu_id)
> > > > return -EINVAL;
> > > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > > > index 82d56f8055de..f78754bdc1d0 100644
> > > > --- a/arch/x86/kvm/svm/avic.c
> > > > +++ b/arch/x86/kvm/svm/avic.c
> > > > @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
> > > > BIT(APICV_INHIBIT_REASON_IRQWIN) |
> > > > BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
> > > > BIT(APICV_INHIBIT_REASON_X2APIC) |
> > > > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > > > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > > > + BIT(APICV_INHIBIT_REASON_APICID_CHG);
> > > >
> > > > return supported & BIT(bit);
> > > > }
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 7beba7a9f247..91265f0784bd 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> > > > ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
> > > > BIT(APICV_INHIBIT_REASON_ABSENT) |
> > > > BIT(APICV_INHIBIT_REASON_HYPERV) |
> > > > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > > > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > > > + BIT(APICV_INHIBIT_REASON_APICID_CHG);
> > > >
> > > > return supported & BIT(bit);
> > > > }
> > > >
> > > >
> > > >
> > >
> > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
> > > because migration happens.
> >
> > I mean because host decided to change its apic id, which it can in theory do any time,
> > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
> > is to try to exploit some security hole.
>
> Hi
>
> Thanks for the information.
>
> IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
> enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
> It looks like a generic issue in dynamically toggling APICv scheme,
> e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
>
That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
Or in other words, when nested guest uses a hardware feature, you can't at some point say to it:
sorry buddy - hardware feature disappeared.
It is *currently* not a problem for APICv because it doesn't do IPI virtualization,
and even with these patches, it doesn't do this for nesting.
It does become when you allow nested guest to use this which I did in the nested AVIC code.
and writable apic ids do pose a large problem, since nested AVIC, will target L1's apic ids,
and when they can change under you without any notice, and even worse be duplicate,
it is just nightmare.
About KVM_GUESTDBG_BLOCKIRQ - yes, but that is a best effort hack anyway,
which not supposed to 100% work - running gdbstub with nested is broken in many ways anyway.
Best regards,
Maxim Levitsky
On Sun, Mar 13, 2022 at 05:09:08PM +0200, Maxim Levitsky wrote:
>> > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
>> > > because migration happens.
>> >
>> > I mean because host decided to change its apic id, which it can in theory do any time,
>> > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
>> > is to try to exploit some security hole.
>>
>> Hi
>>
>> Thanks for the information.
>>
>> IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
>> enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
>> It looks like a generic issue in dynamically toggling APICv scheme,
>> e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
>>
>
>That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
>Or in other words, when nested guest uses a hardware feature, you can't at some point say to it:
>sorry buddy - hardware feature disappeared.
Hi Maxim,
I may miss something. When reading Sean's APICv inhibition cleanups, I
find AVIC is disabled for L1 when nested is enabled (SVM is advertised
to L1). Then, I think the new inhibition introduced for changed xAPIC ID
shouldn't be a problem for L2 VM. Or, you plan to remove
APICV_INHIBIT_REASON_NESTED and expose AVIC to L1?
svm_vcpu_after_set_cpuid:
/*
* Currently, AVIC does not work with nested virtualization.
* So, we disable AVIC when cpuid for SVM is set in the L1 guest.
*/
if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM))
kvm_request_apicv_update(vcpu->kvm, false,
APICV_INHIBIT_REASON_NESTED);
On Tue, Mar 15, 2022 at 05:30:32PM +0200, Maxim Levitsky wrote:
>Yep, I have a patch for this ( which I hope to be accepted really soon
>(KVM: x86: SVM: allow AVIC to co-exist with a nested guest running)
>
>I also implemented working support for nested AVIC, which includes support for IPI without vm exits
>between L2's vCPUs. I had sent an RFC for that.
>
>With all patches applied both L1 and L2 switch hands on AVIC, L1's avic is inhibited
>(only locally) on the vCPU which runs nested, and while it runs nested, L2 uses AVIC
>to target other vCPUs which also run nested.
>
>I and Paolo talked about this, and we reached a very promising conclusion.
>
>I will add new KVM cap, say KVM_CAP_READ_ONLY_APIC, which userspace will set
>prior to creating a vCPU, and which will make APIC ID fully readonly when set.
Will KVM report violations to QEMU? then, QEMU can know the VM attempted
to change APIC ID and report an error to admin. Then admin can relaunch
the VM without setting this new cap.
>
>As a bonus, if you don't object, I will also make this cap, make APIC base read-only,
>since this feature is also broken in kvm, optional in x86 spec, and not really
>used by guests just like writable apic id.
>
>I hope to have patches in day or two for this.
Great. And no objection to making APIC base read-only.
On Tue, 2022-03-15 at 23:10 +0800, Chao Gao wrote:
> On Sun, Mar 13, 2022 at 05:09:08PM +0200, Maxim Levitsky wrote:
> > > > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
> > > > > because migration happens.
> > > >
> > > > I mean because host decided to change its apic id, which it can in theory do any time,
> > > > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
> > > > is to try to exploit some security hole.
> > >
> > > Hi
> > >
> > > Thanks for the information.
> > >
> > > IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
> > > enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
> > > It looks like a generic issue in dynamically toggling APICv scheme,
> > > e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
> > >
> >
> > That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
> > Or in other words, when nested guest uses a hardware feature, you can't at some point say to it:
> > sorry buddy - hardware feature disappeared.
>
> Hi Maxim,
>
> I may miss something. When reading Sean's APICv inhibition cleanups, I
> find AVIC is disabled for L1 when nested is enabled (SVM is advertised
> to L1). Then, I think the new inhibition introduced for changed xAPIC ID
> shouldn't be a problem for L2 VM. Or, you plan to remove
> APICV_INHIBIT_REASON_NESTED and expose AVIC to L1?
Yep, I have a patch for this ( which I hope to be accepted really soon
(KVM: x86: SVM: allow AVIC to co-exist with a nested guest running)
I also implemented working support for nested AVIC, which includes support for IPI without vm exits
between L2's vCPUs. I had sent an RFC for that.
With all patches applied both L1 and L2 switch hands on AVIC, L1's avic is inhibited
(only locally) on the vCPU which runs nested, and while it runs nested, L2 uses AVIC
to target other vCPUs which also run nested.
I and Paolo talked about this, and we reached a very promising conclusion.
I will add new KVM cap, say KVM_CAP_READ_ONLY_APIC, which userspace will set
prior to creating a vCPU, and which will make APIC ID fully readonly when set.
As a bonus, if you don't object, I will also make this cap, make APIC base read-only,
since this feature is also broken in kvm, optional in x86 spec, and not really
used by guests just like writable apic id.
I hope to have patches in day or two for this.
When this cap is not set, it is fair to disable both IPIv, my nested AVIC,
or even better inhibit AVIC completely, including any nested support.
Best regards,
Maxim Levitsky
>
> svm_vcpu_after_set_cpuid:
> /*
> * Currently, AVIC does not work with nested virtualization.
> * So, we disable AVIC when cpuid for SVM is set in the L1 guest.
> */
> if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM))
> kvm_request_apicv_update(vcpu->kvm, false,
> APICV_INHIBIT_REASON_NESTED);
>