2024-02-15 16:10:40

by Alejandro Jimenez

[permalink] [raw]
Subject: [RFC 0/3] Export APICv-related state via binary stats interface

The goal of this RFC is to agree on a mechanism for querying the state (and
related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
topic since that is the side that I have mostly looked at, and has the greater
number of possible inhibits, but I believe the argument applies for both
vendor's technologies.

Currently, a user or monitoring app trying to determine if APICv is actually
being used needs implementation-specific knowlegde in order to look for specific
types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
(e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
this information, but there has not been any development in that direction.
Sean has mentioned a preference for using BPF to extract info from the current
tracepoints, which would require reworking existing structs to access some
desired data, but as far as I know there isn't any work done on that approach
yet.

Recently Joao mentioned another alternative: the binary stats framework that is
already supported by kernel[1] and QEMU[2]. This RFC has minimal code changes to
expose the relevant info based on the existing data types the framework already
supports. If there is consensus on using this approach, I can expand the fd
stats subsystem to include other data types (e.g. a bitmap type for exposing the
inhibit reasons), as well as adding documentation on KVM explaining which stats
are relevant for APICv and how to query them.

A basic example of retrieving the stats via qmp-shell, showing both a VM and
per-vCPU case:

# /usr/local/bin/qmp-shell --pretty ./qmp-sock

(QEMU) query-stats target=vm providers=[{'provider':'kvm','names':['apicv_inhibited']}]
{
"return": [
{
"provider": "kvm",
"stats": [
{
"name": "apicv_inhibited",
"value": false
}
]
}
]
}

(QEMU) query-stats target=vcpu vcpus=['/machine/unattached/device[0]'] providers=[{'provider':'kvm','names':['apicv_accept_irq','ga_log_event']}]
{
"return": [
{
"provider": "kvm",
"qom-path": "/machine/unattached/device[0]",
"stats": [
{
"name": "ga_log_event",
"value": 98
},
{
"name": "apicv_accept_irq",
"value": 166920
}
]
}
]
}

If other alternatives are preferred, please let's use this thread to discuss and
I can take a shot at implementing the desired solution.

Regards,
Alejandro

[0] https://lore.kernel.org/qemu-devel/[email protected]/
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/qemu-devel/[email protected]/

Alejandro Jimenez (3):
x86: KVM: stats: Add a stat to report status of APICv inhibition
x86: KVM: stats: Add stat counter for IRQs injected via APICv
x86: KVM: stats: Add a stat counter for GALog events

arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/svm/avic.c | 4 +++-
arch/x86/kvm/svm/svm.c | 3 +++
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/x86.c | 12 +++++++++++-
5 files changed, 22 insertions(+), 2 deletions(-)


base-commit: 7455665a3521aa7b56245c0a2810f748adc5fdd4
--
2.39.3



2024-02-15 16:27:57

by Alejandro Jimenez

[permalink] [raw]
Subject: [RFC 2/3] x86: KVM: stats: Add stat counter for IRQs injected via APICv

Export binary stat counting how many interrupts have been delivered via
APICv/AVIC acceleration from the host. This is one of the most reliable
methods to detect when hardware accelerated interrupt delivery is active,
since APIC timer interrupts are regularly injected and exercise these
code paths.

Signed-off-by: Alejandro Jimenez <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm/svm.c | 3 +++
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/x86.c | 1 +
4 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9b960a523715..b6f18084d504 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1564,6 +1564,7 @@ struct kvm_vcpu_stat {
u64 preemption_other;
u64 guest_mode;
u64 notify_window_exits;
+ u64 apicv_accept_irq;
};

struct x86_instruction_info;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e90b429c84f1..2243af08ed39 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3648,6 +3648,9 @@ void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
}

trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode, trig_mode, vector);
+
+ ++vcpu->stat.apicv_accept_irq;
+
if (in_guest_mode) {
/*
* Signal the doorbell to tell hardware to inject the IRQ. If
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d4e6625e0a9a..f7db75ae2c55 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4275,6 +4275,8 @@ static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
} else {
trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
trig_mode, vector);
+
+ ++vcpu->stat.apicv_accept_irq;
}
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f7f598f066e7..2ad70cf6e52c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -304,6 +304,7 @@ const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
STATS_DESC_COUNTER(VCPU, preemption_other),
STATS_DESC_IBOOLEAN(VCPU, guest_mode),
STATS_DESC_COUNTER(VCPU, notify_window_exits),
+ STATS_DESC_COUNTER(VCPU, apicv_accept_irq),
};

const struct kvm_stats_header kvm_vcpu_stats_header = {
--
2.39.3


2024-04-09 05:10:17

by Vasant Hegde

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface

Hi Alejadnro,

On 2/15/2024 9:31 PM, Alejandro Jimenez wrote:
> The goal of this RFC is to agree on a mechanism for querying the state (and
> related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
> topic since that is the side that I have mostly looked at, and has the greater
> number of possible inhibits, but I believe the argument applies for both
> vendor's technologies.
>
> Currently, a user or monitoring app trying to determine if APICv is actually
> being used needs implementation-specific knowlegde in order to look for specific
> types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
> by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
> (e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
> tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
> downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
> this information, but there has not been any development in that direction.
> Sean has mentioned a preference for using BPF to extract info from the current
> tracepoints, which would require reworking existing structs to access some
> desired data, but as far as I know there isn't any work done on that approach
> yet.
>
> Recently Joao mentioned another alternative: the binary stats framework that is
> already supported by kernel[1] and QEMU[2]. This RFC has minimal code changes to
> expose the relevant info based on the existing data types the framework already
> supports. If there is consensus on using this approach, I can expand the fd
> stats subsystem to include other data types (e.g. a bitmap type for exposing the
> inhibit reasons), as well as adding documentation on KVM explaining which stats
> are relevant for APICv and how to query them.

Thanks for the series. IMO this approach makes sense. May be we should
consider adding one more stat to say whether AVIC is active or not. That
way,
Check whether AVIC is active or not.
If AVIC is active, then inhibited or not
If not inhibited, then use other statistics info.


I have reviewed/tested this series on AMD Genoa platform. It looks good
to me.

Reviewed-by: Vasant Hegde <[email protected]>

-Vasant

>
> A basic example of retrieving the stats via qmp-shell, showing both a VM and
> per-vCPU case:
>
> # /usr/local/bin/qmp-shell --pretty ./qmp-sock
>
> (QEMU) query-stats target=vm providers=[{'provider':'kvm','names':['apicv_inhibited']}]
> {
> "return": [
> {
> "provider": "kvm",
> "stats": [
> {
> "name": "apicv_inhibited",
> "value": false
> }
> ]
> }
> ]
> }
>
> (QEMU) query-stats target=vcpu vcpus=['/machine/unattached/device[0]'] providers=[{'provider':'kvm','names':['apicv_accept_irq','ga_log_event']}]
> {
> "return": [
> {
> "provider": "kvm",
> "qom-path": "/machine/unattached/device[0]",
> "stats": [
> {
> "name": "ga_log_event",
> "value": 98
> },
> {
> "name": "apicv_accept_irq",
> "value": 166920
> }
> ]
> }
> ]
> }
>
> If other alternatives are preferred, please let's use this thread to discuss and
> I can take a shot at implementing the desired solution.
>
> Regards,
> Alejandro
>
> [0] https://lore.kernel.org/qemu-devel/[email protected]/
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/qemu-devel/[email protected]/
>
> Alejandro Jimenez (3):
> x86: KVM: stats: Add a stat to report status of APICv inhibition
> x86: KVM: stats: Add stat counter for IRQs injected via APICv
> x86: KVM: stats: Add a stat counter for GALog events
>
> arch/x86/include/asm/kvm_host.h | 3 +++
> arch/x86/kvm/svm/avic.c | 4 +++-
> arch/x86/kvm/svm/svm.c | 3 +++
> arch/x86/kvm/vmx/vmx.c | 2 ++
> arch/x86/kvm/x86.c | 12 +++++++++++-
> 5 files changed, 22 insertions(+), 2 deletions(-)
>
>
> base-commit: 7455665a3521aa7b56245c0a2810f748adc5fdd4


2024-04-10 01:36:54

by Alejandro Jimenez

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface



On 4/9/24 01:09, Vasant Hegde wrote:
> Hi Alejadnro,
>
> On 2/15/2024 9:31 PM, Alejandro Jimenez wrote:
>> The goal of this RFC is to agree on a mechanism for querying the state (and
>> related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
>> topic since that is the side that I have mostly looked at, and has the greater
>> number of possible inhibits, but I believe the argument applies for both
>> vendor's technologies.
>>
>> Currently, a user or monitoring app trying to determine if APICv is actually
>> being used needs implementation-specific knowlegde in order to look for specific
>> types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
>> by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
>> (e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
>> tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
>> downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
>> this information, but there has not been any development in that direction.
>> Sean has mentioned a preference for using BPF to extract info from the current
>> tracepoints, which would require reworking existing structs to access some
>> desired data, but as far as I know there isn't any work done on that approach
>> yet.
>>
>> Recently Joao mentioned another alternative: the binary stats framework that is
>> already supported by kernel[1] and QEMU[2]. This RFC has minimal code changes to
>> expose the relevant info based on the existing data types the framework already
>> supports. If there is consensus on using this approach, I can expand the fd
>> stats subsystem to include other data types (e.g. a bitmap type for exposing the
>> inhibit reasons), as well as adding documentation on KVM explaining which stats
>> are relevant for APICv and how to query them.
>
> Thanks for the series. IMO this approach makes sense. May be we should consider adding one more stat to say whether AVIC is active or not. That way,
>  Check whether AVIC is active or not.
>  If AVIC is active, then inhibited or not
>  If not inhibited, then use other statistics info.

Hi Vasant,

Thank you for reviewing/testing. I'll implement your suggestion and send it on the next revision.

Alejandro

>
>
> I have reviewed/tested this series on AMD Genoa platform. It looks good to me.
>
> Reviewed-by: Vasant Hegde <[email protected]>
>
> -Vasant
>
>>
>> A basic example of retrieving the stats via qmp-shell, showing both a VM and
>> per-vCPU case:
>>
>> # /usr/local/bin/qmp-shell --pretty ./qmp-sock
>>
>> (QEMU) query-stats target=vm providers=[{'provider':'kvm','names':['apicv_inhibited']}]
>> {
>>      "return": [
>>          {
>>              "provider": "kvm",
>>              "stats": [
>>                  {
>>                      "name": "apicv_inhibited",
>>                      "value": false
>>                  }
>>              ]
>>          }
>>      ]
>> }
>>
>> (QEMU) query-stats target=vcpu vcpus=['/machine/unattached/device[0]'] providers=[{'provider':'kvm','names':['apicv_accept_irq','ga_log_event']}]
>> {
>>      "return": [
>>          {
>>              "provider": "kvm",
>>              "qom-path": "/machine/unattached/device[0]",
>>              "stats": [
>>                  {
>>                      "name": "ga_log_event",
>>                      "value": 98
>>                  },
>>                  {
>>                      "name": "apicv_accept_irq",
>>                      "value": 166920
>>                  }
>>              ]
>>          }
>>      ]
>> }
>>
>> If other alternatives are preferred, please let's use this thread to discuss and
>> I can take a shot at implementing the desired solution.
>>
>> Regards,
>> Alejandro
>>
>> [0] https://lore.kernel.org/qemu-devel/[email protected]/
>> [1] https://lore.kernel.org/all/[email protected]/
>> [2] https://lore.kernel.org/qemu-devel/[email protected]/
>>
>> Alejandro Jimenez (3):
>>    x86: KVM: stats: Add a stat to report status of APICv inhibition
>>    x86: KVM: stats: Add stat counter for IRQs injected via APICv
>>    x86: KVM: stats: Add a stat counter for GALog events
>>
>>   arch/x86/include/asm/kvm_host.h |  3 +++
>>   arch/x86/kvm/svm/avic.c         |  4 +++-
>>   arch/x86/kvm/svm/svm.c          |  3 +++
>>   arch/x86/kvm/vmx/vmx.c          |  2 ++
>>   arch/x86/kvm/x86.c              | 12 +++++++++++-
>>   5 files changed, 22 insertions(+), 2 deletions(-)
>>
>>
>> base-commit: 7455665a3521aa7b56245c0a2810f748adc5fdd4
>
>

2024-04-16 18:08:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface

On Thu, Feb 15, 2024, Alejandro Jimenez wrote:
> The goal of this RFC is to agree on a mechanism for querying the state (and
> related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
> topic since that is the side that I have mostly looked at, and has the greater
> number of possible inhibits, but I believe the argument applies for both
> vendor's technologies.
>
> Currently, a user or monitoring app trying to determine if APICv is actually
> being used needs implementation-specific knowlegde in order to look for specific
> types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
> by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
> (e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
> tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
> downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
> this information, but there has not been any development in that direction.
> Sean has mentioned a preference for using BPF to extract info from the current
> tracepoints, which would require reworking existing structs to access some
> desired data, but as far as I know there isn't any work done on that approach
> yet.
>
> Recently Joao mentioned another alternative: the binary stats framework that is
> already supported by kernel[1] and QEMU[2].

The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
once they're added, and KVM needs to maintain the exact behavior.

Tracepoints are explicitly not ABI, and so we can be much more permissive when it
comes to adding/expanding tracepoints, specifically because there are no guarantees
provided to userspace.

2024-04-16 19:51:36

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface

On Tue, Apr 16, 2024 at 8:08 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 15, 2024, Alejandro Jimenez wrote:
> > The goal of this RFC is to agree on a mechanism for querying the state (and
> > related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
> > topic since that is the side that I have mostly looked at, and has the greater
> > number of possible inhibits, but I believe the argument applies for both
> > vendor's technologies.
> >
> > Currently, a user or monitoring app trying to determine if APICv is actually
> > being used needs implementation-specific knowlegde in order to look for specific
> > types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
> > by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
> > (e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
> > tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
> > downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
> > this information, but there has not been any development in that direction.
> > Sean has mentioned a preference for using BPF to extract info from the current
> > tracepoints, which would require reworking existing structs to access some
> > desired data, but as far as I know there isn't any work done on that approach
> > yet.
> >
> > Recently Joao mentioned another alternative: the binary stats framework that is
> > already supported by kernel[1] and QEMU[2].
>
> The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
> once they're added, and KVM needs to maintain the exact behavior.

Stats are not ABI---why would they be? They have an established
meaning and it's not a good idea to change it, but it's not an
absolute no-no(*); and removing them is not a problem at all.

For example, if stats were ABI, there would be no need for the
introspection mechanism, you could just use a struct like ethtool
stats (which *are* ABO).

Not everything makes a good stat but, if in doubt and it's cheap
enough to collect it, go ahead and add it. Cheap collection is the
important point, because tracepoints in a hot path can be so expensive
as to slow down the guest substantially, at least in microbenchmarks.

In this case I'm not sure _all_ inhibits makes sense and I certainly
wouldn't want a bitmask, but a generic APICv-enabled stat certainly
makes sense, and perhaps another for a weirdly-configured local APIC.

Paolo

(*) you have to draw a line somewhere. New processor models may
virtualize parts of the CPU in such a way that some stats become
meaningless or just stay at zero. Should KVM not support those
features because it is not possible anymore to introspect the guest
through stat?

> Tracepoints are explicitly not ABI, and so we can be much more permissive when it
> comes to adding/expanding tracepoints, specifically because there are no guarantees
> provided to userspace.
>


2024-04-16 21:38:19

by Alejandro Jimenez

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface



On 4/16/24 15:51, Paolo Bonzini wrote:
> On Tue, Apr 16, 2024 at 8:08 PM Sean Christopherson <[email protected]> wrote:
>>
>> On Thu, Feb 15, 2024, Alejandro Jimenez wrote:
>>> The goal of this RFC is to agree on a mechanism for querying the state (and
>>> related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
>>> topic since that is the side that I have mostly looked at, and has the greater
>>> number of possible inhibits, but I believe the argument applies for both
>>> vendor's technologies.
>>>
>>> Currently, a user or monitoring app trying to determine if APICv is actually
>>> being used needs implementation-specific knowlegde in order to look for specific
>>> types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
>>> by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
>>> (e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
>>> tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
>>> downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
>>> this information, but there has not been any development in that direction.
>>> Sean has mentioned a preference for using BPF to extract info from the current
>>> tracepoints, which would require reworking existing structs to access some
>>> desired data, but as far as I know there isn't any work done on that approach
>>> yet.
>>>
>>> Recently Joao mentioned another alternative: the binary stats framework that is
>>> already supported by kernel[1] and QEMU[2].
>>
>> The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
>> once they're added, and KVM needs to maintain the exact behavior.
>
> Stats are not ABI---why would they be? They have an established
> meaning and it's not a good idea to change it, but it's not an
> absolute no-no(*); and removing them is not a problem at all.
>
> For example, if stats were ABI, there would be no need for the
> introspection mechanism, you could just use a struct like ethtool
> stats (which *are* ABO).
>
> Not everything makes a good stat but, if in doubt and it's cheap
> enough to collect it, go ahead and add it. Cheap collection is the
> important point, because tracepoints in a hot path can be so expensive
> as to slow down the guest substantially, at least in microbenchmarks.
>
> In this case I'm not sure _all_ inhibits makes sense and I certainly
> wouldn't want a bitmask,

I wanted to be able to query enough info via stats (i.e. is APICv active, and if
not, why is it inhibited?) that is exposed via the other interfaces which are not
always available. That unfortunately requires a bit of "overloading" of
the stat as I mentioned earlier, but it remains cheap to collect and expose.

What would be your preferred interface to provide the (complete) APICv state?

but a generic APICv-enabled stat certainly
> makes sense, and perhaps another for a weirdly-configured local APIC.

Can you expand on what you mean by "weirdly-configured"? Do you mean something
like virtual wire mode?

Alejandro

>
> Paolo
>
> (*) you have to draw a line somewhere. New processor models may
> virtualize parts of the CPU in such a way that some stats become
> meaningless or just stay at zero. Should KVM not support those
> features because it is not possible anymore to introspect the guest
> through stat?
>
>> Tracepoints are explicitly not ABI, and so we can be much more permissive when it
>> comes to adding/expanding tracepoints, specifically because there are no guarantees
>> provided to userspace.
>>
>
>

2024-04-16 21:39:35

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface

On Tue, Apr 16, 2024 at 11:29 PM Alejandro Jimenez
<[email protected]> wrote:
>
>
>
> On 4/16/24 15:51, Paolo Bonzini wrote:
> > On Tue, Apr 16, 2024 at 8:08 PM Sean Christopherson <[email protected]> wrote:
> >>
> >> On Thu, Feb 15, 2024, Alejandro Jimenez wrote:
> >>> The goal of this RFC is to agree on a mechanism for querying the state (and
> >>> related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
> >>> topic since that is the side that I have mostly looked at, and has the greater
> >>> number of possible inhibits, but I believe the argument applies for both
> >>> vendor's technologies.
> >>>
> >>> Currently, a user or monitoring app trying to determine if APICv is actually
> >>> being used needs implementation-specific knowlegde in order to look for specific
> >>> types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
> >>> by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
> >>> (e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
> >>> tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
> >>> downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
> >>> this information, but there has not been any development in that direction.
> >>> Sean has mentioned a preference for using BPF to extract info from the current
> >>> tracepoints, which would require reworking existing structs to access some
> >>> desired data, but as far as I know there isn't any work done on that approach
> >>> yet.
> >>>
> >>> Recently Joao mentioned another alternative: the binary stats framework that is
> >>> already supported by kernel[1] and QEMU[2].
> >>
> >> The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
> >> once they're added, and KVM needs to maintain the exact behavior.
> >
> > Stats are not ABI---why would they be? They have an established
> > meaning and it's not a good idea to change it, but it's not an
> > absolute no-no(*); and removing them is not a problem at all.
> >
> > For example, if stats were ABI, there would be no need for the
> > introspection mechanism, you could just use a struct like ethtool
> > stats (which *are* ABO).
> >
> > Not everything makes a good stat but, if in doubt and it's cheap
> > enough to collect it, go ahead and add it. Cheap collection is the
> > important point, because tracepoints in a hot path can be so expensive
> > as to slow down the guest substantially, at least in microbenchmarks.
> >
> > In this case I'm not sure _all_ inhibits makes sense and I certainly
> > wouldn't want a bitmask,
>
> I wanted to be able to query enough info via stats (i.e. is APICv active, and if
> not, why is it inhibited?) that is exposed via the other interfaces which are not
> always available. That unfortunately requires a bit of "overloading" of
> the stat as I mentioned earlier, but it remains cheap to collect and expose.
>
> What would be your preferred interface to provide the (complete) APICv state?
>
> but a generic APICv-enabled stat certainly
> > makes sense, and perhaps another for a weirdly-configured local APIC.
>
> Can you expand on what you mean by "weirdly-configured"? Do you mean something
> like virtual wire mode?

I mean any of

APICV_INHIBIT_REASON_HYPERV,
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED,
APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
APICV_INHIBIT_REASON_APIC_BASE_MODIFIED,
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,

which in practice are always going to be APICV_INHIBIT_REASON_HYPERV
on 99.99% of the gues

ExtINT is visible via irq_window_exits; if you are not running nested,
do not have the problematic configurations above. don't have debugging
(BLOCKIRQ) or in-kernel PIT with reinjection, that's basically the
only one that's left.

Paolo

> Alejandro
>
> >
> > Paolo
> >
> > (*) you have to draw a line somewhere. New processor models may
> > virtualize parts of the CPU in such a way that some stats become
> > meaningless or just stay at zero. Should KVM not support those
> > features because it is not possible anymore to introspect the guest
> > through stat?
> >
> >> Tracepoints are explicitly not ABI, and so we can be much more permissive when it
> >> comes to adding/expanding tracepoints, specifically because there are no guarantees
> >> provided to userspace.
> >>
> >
> >
>


2024-04-16 22:35:18

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface

On Tue, Apr 16, 2024, Paolo Bonzini wrote:
> On Tue, Apr 16, 2024 at 8:08 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 15, 2024, Alejandro Jimenez wrote:
> > > The goal of this RFC is to agree on a mechanism for querying the state (and
> > > related stats) of APICv/AVIC. I clearly have an AVIC bias when approaching this
> > > topic since that is the side that I have mostly looked at, and has the greater
> > > number of possible inhibits, but I believe the argument applies for both
> > > vendor's technologies.
> > >
> > > Currently, a user or monitoring app trying to determine if APICv is actually
> > > being used needs implementation-specific knowlegde in order to look for specific
> > > types of #VMEXIT (i.e. AVIC_INCOMPLETE_IPI/AVIC_NOACCEL), checking GALog events
> > > by watching /proc/interrupts for AMD-Vi*-GA, etc. There are existing tracepoints
> > > (e.g. kvm_apicv_accept_irq, kvm_avic_ga_log) that make this task easier, but
> > > tracefs is not viable in some scenarios. Adding kvm debugfs entries has similar
> > > downsides. Suravee has previously proposed a new IOCTL interface[0] to expose
> > > this information, but there has not been any development in that direction.
> > > Sean has mentioned a preference for using BPF to extract info from the current
> > > tracepoints, which would require reworking existing structs to access some
> > > desired data, but as far as I know there isn't any work done on that approach
> > > yet.
> > >
> > > Recently Joao mentioned another alternative: the binary stats framework that is
> > > already supported by kernel[1] and QEMU[2].
> >
> > The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
> > once they're added, and KVM needs to maintain the exact behavior.
>
> Stats are not ABI---why would they be?

Because they exposed through an ioctl(), and userspace can and will use stats for
functional purposes? Maybe I just had the wrong takeaway from an old thread about
adding a big pile of stats[1], where unfortunately (for me) you weighed in on
whether or not tracepoints are ABI, but not stats.

And because I've have been operating under the assumption that stats are ABI, I've
been guiding people to using stats to make decisions in userspace. E.g. KVM doesn't
currently expose is_guest_mode() in kvm_run, but it is a stat, so it's not hard
to imagine userspace using the stat to make decisions without needing to call back
into KVM[2].

And based on the old discussion[1] I doubt I'm though only one that has this view.

That said, I'm definitely not opposed to stats _not_ being ABI, because that would
give us a ton of flexibility. E.g. we have a non-trivial number of internal stats
that are super useful _for us_, but are rather heavy and might not be desirable
for most environments. If stats aren't considered ABI, then I'm pretty sure we
could land some of the more generally useful ones upstream, but off-by-default
and guarded by a Kconfig. E.g. we have a pile of stats related to mmu_lock that
are very helpful in identifying performance issues, but they aren't things I would
want enabled by default.

But if we do decide stats aren't ABI, then we need to document and disseminate
that information.

[1] https://lore.kernel.org/all/CALzav=cuzT=u6G0TCVZUfEgAKOCKTSCDE8x2v5qc-Gd_NL-pzg@mail.gmail.com
[2] https://lore.kernel.org/all/[email protected]

> They have an established meaning and it's not a good idea to change it, but
> it's not an absolute no-no(*); and removing them is not a problem at all.
>
> For example, if stats were ABI, there would be no need for the
> introspection mechanism, you could just use a struct like ethtool
> stats (which *are* ABO).
>
> Not everything makes a good stat but, if in doubt and it's cheap
> enough to collect it, go ahead and add it.

Marc raised the (IMO) valid concern that "if it's cheap, add it" will lead to
death by a thousand cuts. E.g. add a few hundred vCPU stats and suddenly vCPUs
consumes an extra KiB or three of memory.

A few APIC stats obviously aren't going to move the needle much, I'm just pointing
out that not everyone agrees that KVM should be hyper permissive when it comes to
adding new stats.

> Cheap collection is the important point, because tracepoints in a hot path
> can be so expensive as to slow down the guest substantially, at least in
> microbenchmarks.
>
> In this case I'm not sure _all_ inhibits makes sense and I certainly
> wouldn't want a bitmask, but a generic APICv-enabled stat certainly
> makes sense, and perhaps another for a weirdly-configured local APIC.
>
> Paolo
>
> (*) you have to draw a line somewhere. New processor models may
> virtualize parts of the CPU in such a way that some stats become
> meaningless or just stay at zero. Should KVM not support those
> features because it is not possible anymore to introspect the guest
> through stat?

2024-04-17 09:48:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface

On Wed, Apr 17, 2024 at 12:35 AM Sean Christopherson <seanjc@googlecom> wrote:
> > > The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
> > > once they're added, and KVM needs to maintain the exact behavior.
> >
> > Stats are not ABI---why would they be?
>
> Because they exposed through an ioctl(), and userspace can and will use stats for
> functional purposes? Maybe I just had the wrong takeaway from an old thread about
> adding a big pile of stats[1], where unfortunately (for me) you weighed in on
> whether or not tracepoints are ABI, but not stats.

I think we can agree that:
- you don't want hundreds of stats (Marc's point)
- a large part of the stats are very stable, but just as many (while
useful) depend on details which are very much implementation dependent
- a subset of stats is pretty close to being ABI (e.g. guest_mode),
but others can change meaning depending on processor model, guest
configuration and/or guest type (e.g. all of them affect
interrupt-related stats due to APICv inhibits).

While there are exceptions, the main consumer of stats (but indeed not
the only one) is intended to be the user, not a program. This is the
same as tracepoints, and it's why the introspection bits exist.
(User-friendliness also means that bitmask stats are "ouch"; I guess
we could add support for bit-sized boolean stats is things get out of
control).

For many stats, using them for functional purposes would be
wrong/dumb. You have to be ready for the exact behavior to change even
if the stats remain the same. If userspace doesn't, it's being dumb.
KVM can't be blocked from supporting new features just because they
"break" stats, and shouldn't be blocked from adding useful debugging
stats just because userspace could be dumb.

For example, the point of pf_fast is mostly to compare it with other
pf_* stats and see if there's something smelly going on.pf_fast used
to affect pretty much only dirty bits; nowadays it also affects
accessed bits on !ept_ad machines and it does not affect dirty bits if
you have PML. So in the past it was possible to use pf_fast as a proxy
for the number of dirty pages, for example during migration. That
usage doesn't work anymore. Tough luck.

Perhaps you could use the halt-polling stats to toggle DISABLE_EXITS
for VMs that consume a lot of time polling. That would be a more
plausible use of stats to drive heuristics, but again, the power to
look into low-level details of KVM and guest behavior comes with the
accompanying responsibility. It is _not_ guaranteed to keep working in
the same way as processors come out and optimizations are added to
KVM.

So you would have to look at intentional stats breakages one by one,
and this is again a lot like tracepoints. And many potential breakages
would go unnoticed anyway, because there's an infinite supply of bad
ideas when it comes to stats and heuristics.

> That said, I'm definitely not opposed to stats _not_ being ABI, because that would
> give us a ton of flexibility. E.g. we have a non-trivial number of internal stats
> that are super useful _for us_, but are rather heavy and might not be desirable
> for most environments. If stats aren't considered ABI, then I'm pretty sure we
> could land some of the more generally useful ones upstream, but off-by-default
> and guarded by a Kconfig. E.g. we have a pile of stats related to mmu_lock that
> are very helpful in identifying performance issues, but they aren't things I would
> want enabled by default.

That would be great indeed.

> > Not everything makes a good stat but, if in doubt and it's cheap
> > enough to collect it, go ahead and add it.
>
> Marc raised the (IMO) valid concern that "if it's cheap, add it" will lead to
> death by a thousand cuts. E.g. add a few hundred vCPU stats and suddenly vCPUs
> consumes an extra KiB or three of memory.
>
> A few APIC stats obviously aren't going to move the needle much, I'm just pointing
> out that not everyone agrees that KVM should be hyper permissive when it comes to
> adding new stats.

Yeah, that's why I made it conditional to "if in doubt". "Stats are
not ABI" is not a free pass to add anything, also because the truth is
closer than "Stats are generally not ABI but keeping them stable is a
good idea". Many more stats are obviously bad to have upstream, than
there are good ones; and when adding stats it makes sense to consider
their stability but without making it an absolute criterion for
inclusion.

So for this patch, I would weigh advantages to be substantial:
+ APICv inhibits at this point are relatively stable
+ the performance impact is large enough that APICv/AVIC stats _can_
be useful, both boolean and cumulative ones; so for example I'd add an
interrupt_injections stat for unaccelerated injections causing a
vmexit or otherwise hitting lapic.c.

But absolutely would not go with a raw bitmask because:
- the exact set of inhibits is subject to change
- super high detail into niche APICv inhibits is unlikely to be useful
- many if not most inhibits are trivially derived from VM configuration

Paolo


2024-04-23 20:44:22

by Alejandro Jimenez

[permalink] [raw]
Subject: Re: [RFC 0/3] Export APICv-related state via binary stats interface



On 4/17/24 05:48, Paolo Bonzini wrote:
> On Wed, Apr 17, 2024 at 12:35 AM Sean Christopherson <[email protected]> wrote:
>>>> The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
>>>> once they're added, and KVM needs to maintain the exact behavior.
>>>
>>> Stats are not ABI---why would they be?
>>
>> Because they exposed through an ioctl(), and userspace can and will use stats for
>> functional purposes? Maybe I just had the wrong takeaway from an old thread about
>> adding a big pile of stats[1], where unfortunately (for me) you weighed in on
>> whether or not tracepoints are ABI, but not stats.
>
> I think we can agree that:
> - you don't want hundreds of stats (Marc's point)
> - a large part of the stats are very stable, but just as many (while
> useful) depend on details which are very much implementation dependent
> - a subset of stats is pretty close to being ABI (e.g. guest_mode),
> but others can change meaning depending on processor model, guest
> configuration and/or guest type (e.g. all of them affect
> interrupt-related stats due to APICv inhibits).
>
> While there are exceptions, the main consumer of stats (but indeed not
> the only one) is intended to be the user, not a program. This is the
> same as tracepoints, and it's why the introspection bits exist.
> (User-friendliness also means that bitmask stats are "ouch"; I guess
> we could add support for bit-sized boolean stats is things get out of
> control).
>
> For many stats, using them for functional purposes would be
> wrong/dumb. You have to be ready for the exact behavior to change even
> if the stats remain the same. If userspace doesn't, it's being dumb.
> KVM can't be blocked from supporting new features just because they
> "break" stats, and shouldn't be blocked from adding useful debugging
> stats just because userspace could be dumb.

[ trim ]

>
>> That said, I'm definitely not opposed to stats _not_ being ABI, because that would
>> give us a ton of flexibility. E.g. we have a non-trivial number of internal stats
>> that are super useful _for us_, but are rather heavy and might not be desirable
>> for most environments. If stats aren't considered ABI, then I'm pretty sure we
>> could land some of the more generally useful ones upstream, but off-by-default
>> and guarded by a Kconfig. E.g. we have a pile of stats related to mmu_lock that
>> are very helpful in identifying performance issues, but they aren't things I would
>> want enabled by default.
>
> That would be great indeed.
>
>>> Not everything makes a good stat but, if in doubt and it's cheap
>>> enough to collect it, go ahead and add it.
>>
>> Marc raised the (IMO) valid concern that "if it's cheap, add it" will lead to
>> death by a thousand cuts. E.g. add a few hundred vCPU stats and suddenly vCPUs
>> consumes an extra KiB or three of memory.
>>
>> A few APIC stats obviously aren't going to move the needle much, I'm just pointing
>> out that not everyone agrees that KVM should be hyper permissive when it comes to
>> adding new stats.
>
> Yeah, that's why I made it conditional to "if in doubt". "Stats are
> not ABI" is not a free pass to add anything, also because the truth is
> closer than "Stats are generally not ABI but keeping them stable is a
> good idea". Many more stats are obviously bad to have upstream, than
> there are good ones; and when adding stats it makes sense to consider
> their stability but without making it an absolute criterion for
> inclusion.
>
> So for this patch, I would weigh advantages to be substantial:
> + APICv inhibits at this point are relatively stable
> + the performance impact is large enough that APICv/AVIC stats _can_
> be useful, both boolean and cumulative ones; so for example I'd add an
> interrupt_injections stat for unaccelerated injections causing a
> vmexit or otherwise hitting lapic.c.

So far it seems there is support for using the stats interface to expose:

- APICv status: (apicv_enabled, boolean, per-vCPU)
- Windows guest using SynIC's AutoEOI: (synic_auto_eoi_used, boolean, per-VM)
- KVM PIT in reinject mode inhibits AVIC: (pit_reinject_mode, boolean, per-VM)
- APICv unaccelerated injections causing a vmexit (i.e. AVIC_INCOMPLETE_IPI,
AVIC_UNACCELERATED_ACCESS, APIC_WRITE): (apicv_unaccelerated_inj, counter,
per-vCPU)

You'll noticed that I framed the AutoEOI usage and PIT reinject mode as their
own stats, as opposed to linking them to APICv inhibits, so it requires a
certain level of knowledge about the APICv limitations to make the connection.
Is this the preferred approach or would we want to associate them more
explicitly to APICv inhibitions?

Alternatively...

>
> But absolutely would not go with a raw bitmask because:
> - the exact set of inhibits is subject to change
> - super high detail into niche APICv inhibits is unlikely to be useful
> - many if not most inhibits are trivially derived from VM configuration


I wasn't able to fully make my case during the last PUCK meeting, so I'll try
here again. I'd like to insist on the opportunity to provide visibility into
APICv state with minimal change by exposing the current inhibit reasons. I am
aware that Sean and Paolo have expressed opposition to it, so I'll only insist
one more time:

Putting aside valid concerns about setting a precedent for exposing a bitmask
via stats, I'd argue in this case the apicv_inhibit_reasons is essentially a
tri-state variable. i.e. we can have a per-vCPU stat with possible values:

0 --> APICv is enabled and active
1 --> APICv is disabled (either by module parameter or lack of HW support)
>1 --> APICv is disabled due to other reason(s)

The >1 values must be decoded using the kvm_host.h header, same as we are forced
to do for tracepoints until (shameless plug):
https://lore.kernel.org/all/[email protected]/
is hopefully merged.

So as long as the method for decoding is documented in KVM docs (which would I
add as a patch in the series), a single stat can fully expose the APICv state
and be useful for debugging too. It could replace three of the stats proposed:
apicv_enabled, synic_auto_eoi_used, pit_reinject_mode.

Cons:
- The exact set of inhibits is subject ot change / detail into niche APICv
inhibits is unlikely to be useful.

The state described by 0 and 1 values of the stat are unlikely to change, and it
costs nothing to provide additional meaning in the remaining bits. The fact that
inhibits are likely to change/grow makes this approach better, since we avoid
creating stats to describe scenarios that become obsolete e.g. Synic AutoEOI is
fully deprecated, or split irqchip becomes the default..

- many if not most inhibits are trivially derived from VM configuration

This assumes non-trivial and implementation specific knowledge about VMM/guest
defaults, and limitations of the AVIC/APICv host hardware, which have also been
known to change e.g. Milan (unofficially) supports AVIC, but not x2AVIC.

Some inhibits will be set by default currently e.g. QEMU uses in-kernel irqchip
by default since pc-q35-4.0.1, and KVM defaults to reinject mode when creating a
PIT.

ExtINT can be inferred via IRQ window exits which are rare, but on the reverse,
noticing that AVIC is inhibited due to an IRQ window request can draw attention
to problems in that unlikely code path, as I recently found while debugging an
issue on OVMF + Secure Boot. (anecdotal)

VMMs other than QEMU and/or guests other than Linux can be inhibited due to
"less common" reasons than the ones we see today. (hypothetical)

--

I understand concerns about overloading the stats interface, but seems a lost
opportunity and a handicap to usability if we provide a boolean for APICv,
but no information about the reasons why it was not be enabled, specially
since in some real scenarios (e.g. hosts enabling kernel lockdown in
confidentiality mode, normally paired with Secure Boot) that information will
not be accessible via other mechanisms like tracepoints or BPF.

If the idea of exposing the inhibit reasons is still a NACK, then please let me
know if you agree with the 4 proposed stats mentioned at the beginning and I'll
send a v2 that implements them.

Thank you,
Alejandro


>
> Paolo
>