There is a bug in KVM that hits VM-entry failure 100% of the time on platforms
supporting PT_MODE_HOST_GUEST mode, with the following steps:
1. #modprobe -r kvm_intel
2. #modprobe kvm_intel pt_mode=1
3. start a VM with QEMU
4. on host: #perf record -e intel_pt//
The VM-entry failure happens because it violates the requirement stated
in Intel SDM 26.2.1.1 "VM-Execution Control Fields":
If the logical processor is operating with Intel PT enabled (if
IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
IA32_RTIT_CTL" VM-entry control must be 0.
In PT_MODE_HOST_GUEST mode, VM_ENTRY_LOAD_IA32_RTIT_CTL is always set. Thus KVM
needs to ensure IA32_RTIT_CTL.TraceEn is 0 before VM-entry. Currently, KVM
manually does WRMSR(IA32_RTIT_CTL) to clear the TraceEn bit. However, this
doesn't work every time, since there is a possibility that IA32_RTIT_CTL.TraceEn
is re-enabled in the PT PMI handler before VM-entry. This series fixes the
issue by exposing two interfaces from the Intel PT driver to stop and
resume Intel PT on the host, preventing the PT PMI handler from
re-enabling PT. It also fixes another issue where the PT PMI handler
touches PT MSRs, causing the host values KVM has stored to become stale.
Xiaoyao Li (2):
perf/x86/intel/pt: Introduce intel_pt_{stop,resume}()
KVM: VMX: Stop/resume host PT before/after VM entry when
PT_MODE_HOST_GUEST
arch/x86/events/intel/pt.c | 11 ++++++++++-
arch/x86/include/asm/intel_pt.h | 6 ++++--
arch/x86/kernel/crash.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 11 ++++++++++-
4 files changed, 26 insertions(+), 6 deletions(-)
--
2.27.0
KVM supports PT_MODE_HOST_GUEST mode for Intel PT, in which host and guest
have separate Intel PT configurations and work independently. In that
mode, KVM needs to context switch all the Intel PT configurations
between host and guest on VM-entry and VM-exit.

Before VM-entry, if Intel PT is enabled on the host, KVM needs to disable it
first so as to context switch the PT configurations. After VM-exit, KVM
needs to re-enable Intel PT for the host. Currently, KVM achieves this by
manually toggling the MSR_IA32_RTIT_CTL.TRACEEN bit to enable/disable Intel PT.

However, a PT PMI can be delivered after the MSR_IA32_RTIT_CTL.TRACEEN bit is
cleared. The PT PMI handler changes PT MSRs and re-enables PT, which leads to
1) VM-entry failure of the guest, and 2) KVM storing stale values of PT MSRs.

To solve these problems, expose two interfaces for KVM to stop and
resume PT tracing.
Signed-off-by: Xiaoyao Li <[email protected]>
---
arch/x86/events/intel/pt.c | 11 ++++++++++-
arch/x86/include/asm/intel_pt.h | 6 ++++--
arch/x86/kernel/crash.c | 4 ++--
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 82ef87e9a897..55fc02036ff1 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1730,13 +1730,22 @@ static int pt_event_init(struct perf_event *event)
return 0;
}
-void cpu_emergency_stop_pt(void)
+void intel_pt_stop(void)
{
struct pt *pt = this_cpu_ptr(&pt_ctx);
if (pt->handle.event)
pt_event_stop(pt->handle.event, PERF_EF_UPDATE);
}
+EXPORT_SYMBOL_GPL(intel_pt_stop);
+
+void intel_pt_resume(void) {
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+ if (pt->handle.event)
+ pt_event_start(pt->handle.event, 0);
+}
+EXPORT_SYMBOL_GPL(intel_pt_resume);
int is_intel_pt_event(struct perf_event *event)
{
diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
index c796e9bc98b6..fdfa4d31740c 100644
--- a/arch/x86/include/asm/intel_pt.h
+++ b/arch/x86/include/asm/intel_pt.h
@@ -27,12 +27,14 @@ enum pt_capabilities {
};
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
-void cpu_emergency_stop_pt(void);
+void intel_pt_stop(void);
+void intel_pt_resume(void);
extern u32 intel_pt_validate_hw_cap(enum pt_capabilities cap);
extern u32 intel_pt_validate_cap(u32 *caps, enum pt_capabilities cap);
extern int is_intel_pt_event(struct perf_event *event);
#else
-static inline void cpu_emergency_stop_pt(void) {}
+static inline void intel_pt_stop(void) {}
+static inline void intel_pt_resume(void) {}
static inline u32 intel_pt_validate_hw_cap(enum pt_capabilities cap) { return 0; }
static inline u32 intel_pt_validate_cap(u32 *caps, enum pt_capabilities capability) { return 0; }
static inline int is_intel_pt_event(struct perf_event *event) { return 0; }
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 9730c88530fc..2f2f72a209c0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -93,7 +93,7 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
/*
* Disable Intel PT to stop its logging
*/
- cpu_emergency_stop_pt();
+ intel_pt_stop();
disable_local_APIC();
}
@@ -158,7 +158,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/*
* Disable Intel PT to stop its logging
*/
- cpu_emergency_stop_pt();
+ intel_pt_stop();
#ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
--
2.27.0
The current implementation of pt_guest_enter() has two issues when the PT mode
is PT_MODE_HOST_GUEST.

1. It relies on VM_ENTRY_LOAD_IA32_RTIT_CTL to disable the host's Intel PT
   for the case where the host's RTIT_CTL_TRACEEN is 1 while the guest's is 0.
   However, this causes VM-entry failure due to violating the requirement
   stated in the SDM section "VM-Execution Control Fields":

     If the logical processor is operating with Intel PT enabled (if
     IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
     IA32_RTIT_CTL" VM-entry control must be 0.

2. In the case where both host and guest enable Intel PT, it disables the
   host's Intel PT by manually clearing MSR_IA32_RTIT_CTL in order to
   context switch the host's and guest's PT configurations.

   However, a PT PMI can be delivered after that point and before VM entry.
   The PT PMI handler will a) update the host PT MSRs, which causes the values
   KVM stores in vmx->pt_desc.host to become stale, and b) re-enable Intel PT,
   which leads to VM-entry failure as in #1.

To fix these two issues, call intel_pt_stop(), exposed by the Intel PT
driver, to disable the host's Intel PT unconditionally. This ensures that
MSR_IA32_RTIT_CTL.TraceEn is 0 and that a subsequent PT PMI does nothing.
As the counterpart, call intel_pt_resume() after VM-exit.
Signed-off-by: Xiaoyao Li <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d7f8331d6f7e..3e9ce8f600d2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -38,6 +38,7 @@
#include <asm/fpu/api.h>
#include <asm/fpu/xstate.h>
#include <asm/idtentry.h>
+#include <asm/intel_pt.h>
#include <asm/io.h>
#include <asm/irq_remapping.h>
#include <asm/kexec.h>
@@ -1128,13 +1129,19 @@ static void pt_guest_enter(struct vcpu_vmx *vmx)
if (vmx_pt_mode_is_system())
return;
+ /*
+ * Stop Intel PT on host to avoid vm-entry failure since
+ * VM_ENTRY_LOAD_IA32_RTIT_CTL is set
+ */
+ intel_pt_stop();
+
/*
* GUEST_IA32_RTIT_CTL is already set in the VMCS.
* Save host state before VM entry.
*/
rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
- wrmsrl(MSR_IA32_RTIT_CTL, 0);
+ /* intel_pt_stop() ensures RTIT_CTL.TraceEn is zero */
pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
}
@@ -1156,6 +1163,8 @@ static void pt_guest_exit(struct vcpu_vmx *vmx)
*/
if (vmx->pt_desc.host.ctl)
wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
+
+ intel_pt_resume();
}
void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
--
2.27.0
On Thu, Aug 25, 2022, Xiaoyao Li wrote:
> KVM supports PT_MODE_HOST_GUEST mode for Intel PT, in which host and guest
> have separate Intel PT configurations and work independently. In that
> mode, KVM needs to context switch all the Intel PT configurations
> between host and guest on VM-entry and VM-exit.
>
> Before VM-entry, if Intel PT is enabled on the host, KVM needs to disable it
> first so as to context switch the PT configurations. After VM-exit, KVM
> needs to re-enable Intel PT for the host. Currently, KVM achieves this by
> manually toggling the MSR_IA32_RTIT_CTL.TRACEEN bit to enable/disable Intel PT.
>
> However, a PT PMI can be delivered after the MSR_IA32_RTIT_CTL.TRACEEN bit is
> cleared. The PT PMI handler changes PT MSRs and re-enables PT, which leads to
> 1) VM-entry failure of the guest, and 2) KVM storing stale values of PT MSRs.
>
> To solve the problems, expose two interfaces for KVM to stop and
> resume the PT tracing.
>
> Signed-off-by: Xiaoyao Li <[email protected]>
> ---
> arch/x86/events/intel/pt.c | 11 ++++++++++-
> arch/x86/include/asm/intel_pt.h | 6 ++++--
> arch/x86/kernel/crash.c | 4 ++--
> 3 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index 82ef87e9a897..55fc02036ff1 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -1730,13 +1730,22 @@ static int pt_event_init(struct perf_event *event)
> return 0;
> }
>
> -void cpu_emergency_stop_pt(void)
> +void intel_pt_stop(void)
> {
> struct pt *pt = this_cpu_ptr(&pt_ctx);
>
> if (pt->handle.event)
> pt_event_stop(pt->handle.event, PERF_EF_UPDATE);
> }
> +EXPORT_SYMBOL_GPL(intel_pt_stop);
> +
> +void intel_pt_resume(void) {
Curly brace goes on its own line.
> + struct pt *pt = this_cpu_ptr(&pt_ctx);
> +
> + if (pt->handle.event)
> + pt_event_start(pt->handle.event, 0);
> +}
> +EXPORT_SYMBOL_GPL(intel_pt_resume);
>
> int is_intel_pt_event(struct perf_event *event)
> {
On 8/25/2022 11:34 PM, Sean Christopherson wrote:
> On Thu, Aug 25, 2022, Xiaoyao Li wrote:
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d7f8331d6f7e..3e9ce8f600d2 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -38,6 +38,7 @@
>> #include <asm/fpu/api.h>
>> #include <asm/fpu/xstate.h>
>> #include <asm/idtentry.h>
>> +#include <asm/intel_pt.h>
>> #include <asm/io.h>
>> #include <asm/irq_remapping.h>
>> #include <asm/kexec.h>
>> @@ -1128,13 +1129,19 @@ static void pt_guest_enter(struct vcpu_vmx *vmx)
>> if (vmx_pt_mode_is_system())
>> return;
>>
>> + /*
>> + * Stop Intel PT on host to avoid vm-entry failure since
>> + * VM_ENTRY_LOAD_IA32_RTIT_CTL is set
>> + */
>> + intel_pt_stop();
>> +
>> /*
>> * GUEST_IA32_RTIT_CTL is already set in the VMCS.
>> * Save host state before VM entry.
>> */
>> rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
>
> KVM's manual save/restore of MSR_IA32_RTIT_CTL should be dropped.
No. It cannot. Please see below.
> If PT/RTIT can
> trace post-VMXON, then intel_pt_stop() will disable tracing and intel_pt_resume()
> will restore the host's desired value.
intel_pt_stop() and intel_pt_resume() touch the host's RTIT_CTL only when
the host enables/uses Intel PT. Otherwise, they're just no-ops. In that case,
we cannot assume the host's RTIT_CTL is zero (only RTIT_CTL.TraceEn is
0). After VM-exit, RTIT_CTL is cleared, so we need to restore it.
>> if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
>> - wrmsrl(MSR_IA32_RTIT_CTL, 0);
>> + /* intel_pt_stop() ensures RTIT_CTL.TraceEn is zero */
>> pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
>
> Isn't this at risk of the same corruption? What prevents a PT NMI that arrives
> after this point from changing other RTIT MSRs, thus causing KVM to restore the
> wrong values?
intel_pt_stop() -> pt_event_stop() will do
WRITE_ONCE(pt->handle_nmi, 0);
which ensures the PT NMI handler becomes a no-op, via the check at the
beginning of intel_pt_interrupt():
if (!READ_ONCE(pt->handle_nmi))
return;
>> pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
>> }
>> @@ -1156,6 +1163,8 @@ static void pt_guest_exit(struct vcpu_vmx *vmx)
>> */
>> if (vmx->pt_desc.host.ctl)
>> wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
>> +
>> + intel_pt_resume();
>> }
>>
>> void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
>> --
>> 2.27.0
>>
On Thu, Aug 25, 2022, Xiaoyao Li wrote:
> On 8/25/2022 11:34 PM, Sean Christopherson wrote:
> > On Thu, Aug 25, 2022, Xiaoyao Li wrote:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index d7f8331d6f7e..3e9ce8f600d2 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -38,6 +38,7 @@
> > > #include <asm/fpu/api.h>
> > > #include <asm/fpu/xstate.h>
> > > #include <asm/idtentry.h>
> > > +#include <asm/intel_pt.h>
> > > #include <asm/io.h>
> > > #include <asm/irq_remapping.h>
> > > #include <asm/kexec.h>
> > > @@ -1128,13 +1129,19 @@ static void pt_guest_enter(struct vcpu_vmx *vmx)
> > > if (vmx_pt_mode_is_system())
> > > return;
> > > + /*
> > > + * Stop Intel PT on host to avoid vm-entry failure since
> > > + * VM_ENTRY_LOAD_IA32_RTIT_CTL is set
> > > + */
> > > + intel_pt_stop();
> > > +
> > > /*
> > > * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> > > * Save host state before VM entry.
> > > */
> > > rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> >
> > KVM's manual save/restore of MSR_IA32_RTIT_CTL should be dropped.
>
> No. It cannot. Please see below.
>
> > If PT/RTIT can
> > trace post-VMXON, then intel_pt_stop() will disable tracing and intel_pt_resume()
> > will restore the host's desired value.
>
> intel_pt_stop() and intel_pt_resume() touch the host's RTIT_CTL only when the host
> enables/uses Intel PT. Otherwise, they're just no-ops. In that case, we cannot
> assume the host's RTIT_CTL is zero (only RTIT_CTL.TraceEn is 0). After
> VM-exit, RTIT_CTL is cleared, so we need to restore it.
But ensuring the RTIT_CTL.TraceEn=0 is all that's needed to make VM-Entry happy,
and if the host isn't using Intel PT, what do we care if other bits that, for all
intents and purposes are ignored, are lost across VM-Entry/VM-Exit? I gotta
imagine that perf will fully initialize RTIT_CTL if it starts using PT.
Actually, if the host isn't actively using Intel PT, can KVM avoid saving the
other RTIT MSRs?
Even better, can we hand that off to perf? I really dislike KVM making assumptions
about perf's internal behavior. E.g. can this be made to look like
intel_pt_guest_enter(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN);
and
intel_pt_guest_exit(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN);
> > > if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> > > - wrmsrl(MSR_IA32_RTIT_CTL, 0);
> > > + /* intel_pt_stop() ensures RTIT_CTL.TraceEn is zero */
> > > pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> >
> > Isn't this at risk of the same corruption? What prevents a PT NMI that arrives
> > after this point from changing other RTIT MSRs, thus causing KVM to restore the
> > wrong values?
>
> intel_pt_stop() -> pt_event_stop() will do
>
> WRITE_ONCE(pt->handle_nmi, 0);
>
> which ensures the PT NMI handler becomes a no-op, via the check at the
> beginning of intel_pt_interrupt():
>
> if (!READ_ONCE(pt->handle_nmi))
> return;
Ah, right.
>
> > > pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> > > }
> > > @@ -1156,6 +1163,8 @@ static void pt_guest_exit(struct vcpu_vmx *vmx)
> > > */
> > > if (vmx->pt_desc.host.ctl)
> > > wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> > > +
> > > + intel_pt_resume();
> > > }
> > > void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
> > > --
> > > 2.27.0
> > >
>
On Thu, Aug 25, 2022, Xiaoyao Li wrote:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d7f8331d6f7e..3e9ce8f600d2 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -38,6 +38,7 @@
> #include <asm/fpu/api.h>
> #include <asm/fpu/xstate.h>
> #include <asm/idtentry.h>
> +#include <asm/intel_pt.h>
> #include <asm/io.h>
> #include <asm/irq_remapping.h>
> #include <asm/kexec.h>
> @@ -1128,13 +1129,19 @@ static void pt_guest_enter(struct vcpu_vmx *vmx)
> if (vmx_pt_mode_is_system())
> return;
>
> + /*
> + * Stop Intel PT on host to avoid vm-entry failure since
> + * VM_ENTRY_LOAD_IA32_RTIT_CTL is set
> + */
> + intel_pt_stop();
> +
> /*
> * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> * Save host state before VM entry.
> */
> rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
KVM's manual save/restore of MSR_IA32_RTIT_CTL should be dropped. If PT/RTIT can
trace post-VMXON, then intel_pt_stop() will disable tracing and intel_pt_resume()
will restore the host's desired value.
> if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> - wrmsrl(MSR_IA32_RTIT_CTL, 0);
> + /* intel_pt_stop() ensures RTIT_CTL.TraceEn is zero */
> pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
Isn't this at risk of the same corruption? What prevents a PT NMI that arrives
after this point from changing other RTIT MSRs, thus causing KVM to restore the
wrong values?
> pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> }
> @@ -1156,6 +1163,8 @@ static void pt_guest_exit(struct vcpu_vmx *vmx)
> */
> if (vmx->pt_desc.host.ctl)
> wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> +
> + intel_pt_resume();
> }
>
> void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
> --
> 2.27.0
>
On 8/25/2022 11:59 PM, Sean Christopherson wrote:
> On Thu, Aug 25, 2022, Xiaoyao Li wrote:
>> On 8/25/2022 11:34 PM, Sean Christopherson wrote:
>>> On Thu, Aug 25, 2022, Xiaoyao Li wrote:
>>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>>> index d7f8331d6f7e..3e9ce8f600d2 100644
>>>> --- a/arch/x86/kvm/vmx/vmx.c
>>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>>> @@ -38,6 +38,7 @@
>>>> #include <asm/fpu/api.h>
>>>> #include <asm/fpu/xstate.h>
>>>> #include <asm/idtentry.h>
>>>> +#include <asm/intel_pt.h>
>>>> #include <asm/io.h>
>>>> #include <asm/irq_remapping.h>
>>>> #include <asm/kexec.h>
>>>> @@ -1128,13 +1129,19 @@ static void pt_guest_enter(struct vcpu_vmx *vmx)
>>>> if (vmx_pt_mode_is_system())
>>>> return;
>>>> + /*
>>>> + * Stop Intel PT on host to avoid vm-entry failure since
>>>> + * VM_ENTRY_LOAD_IA32_RTIT_CTL is set
>>>> + */
>>>> + intel_pt_stop();
>>>> +
>>>> /*
>>>> * GUEST_IA32_RTIT_CTL is already set in the VMCS.
>>>> * Save host state before VM entry.
>>>> */
>>>> rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
>>>
>>> KVM's manual save/restore of MSR_IA32_RTIT_CTL should be dropped.
>>
>> No. It cannot. Please see below.
>>
>>> If PT/RTIT can
>>> trace post-VMXON, then intel_pt_stop() will disable tracing and intel_pt_resume()
>>> will restore the host's desired value.
>>
>> intel_pt_stop() and intel_pt_resume() touch the host's RTIT_CTL only when the host
>> enables/uses Intel PT. Otherwise, they're just no-ops. In that case, we cannot
>> assume the host's RTIT_CTL is zero (only RTIT_CTL.TraceEn is 0). After
>> VM-exit, RTIT_CTL is cleared, so we need to restore it.
>
> But ensuring the RTIT_CTL.TraceEn=0 is all that's needed to make VM-Entry happy,
> and if the host isn't using Intel PT, what do we care if other bits that, for all
> intents and purposes are ignored, are lost across VM-Entry/VM-Exit? I gotta
> imagine that perf will fully initialize RTIT_CTL if it starts using PT.
Personally, I agree with it.
But I'm not sure whether there is a criterion that the host context needs to
remain unchanged after being virtualized.
> Actually, if the host isn't actively using Intel PT, can KVM avoid saving the
> other RTIT MSRs?
I don't think it's a good idea to require that the PT driver never relies
on the previous values of the PT MSRs. But it's OK if we hand this over
to perf, as in the idea you gave below.
> Even better, can we hand that off to perf? I really dislike KVM making assumptions
> about perf's internal behavior. E.g. can this be made to look like
You mean letting the perf subsystem do the host context save/restore, while
KVM focuses on save/restore of the guest context, right?
I would like to see comments from the perf folks on this, and we may need
their help on how to implement it.
> intel_pt_guest_enter(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN);
>
> and
>
> intel_pt_guest_exit(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN);
>
On Fri, Aug 26, 2022, Xiaoyao Li wrote:
> On 8/25/2022 11:59 PM, Sean Christopherson wrote:
> > But ensuring the RTIT_CTL.TraceEn=0 is all that's needed to make VM-Entry happy,
> > and if the host isn't using Intel PT, what do we care if other bits that, for all
> > intents and purposes are ignored, are lost across VM-Entry/VM-Exit? I gotta
> > imagine that perf will fully initialize RTIT_CTL if it starts using PT.
>
> Personally, I agree with it.
>
> But I'm not sure whether there is a criterion that the host context needs to
> remain unchanged after being virtualized.
>
> > Actually, if the host isn't actively using Intel PT, can KVM avoid saving the
> > other RTIT MSRs?
>
> I don't think it's a good idea to require that the PT driver never relies on
> the previous values of the PT MSRs. But it's OK if we hand this over to
> perf, as in the idea you gave below.
Yep, my thought exactly.
> > Even better, can we hand that off to perf? I really dislike KVM making assumptions
> > about perf's internal behavior. E.g. can this be made to look like
>
> You mean letting the perf subsystem do the host context save/restore, while
> KVM focuses on save/restore of the guest context, right?
Yep! KVM already more or less does this for "regular" PMU MSRs, though in that
case perf hands back a list of MSRs+data. But for Intel PT I don't see any point
in having KVM do the actual MSR accesses. Tracing has to be turned off _before_
VM-Enter, so using the MSR load/save lists doesn't buy us anything.
> I would like to see comment from perf folks on this and maybe need their
> help on how to implement.
On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> There is a bug in KVM that hits VM-entry failure 100% of the time on platforms
> supporting PT_MODE_HOST_GUEST mode, with the following steps:
>
> 1. #modprobe -r kvm_intel
> 2. #modprobe kvm_intel pt_mode=1
> 3. start a VM with QEMU
> 4. on host: #perf record -e intel_pt//
>
> The vm-entry failure happens because it violates the requirement stated in
> Intel SDM 26.2.1.1 VM-Execution Control Fields
>
> If the logical processor is operating with Intel PT enabled (if
> IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
> IA32_RTIT_CTL" VM-entry control must be 0.
>
> In PT_MODE_HOST_GUEST mode, VM_ENTRY_LOAD_IA32_RTIT_CTL is always set. Thus
> KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before VM-entry. Currently
> KVM manually does WRMSR(IA32_RTIT_CTL) to clear the TraceEn bit. However, it
> doesn't work every time, since there is a possibility that IA32_RTIT_CTL.TraceEn
> is re-enabled in the PT PMI handler before VM-entry. This series tries to fix the
> issue by exposing two interfaces from the Intel PT driver to stop and
> resume Intel PT on the host. It prevents the PT PMI handler from re-enabling PT. It
> also fixes another issue where the PT PMI handler touches PT MSRs, which causes
> what KVM stores for the host to become stale.
I'm thinking about another approach to fixing it. I think we need to have the
running host PT event disabled when we switch to the guest, so that we don't
receive a host PT interrupt at that point. Also, the host PT context can be
saved/restored by the host perf core (instead of KVM) when we disable/enable
the event.
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 82ef87e9a897..1d3e03ecaf6a 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)
pt_config_buffer(buf);
pt_config(event);
+ pt->event = event;
return;
@@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
return;
event->hw.state = PERF_HES_STOPPED;
+ pt->event = NULL;
if (mode & PERF_EF_UPDATE) {
struct pt_buffer *buf = perf_get_aux(&pt->handle);
@@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
}
}
+
+struct perf_event *pt_get_curr_event(void)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+ return pt->event;
+}
+EXPORT_SYMBOL_GPL(pt_get_curr_event);
+
static long pt_event_snapshot_aux(struct perf_event *event,
struct perf_output_handle *handle,
unsigned long size)
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 96906a62aacd..d46a85bb06bb 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -121,6 +121,7 @@ struct pt_filters {
* @output_mask: cached RTIT_OUTPUT_MASK MSR value
*/
struct pt {
+ struct perf_event *event;
struct perf_output_handle handle;
struct pt_filters filters;
int handle_nmi;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f6fc8dd51ef4..be8dd24922a7 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
#ifdef CONFIG_CPU_SUP_INTEL
extern void intel_pt_handle_vmx(int on);
+ extern struct perf_event *pt_get_curr_event(void);
#else
static inline void intel_pt_handle_vmx(int on)
{
+
}
+static inline struct perf_event *pt_get_curr_event(void) { return NULL; }
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d7f8331d6f7e..195debc1bff1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
static void pt_guest_enter(struct vcpu_vmx *vmx)
{
- if (vmx_pt_mode_is_system())
+ struct perf_event *event;
+
+ if (vmx_pt_mode_is_system() ||
+ !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
return;
- /*
- * GUEST_IA32_RTIT_CTL is already set in the VMCS.
- * Save host state before VM entry.
- */
- rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
- if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
- wrmsrl(MSR_IA32_RTIT_CTL, 0);
- pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
- pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
- }
+ event = pt_get_curr_event();
+ perf_event_disable(event);
+ vmx->pt_desc.host_event = event;
+ pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
}
static void pt_guest_exit(struct vcpu_vmx *vmx)
{
- if (vmx_pt_mode_is_system())
- return;
+ struct perf_event *event = vmx->pt_desc.host_event;
- if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
- pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
- pt_load_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
- }
+ if (vmx_pt_mode_is_system() ||
+ !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
+ return;
- /*
- * KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest,
- * i.e. RTIT_CTL is always cleared on VM-Exit. Restore it if necessary.
- */
- if (vmx->pt_desc.host.ctl)
- wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
+ pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
+ if (event)
+ perf_event_enable(event);
}
void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 24d58c2ffaa3..4c20bdabc85b 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -66,7 +66,7 @@ struct pt_desc {
u64 ctl_bitmask;
u32 num_address_ranges;
u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
- struct pt_ctx host;
+ struct perf_event *host_event;
struct pt_ctx guest;
};
On Mon, Aug 29, 2022, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d7f8331d6f7e..195debc1bff1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
>
> static void pt_guest_enter(struct vcpu_vmx *vmx)
> {
> - if (vmx_pt_mode_is_system())
> + struct perf_event *event;
> +
> + if (vmx_pt_mode_is_system() ||
> + !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
I don't think the host should trace the guest in the host/guest mode just because
the guest isn't tracing itself. I.e. the host still needs to turn off its own
tracing.
> return;
>
> - /*
> - * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> - * Save host state before VM entry.
> - */
> - rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> - if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> - wrmsrl(MSR_IA32_RTIT_CTL, 0);
> - pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> - pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> - }
> + event = pt_get_curr_event();
> + perf_event_disable(event);
> + vmx->pt_desc.host_event = event;
This is effectively what I suggested[*], the main difference being that my version
adds dedicated enter/exit helpers so that perf can skip save/restore of the other
MSRs. It's easy to extend if perf needs to hand back an event to complete the exit.
bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;
vmx->pt_desc.host_event = intel_pt_guest_enter(guest_trace_enabled);
and then on exit
bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;
intel_pt_guest_exit(vmx->pt_desc.host_event, guest_trace_enabled);
[*] https://lore.kernel.org/all/YwecducnM%[email protected]
> + pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> }
On Tuesday, August 30, 2022 1:34 AM, Sean Christopherson wrote:
> On Mon, Aug 29, 2022, Wang, Wei W wrote:
> > On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> > #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff
> > --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
> > d7f8331d6f7e..195debc1bff1 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx
> > *ctx, u32 addr_range)
> >
> > static void pt_guest_enter(struct vcpu_vmx *vmx) {
> > - if (vmx_pt_mode_is_system())
> > + struct perf_event *event;
> > +
> > + if (vmx_pt_mode_is_system() ||
> > + !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
>
> I don't think the host should trace the guest in the host/guest mode just
> because the guest isn't tracing itself. I.e. the host still needs to turn off its
> own tracing.
Right, need to fix this one.
> This is effectively what I suggested[*], the main difference being that my
> version adds dedicated enter/exit helpers so that perf can skip save/restore of
> the other MSRs.
What "other MSRs" were you referring to?
(I suppose you meant perf_event_disable needs to save more MSRs)
> It's easy to extend if perf needs to hand back an event to
> complete the exit.
>
> bool guest_trace_enabled = vmx->pt_desc.guest.ctl &
> RTIT_CTL_TRACEEN;
>
> vmx->pt_desc.host_event = intel_pt_guest_enter(guest_trace_enabled);
>
>
> and then on exit
>
> bool guest_trace_enabled = vmx->pt_desc.guest.ctl &
> RTIT_CTL_TRACEEN;
>
> intel_pt_guest_exit(vmx->pt_desc.host_event, guest_trace_enabled);
>
> [*] https://lore.kernel.org/all/YwecducnM%[email protected]
Yes, this can work. But I feel it somewhat violates a general rule
that I have learned from previous experience:
KVM should be a user of the perf subsystem, instead of implementing a secondary
driver beyond perf's management.
Being a user of perf means everything possible should go through "perf event",
which is the interface that perf exposes to users.
On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>> There is a bug in KVM that hits VM-entry failure 100% of the time on platforms
>> supporting PT_MODE_HOST_GUEST mode, with the following steps:
>>
>> 1. #modprobe -r kvm_intel
>> 2. #modprobe kvm_intel pt_mode=1
>> 3. start a VM with QEMU
>> 4. on host: #perf record -e intel_pt//
>>
>> The vm-entry failure happens because it violates the requirement stated in
>> Intel SDM 26.2.1.1 VM-Execution Control Fields
>>
>> If the logical processor is operating with Intel PT enabled (if
>> IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>> IA32_RTIT_CTL" VM-entry control must be 0.
>>
>> In PT_MODE_HOST_GUEST mode, VM_ENTRY_LOAD_IA32_RTIT_CTL is always set. Thus
>> KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before VM-entry. Currently
>> KVM manually does WRMSR(IA32_RTIT_CTL) to clear the TraceEn bit. However, it
>> doesn't work every time, since there is a possibility that IA32_RTIT_CTL.TraceEn
>> is re-enabled in the PT PMI handler before VM-entry. This series tries to fix the
>> issue by exposing two interfaces from the Intel PT driver to stop and
>> resume Intel PT on the host. It prevents the PT PMI handler from re-enabling PT. It
>> also fixes another issue where the PT PMI handler touches PT MSRs, which causes
>> what KVM stores for the host to become stale.
>
> I'm thinking about another approach to fix it: disable the running host PT
> event when we switch to the guest, since we don't expect to receive the
> host PT interrupt at that point. Also, the host PT context can be
> saved/restored by the host perf core (instead of KVM) when we
> disable/enable the event.
>
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index 82ef87e9a897..1d3e03ecaf6a 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)
>
> pt_config_buffer(buf);
> pt_config(event);
> + pt->event = event;
>
> return;
>
> @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
> return;
>
> event->hw.state = PERF_HES_STOPPED;
> + pt->event = NULL;
>
> if (mode & PERF_EF_UPDATE) {
> struct pt_buffer *buf = perf_get_aux(&pt->handle);
> @@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
> }
> }
>
> +
> +struct perf_event *pt_get_curr_event(void)
> +{
> + struct pt *pt = this_cpu_ptr(&pt_ctx);
Wei,
I'm not sure if we can use pt->handle.event instead or not.
> + return pt->event;
> +}
> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
> +
> static long pt_event_snapshot_aux(struct perf_event *event,
> struct perf_output_handle *handle,
> unsigned long size)
> diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
> index 96906a62aacd..d46a85bb06bb 100644
> --- a/arch/x86/events/intel/pt.h
> +++ b/arch/x86/events/intel/pt.h
> @@ -121,6 +121,7 @@ struct pt_filters {
> * @output_mask: cached RTIT_OUTPUT_MASK MSR value
> */
> struct pt {
> + struct perf_event *event;
> struct perf_output_handle handle;
> struct pt_filters filters;
> int handle_nmi;
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index f6fc8dd51ef4..be8dd24922a7 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
>
> #ifdef CONFIG_CPU_SUP_INTEL
> extern void intel_pt_handle_vmx(int on);
> + extern struct perf_event *pt_get_curr_event(void);
> #else
> static inline void intel_pt_handle_vmx(int on)
> {
>
> +
> }
> +static inline struct perf_event *pt_get_curr_event(void) { return NULL; }
> #endif
>
On Thursday, September 8, 2022 3:26 PM, Li, Xiaoyao wrote:
> > +
> > +struct perf_event *pt_get_curr_event(void) {
> > + struct pt *pt = this_cpu_ptr(&pt_ctx);
>
> Wei,
>
> I'm not sure if we can use pt->handle.event instead or not.
>
> > + return pt->event;
Yes, I think we could reuse that.
On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>> There is one bug in KVM that can hit vm-entry failure 100% on platform
>> supporting PT_MODE_HOST_GUEST mode following below steps:
>>
>> 1. #modprobe -r kvm_intel
>> 2. #modprobe kvm_intel pt_mode=1
>> 3. start a VM with QEMU
>> 4. on host: #perf record -e intel_pt//
>>
>> The vm-entry failure happens because it violates the requirement stated in
>> Intel SDM 26.2.1.1 VM-Execution Control Fields
>>
>> If the logical processor is operating with Intel PT enabled (if
>> IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>> IA32_RTIT_CTL" VM-entry control must be 0.
>>
>> In PT_MODE_HOST_GUEST mode, the "load IA32_RTIT_CTL" VM-entry control is
>> always set. Thus KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before
>> VM-entry. Currently, KVM manually issues WRMSR(IA32_RTIT_CTL) to clear the
>> TraceEn bit. However, this doesn't always work, since IA32_RTIT_CTL.TraceEn
>> may be re-enabled by the PT PMI handler before VM-entry. This series fixes
>> the issue by exposing two interfaces from the Intel PT driver to stop and
>> resume Intel PT on the host, which prevents the PT PMI handler from
>> re-enabling PT. It also fixes another issue: the PT PMI handler touches the
>> PT MSRs, so the host state that KVM has saved becomes stale.
>
> I'm thinking about another approach to fix it: disable the running host PT
> event when we switch to the guest, since we don't expect to receive the
> host PT interrupt at that point. Also, the host PT context can be
> saved/restored by the host perf core (instead of KVM) when we
> disable/enable the event.
>
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index 82ef87e9a897..1d3e03ecaf6a 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)
>
> pt_config_buffer(buf);
> pt_config(event);
> + pt->event = event;
>
> return;
>
> @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
> return;
>
> event->hw.state = PERF_HES_STOPPED;
> + pt->event = NULL;
>
> if (mode & PERF_EF_UPDATE) {
> struct pt_buffer *buf = perf_get_aux(&pt->handle);
> @@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
> }
> }
>
> +
> +struct perf_event *pt_get_curr_event(void)
> +{
> + struct pt *pt = this_cpu_ptr(&pt_ctx);
> +
> + return pt->event;
> +}
> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
> +
> static long pt_event_snapshot_aux(struct perf_event *event,
> struct perf_output_handle *handle,
> unsigned long size)
> diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
> index 96906a62aacd..d46a85bb06bb 100644
> --- a/arch/x86/events/intel/pt.h
> +++ b/arch/x86/events/intel/pt.h
> @@ -121,6 +121,7 @@ struct pt_filters {
> * @output_mask: cached RTIT_OUTPUT_MASK MSR value
> */
> struct pt {
> + struct perf_event *event;
> struct perf_output_handle handle;
> struct pt_filters filters;
> int handle_nmi;
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index f6fc8dd51ef4..be8dd24922a7 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
>
> #ifdef CONFIG_CPU_SUP_INTEL
> extern void intel_pt_handle_vmx(int on);
> + extern struct perf_event *pt_get_curr_event(void);
> #else
> static inline void intel_pt_handle_vmx(int on)
> {
>
> +
> }
> +static inline struct perf_event *pt_get_curr_event(void) { return NULL; }
> #endif
>
> #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d7f8331d6f7e..195debc1bff1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
>
> static void pt_guest_enter(struct vcpu_vmx *vmx)
> {
> - if (vmx_pt_mode_is_system())
> + struct perf_event *event;
> +
> + if (vmx_pt_mode_is_system() ||
> + !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
> return;
>
> - /*
> - * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> - * Save host state before VM entry.
> - */
> - rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> - if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> - wrmsrl(MSR_IA32_RTIT_CTL, 0);
> - pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> - pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> - }
> + event = pt_get_curr_event();
> + perf_event_disable(event);
perf_event_disable() is not allowed in a preemption-disabled context, since
perf_event_disable()
-> perf_event_ctx_lock()
-> perf_event_ctx_lock_nested()
-> mutex_lock_nested()
and it causes
[ 3542.164553] BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:580
[ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
1518, name: CPU 0/KVM
[ 3542.165703] preempt_count: 1, expected: 0
[ 3542.166006] RCU nest depth: 0, expected: 0
[ 3542.166315] INFO: lockdep is turned off.
[ 3542.166614] irq event stamp: 0
[ 3542.166857] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>]
copy_process+0x8e8/0x1bd0
[ 3542.167874] softirqs last enabled at (0): [<ffffffff94699ac8>]
copy_process+0x8e8/0x1bd0
[ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 3542.168891] Preemption disabled at:
[ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
[ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted
6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
[ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE,
BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019
[ 3542.171188] Call Trace:
[ 3542.171392] <TASK>
[ 3542.171572] show_stack+0x52/0x5c
[ 3542.171831] dump_stack_lvl+0x5b/0x86
[ 3542.172112] dump_stack+0x10/0x16
[ 3542.172371] __might_resched.cold+0x135/0x15b
[ 3542.172698] __might_sleep+0x52/0xa0
[ 3542.172975] __mutex_lock+0x4e/0x4d0
[ 3542.173251] ? kvm_sched_in+0x4f/0x60 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.173839] ? perf_event_ctx_lock_nested+0xc8/0x230
[ 3542.174202] ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.174551] ? lock_acquire+0xfc/0x150
[ 3542.174840] ? perf_event_ctx_lock_nested+0x24/0x230
[ 3542.175205] mutex_lock_nested+0x1c/0x30
[ 3542.175505] perf_event_ctx_lock_nested+0xc8/0x230
[ 3542.181147] ? perf_event_ctx_lock_nested+0x24/0x230
[ 3542.186839] perf_event_disable+0x19/0x80
[ 3542.192502] vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel
7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
[ 3542.203771] ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.209378] vcpu_enter_guest+0xa96/0x1350 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.215218] ? vcpu_enter_guest+0xbe1/0x1350 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.226292] ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.231956] ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.237542] ? lock_acquire+0xfc/0x150
[ 3542.243093] ? __rseq_handle_notify_resume+0x3a/0x60
[ 3542.248689] vcpu_run+0x53/0x490 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.254533] ? vcpu_run+0x35a/0x490 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.260567] kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.272395] ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.284586] kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.290973] ? lock_acquire+0xfc/0x150
[ 3542.296990] ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.302912] ? lock_release+0x118/0x190
[ 3542.308800] ? __fget_files+0xe8/0x1c0
[ 3542.314710] ? __fget_files+0x5/0x1c0
[ 3542.320591] __x64_sys_ioctl+0x96/0xd0
[ 3542.326500] do_syscall_64+0x3a/0x90
[ 3542.332426] entry_SYSCALL_64_after_hwframe+0x5e/0xc8
I know little about perf. It seems perf_event_disable() is not widely used
by other kernel components. Is there an alternative? If not, I think
exposing disable/enable helpers from the PT driver, as this series does,
seems OK.
> + vmx->pt_desc.host_event = event;
> + pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> }
>
> static void pt_guest_exit(struct vcpu_vmx *vmx)
> {
> - if (vmx_pt_mode_is_system())
> - return;
> + struct perf_event *event = vmx->pt_desc.host_event;
>
> - if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> - pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> - pt_load_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> - }
> + if (vmx_pt_mode_is_system() ||
> + !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
> + return;
>
> - /*
> - * KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest,
> - * i.e. RTIT_CTL is always cleared on VM-Exit. Restore it if necessary.
> - */
> - if (vmx->pt_desc.host.ctl)
> - wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> + pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> + if (event)
> + perf_event_enable(event);
> }
>
> void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 24d58c2ffaa3..4c20bdabc85b 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -66,7 +66,7 @@ struct pt_desc {
> u64 ctl_bitmask;
> u32 num_address_ranges;
> u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
> - struct pt_ctx host;
> + struct perf_event *host_event;
> struct pt_ctx guest;
> };
>
>
On Wednesday, September 14, 2022 12:16 PM, Li, Xiaoyao wrote:
> On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> > On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> >> There is one bug in KVM that can hit vm-entry failure 100% on
> >> platform supporting PT_MODE_HOST_GUEST mode following below steps:
> >>
> >> 1. #modprobe -r kvm_intel
> >> 2. #modprobe kvm_intel pt_mode=1
> >> 3. start a VM with QEMU
> >> 4. on host: #perf record -e intel_pt//
> >>
> >> The vm-entry failure happens because it violates the requirement
> >> stated in Intel SDM 26.2.1.1 VM-Execution Control Fields
> >>
> >> If the logical processor is operating with Intel PT enabled (if
> >> IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
> >> IA32_RTIT_CTL" VM-entry control must be 0.
> >>
> >> In PT_MODE_HOST_GUEST mode, the "load IA32_RTIT_CTL" VM-entry control
> >> is always set. Thus KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0
> >> before VM-entry. Currently, KVM manually issues WRMSR(IA32_RTIT_CTL) to
> >> clear the TraceEn bit. However, this doesn't always work, since
> >> IA32_RTIT_CTL.TraceEn may be re-enabled by the PT PMI handler before
> >> VM-entry. This series fixes the issue by exposing two interfaces from
> >> the Intel PT driver to stop and resume Intel PT on the host, which
> >> prevents the PT PMI handler from re-enabling PT. It also fixes another
> >> issue: the PT PMI handler touches the PT MSRs, so the host state that
> >> KVM has saved becomes stale.
> >
> > I'm thinking about another approach to fix it: disable the running host
> > PT event when we switch to the guest, since we don't expect to receive
> > the host PT interrupt at that point. Also, the host PT context can be
> > saved/restored by the host perf core (instead of KVM) when we
> > disable/enable the event.
> >
> > diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> > index 82ef87e9a897..1d3e03ecaf6a 100644
> > --- a/arch/x86/events/intel/pt.c
> > +++ b/arch/x86/events/intel/pt.c
> > @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event
> > *event, int mode)
> >
> > pt_config_buffer(buf);
> > pt_config(event);
> > + pt->event = event;
> >
> > return;
> >
> > @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event
> *event, int mode)
> > return;
> >
> > event->hw.state = PERF_HES_STOPPED;
> > + pt->event = NULL;
> >
> > if (mode & PERF_EF_UPDATE) {
> > struct pt_buffer *buf = perf_get_aux(&pt->handle);
> @@
> > -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event,
> int mode)
> > }
> > }
> >
> > +
> > +struct perf_event *pt_get_curr_event(void) {
> > + struct pt *pt = this_cpu_ptr(&pt_ctx);
> > +
> > + return pt->event;
> > +}
> > +EXPORT_SYMBOL_GPL(pt_get_curr_event);
> > +
> > static long pt_event_snapshot_aux(struct perf_event *event,
> > struct perf_output_handle
> *handle,
> > unsigned long size) diff --git
> > a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h index
> > 96906a62aacd..d46a85bb06bb 100644
> > --- a/arch/x86/events/intel/pt.h
> > +++ b/arch/x86/events/intel/pt.h
> > @@ -121,6 +121,7 @@ struct pt_filters {
> > * @output_mask: cached RTIT_OUTPUT_MASK MSR value
> > */
> > struct pt {
> > + struct perf_event *event;
> > struct perf_output_handle handle;
> > struct pt_filters filters;
> > int handle_nmi;
> > diff --git a/arch/x86/include/asm/perf_event.h
> > b/arch/x86/include/asm/perf_event.h
> > index f6fc8dd51ef4..be8dd24922a7 100644
> > --- a/arch/x86/include/asm/perf_event.h
> > +++ b/arch/x86/include/asm/perf_event.h
> > @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct
> > x86_pmu_lbr *lbr)
> >
> > #ifdef CONFIG_CPU_SUP_INTEL
> > extern void intel_pt_handle_vmx(int on);
> > + extern struct perf_event *pt_get_curr_event(void);
> > #else
> > static inline void intel_pt_handle_vmx(int on)
> > {
> >
> > +
> > }
> > +static inline struct perf_event *pt_get_curr_event(void) { return NULL; }
> > #endif
> >
> > #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff
> > --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
> > d7f8331d6f7e..195debc1bff1 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx
> > *ctx, u32 addr_range)
> >
> > static void pt_guest_enter(struct vcpu_vmx *vmx)
> > {
> > - if (vmx_pt_mode_is_system())
> > + struct perf_event *event;
> > +
> > + if (vmx_pt_mode_is_system() ||
> > + !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
> > return;
> >
> > - /*
> > - * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> > - * Save host state before VM entry.
> > - */
> > - rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> > - if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> > - wrmsrl(MSR_IA32_RTIT_CTL, 0);
> > - pt_save_msr(&vmx->pt_desc.host,
> vmx->pt_desc.num_address_ranges);
> > - pt_load_msr(&vmx->pt_desc.guest,
> vmx->pt_desc.num_address_ranges);
> > - }
> > + event = pt_get_curr_event();
> > + perf_event_disable(event);
>
> perf_event_disable() is not allowed in preemption disabled context, since
>
> perf_event_disable()
> -> perf_event_ctx_lock()
> -> perf_event_ctx_lock_nested()
> -> mutex_lock_nested()
>
> and it causes
>
> [ 3542.164553] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:580
> [ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
> 1518, name: CPU 0/KVM
> [ 3542.165703] preempt_count: 1, expected: 0 [ 3542.166006] RCU nest depth:
> 0, expected: 0 [ 3542.166315] INFO: lockdep is turned off.
> [ 3542.166614] irq event stamp: 0
> [ 3542.166857] hardirqs last enabled at (0): [<0000000000000000>] 0x0
> [ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>]
> copy_process+0x8e8/0x1bd0
> [ 3542.167874] softirqs last enabled at (0): [<ffffffff94699ac8>]
> copy_process+0x8e8/0x1bd0
> [ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [ 3542.168891] Preemption disabled at:
> [ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
> [ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted
> 6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
> [ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE,
> BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019 [ 3542.171188] Call
> Trace:
> [ 3542.171392] <TASK>
> [ 3542.171572] show_stack+0x52/0x5c
> [ 3542.171831] dump_stack_lvl+0x5b/0x86 [ 3542.172112]
> dump_stack+0x10/0x16 [ 3542.172371] __might_resched.cold+0x135/0x15b
> [ 3542.172698] __might_sleep+0x52/0xa0 [ 3542.172975]
> __mutex_lock+0x4e/0x4d0 [ 3542.173251] ? kvm_sched_in+0x4f/0x60 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.173839] ? perf_event_ctx_lock_nested+0xc8/0x230
> [ 3542.174202] ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.174551] ? lock_acquire+0xfc/0x150 [ 3542.174840] ?
> perf_event_ctx_lock_nested+0x24/0x230
> [ 3542.175205] mutex_lock_nested+0x1c/0x30 [ 3542.175505]
> perf_event_ctx_lock_nested+0xc8/0x230
> [ 3542.181147] ? perf_event_ctx_lock_nested+0x24/0x230
> [ 3542.186839] perf_event_disable+0x19/0x80 [ 3542.192502]
> vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel
> 7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
> [ 3542.203771] ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.209378] vcpu_enter_guest+0xa96/0x1350 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.215218] ? vcpu_enter_guest+0xbe1/0x1350 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.226292] ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.231956] ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.237542] ? lock_acquire+0xfc/0x150 [ 3542.243093] ?
> __rseq_handle_notify_resume+0x3a/0x60
> [ 3542.248689] vcpu_run+0x53/0x490 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.254533] ? vcpu_run+0x35a/0x490 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.260567] kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.272395] ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.284586] kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.290973] ? lock_acquire+0xfc/0x150 [ 3542.296990] ?
> rcu_read_lock_sched_held+0x16/0x90
> [ 3542.302912] ? lock_release+0x118/0x190 [ 3542.308800] ?
> __fget_files+0xe8/0x1c0 [ 3542.314710] ? __fget_files+0x5/0x1c0
> [ 3542.320591] __x64_sys_ioctl+0x96/0xd0 [ 3542.326500]
> do_syscall_64+0x3a/0x90 [ 3542.332426]
> entry_SYSCALL_64_after_hwframe+0x5e/0xc8
>
>
> I know little about perf.
+Kan to double confirm if needed.
> It seems perf_event_disable() is not widely used by other kernel
> components. Is there an alternative? If not, I think exposing
> disable/enable helpers from the PT driver, as this series does, seems OK.
Oops, it was actually a mistake to disable the event on other CPUs.
Can you expose and use perf_event_disable_local?
For the enabling side, we need to add and expose perf_event_enable_local:
event_function_local(event, __perf_event_enable, NULL);
On 2022-09-14 2:16 a.m., Wang, Wei W wrote:
> On Wednesday, September 14, 2022 12:16 PM, Li, Xiaoyao wrote:
>> On 8/29/2022 3:49 PM, Wang, Wei W wrote:
>>> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>>>> There is one bug in KVM that can hit vm-entry failure 100% on
>>>> platform supporting PT_MODE_HOST_GUEST mode following below steps:
>>>>
>>>> 1. #modprobe -r kvm_intel
>>>> 2. #modprobe kvm_intel pt_mode=1
>>>> 3. start a VM with QEMU
>>>> 4. on host: #perf record -e intel_pt//
>>>>
>>>> The vm-entry failure happens because it violates the requirement
>>>> stated in Intel SDM 26.2.1.1 VM-Execution Control Fields
>>>>
>>>> If the logical processor is operating with Intel PT enabled (if
>>>> IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>>>> IA32_RTIT_CTL" VM-entry control must be 0.
>>>>
>>>> In PT_MODE_HOST_GUEST mode, the "load IA32_RTIT_CTL" VM-entry control
>>>> is always set. Thus KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0
>>>> before VM-entry. Currently, KVM manually issues WRMSR(IA32_RTIT_CTL) to
>>>> clear the TraceEn bit. However, this doesn't always work, since
>>>> IA32_RTIT_CTL.TraceEn may be re-enabled by the PT PMI handler before
>>>> VM-entry. This series fixes the issue by exposing two interfaces from
>>>> the Intel PT driver to stop and resume Intel PT on the host, which
>>>> prevents the PT PMI handler from re-enabling PT. It also fixes another
>>>> issue: the PT PMI handler touches the PT MSRs, so the host state that
>>>> KVM has saved becomes stale.
>>>
>>> I'm thinking about another approach to fix it: disable the running host
>>> PT event when we switch to the guest, since we don't expect to receive
>>> the host PT interrupt at that point. Also, the host PT context can be
>>> saved/restored by the host perf core (instead of KVM) when we
>>> disable/enable the event.
>>>
>>> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
>>> index 82ef87e9a897..1d3e03ecaf6a 100644
>>> --- a/arch/x86/events/intel/pt.c
>>> +++ b/arch/x86/events/intel/pt.c
>>> @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event
>>> *event, int mode)
>>>
>>> pt_config_buffer(buf);
>>> pt_config(event);
>>> + pt->event = event;
>>>
>>> return;
>>>
>>> @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event
>> *event, int mode)
>>> return;
>>>
>>> event->hw.state = PERF_HES_STOPPED;
>>> + pt->event = NULL;
>>>
>>> if (mode & PERF_EF_UPDATE) {
>>> struct pt_buffer *buf = perf_get_aux(&pt->handle);
>> @@
>>> -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event,
>> int mode)
>>> }
>>> }
>>>
>>> +
>>> +struct perf_event *pt_get_curr_event(void) {
>>> + struct pt *pt = this_cpu_ptr(&pt_ctx);
>>> +
>>> + return pt->event;
>>> +}
>>> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
>>> +
>>> static long pt_event_snapshot_aux(struct perf_event *event,
>>> struct perf_output_handle
>> *handle,
>>> unsigned long size) diff --git
>>> a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h index
>>> 96906a62aacd..d46a85bb06bb 100644
>>> --- a/arch/x86/events/intel/pt.h
>>> +++ b/arch/x86/events/intel/pt.h
>>> @@ -121,6 +121,7 @@ struct pt_filters {
>>> * @output_mask: cached RTIT_OUTPUT_MASK MSR value
>>> */
>>> struct pt {
>>> + struct perf_event *event;
>>> struct perf_output_handle handle;
>>> struct pt_filters filters;
>>> int handle_nmi;
>>> diff --git a/arch/x86/include/asm/perf_event.h
>>> b/arch/x86/include/asm/perf_event.h
>>> index f6fc8dd51ef4..be8dd24922a7 100644
>>> --- a/arch/x86/include/asm/perf_event.h
>>> +++ b/arch/x86/include/asm/perf_event.h
>>> @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct
>>> x86_pmu_lbr *lbr)
>>>
>>> #ifdef CONFIG_CPU_SUP_INTEL
>>> extern void intel_pt_handle_vmx(int on);
>>> + extern struct perf_event *pt_get_curr_event(void);
>>> #else
>>> static inline void intel_pt_handle_vmx(int on)
>>> {
>>>
>>> +
>>> }
>>> +static inline struct perf_event *pt_get_curr_event(void) { return NULL; }
>>> #endif
>>>
>>> #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
>> diff
>>> --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
>>> d7f8331d6f7e..195debc1bff1 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx
>>> *ctx, u32 addr_range)
>>>
>>> static void pt_guest_enter(struct vcpu_vmx *vmx)
>>> {
>>> - if (vmx_pt_mode_is_system())
>>> + struct perf_event *event;
>>> +
>>> + if (vmx_pt_mode_is_system() ||
>>> + !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
>>> return;
>>>
>>> - /*
>>> - * GUEST_IA32_RTIT_CTL is already set in the VMCS.
>>> - * Save host state before VM entry.
>>> - */
>>> - rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
>>> - if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
>>> - wrmsrl(MSR_IA32_RTIT_CTL, 0);
>>> - pt_save_msr(&vmx->pt_desc.host,
>> vmx->pt_desc.num_address_ranges);
>>> - pt_load_msr(&vmx->pt_desc.guest,
>> vmx->pt_desc.num_address_ranges);
>>> - }
>>> + event = pt_get_curr_event();
>>> + perf_event_disable(event);
>>
>> perf_event_disable() is not allowed in preemption disabled context, since
>>
>> perf_event_disable()
>> -> perf_event_ctx_lock()
>> -> perf_event_ctx_lock_nested()
>> -> mutex_lock_nested()
>>
>> and it causes
>>
>> [ 3542.164553] BUG: sleeping function called from invalid context at
>> kernel/locking/mutex.c:580
>> [ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
>> 1518, name: CPU 0/KVM
>> [ 3542.165703] preempt_count: 1, expected: 0 [ 3542.166006] RCU nest depth:
>> 0, expected: 0 [ 3542.166315] INFO: lockdep is turned off.
>> [ 3542.166614] irq event stamp: 0
>> [ 3542.166857] hardirqs last enabled at (0): [<0000000000000000>] 0x0
>> [ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>]
>> copy_process+0x8e8/0x1bd0
>> [ 3542.167874] softirqs last enabled at (0): [<ffffffff94699ac8>]
>> copy_process+0x8e8/0x1bd0
>> [ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
>> [ 3542.168891] Preemption disabled at:
>> [ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
>> [ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted
>> 6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
>> [ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE,
>> BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019 [ 3542.171188] Call
>> Trace:
>> [ 3542.171392] <TASK>
>> [ 3542.171572] show_stack+0x52/0x5c
>> [ 3542.171831] dump_stack_lvl+0x5b/0x86 [ 3542.172112]
>> dump_stack+0x10/0x16 [ 3542.172371] __might_resched.cold+0x135/0x15b
>> [ 3542.172698] __might_sleep+0x52/0xa0 [ 3542.172975]
>> __mutex_lock+0x4e/0x4d0 [ 3542.173251] ? kvm_sched_in+0x4f/0x60 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.173839] ? perf_event_ctx_lock_nested+0xc8/0x230
>> [ 3542.174202] ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.174551] ? lock_acquire+0xfc/0x150 [ 3542.174840] ?
>> perf_event_ctx_lock_nested+0x24/0x230
>> [ 3542.175205] mutex_lock_nested+0x1c/0x30 [ 3542.175505]
>> perf_event_ctx_lock_nested+0xc8/0x230
>> [ 3542.181147] ? perf_event_ctx_lock_nested+0x24/0x230
>> [ 3542.186839] perf_event_disable+0x19/0x80 [ 3542.192502]
>> vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel
>> 7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
>> [ 3542.203771] ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.209378] vcpu_enter_guest+0xa96/0x1350 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.215218] ? vcpu_enter_guest+0xbe1/0x1350 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.226292] ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.231956] ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.237542] ? lock_acquire+0xfc/0x150 [ 3542.243093] ?
>> __rseq_handle_notify_resume+0x3a/0x60
>> [ 3542.248689] vcpu_run+0x53/0x490 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.254533] ? vcpu_run+0x35a/0x490 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.260567] kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.272395] ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.284586] kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.290973] ? lock_acquire+0xfc/0x150 [ 3542.296990] ?
>> rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.302912] ? lock_release+0x118/0x190 [ 3542.308800] ?
>> __fget_files+0xe8/0x1c0 [ 3542.314710] ? __fget_files+0x5/0x1c0
>> [ 3542.320591] __x64_sys_ioctl+0x96/0xd0 [ 3542.326500]
>> do_syscall_64+0x3a/0x90 [ 3542.332426]
>> entry_SYSCALL_64_after_hwframe+0x5e/0xc8
>>
>>
>> I know little about perf.
>
> +Kan to double confirm if needed.
perf_event_disable() eventually invokes intel_pt_stop(). We already expose
intel_pt_stop()/cpu_emergency_stop_pt() to other modules, so I don't think
we have to use perf_event_disable(). Also, perf_event_disable() requires
extra code.
I went through the discussions, and I agree with Sean's suggestion:
keep only the control logic in KVM and put all the MSR access details
into the PT driver. But I prefer more generic and straightforward
function names, e.g., intel_pt_stop_save()/intel_pt_start_load(), in case
other modules want to save/restore the PT state in their own context
switches later.
Thanks,
Kan
>
>> It seems perf_event_disable() is not widely used by other kernel
>> components. Is there an alternative? If not, I think exposing
>> disable/enable helpers from the PT driver, as this series does, seems OK.
>
> Oops, it was actually a mistake to disable the event on other CPUs.
> Can you expose and use perf_event_disable_local?
>
> For the enabling side, we need to add and expose perf_event_enable_local:
> event_function_local(event, __perf_event_enable, NULL);
On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
> The perf_event_disable() eventually invokes the intel_pt_stop().
> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to other
> modules. I don't think we have to use the perf_event_disable(). Also, the
> perf_event_disable() requires extra codes.
>
> I went through the discussions. I agree with Sean's suggestion.
> We should only put the logic in the KVM but all the MSR access details into the PT
> driver.
Even the driver itself doesn't drive the save/restore of the MSRs; it is
driven by perf.
1. If we make KVM a user of perf, we should do this via perf_event_disable/enable_*.
2. If we make KVM an alternative to perf (i.e. give it direct control over
the PMU HW), we can do this via driver interfaces, like perf does.
In my experience, we should go for 1. We probably need Peter's opinion on this.
> But I prefer a more generic and straightforward function name, e.g.,
> intel_pt_stop_save()/intel_pt_start_load(), in case other modules may want to
> save/restore the PT information in their context switch later.
>
> Thanks,
> Kan
>
> >
> >> It seems perf_event_disable() is not used widely by other kernel
> >> component.
Because there are not many kernel users.
You can check another user, watchdog_hld.c, where perf_event_enable/disable are used.
On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
> > On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
> >> The perf_event_disable() eventually invokes the intel_pt_stop().
> >> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
> >> other modules. I don't think we have to use the perf_event_disable().
> >> Also, the
> >> perf_event_disable() requires extra codes.
> >>
> >> I went through the discussions. I agree with Sean's suggestion.
> >> We should only put the logic in the KVM but all the MSR access
> >> details into the PT driver.
> >
> > Even the driver itself doesn’t drive the save/restore of the MSRs, it is drived
> by perf.
>
> It through perf_event, not driven by perf_event. The perf_event generic code
> never knows when should invokes each driver to save/restore information. It
> should be driven by the other subsystem e.g., scheduler.
Yes. The cpu scheduler does this via the perf subsystem, though.
>
> For this case, KVM should drive the save/restore, and the PT driver eventually
> does all the MSR access details.
>
> > 1. If we make KVM a user of perf, we should do this via
> perf_event_disable/enable_*.
> > 2. If we make KVM an alternative to perf (i.e. have direct control
> > over PMU HW), we can do this via driver interfaces like perf.
> > Per my experience, we should go for 1. Probably need Peter's opinions on
> this.
> >
>
> For 1, the perf_event_disable/enable_* are not enough. They don't
> save/restore MSRs.
perf_event_disable() will go through perf to call pt_event_stop(), which saves the related MSRs, right?
(If so, what large changes did you mean?)
> If we go to this way, we have to introduce a new generic
> interface to ask each driver to save/restore their MSRs when the guest is
> entering/exiting. We'd better combine the new interface with the existing
> perf_guest_get_msrs() of the core driver.
> I think that's an ideal solution, but requires big changes in the code.
>
> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr(). I don't
> think it's a right way. We'd better fix it.
>
> The suggestion should be 3. The KVM notify the PT driver via the interface
> provided by PT. The PT driver save/restore all the registers.
> I think it's an acceptable solution with small code changes.
This looks like we just relocate the save/restore functions to the PT driver while KVM still calls them directly - still not going through perf's management. Imagine if every user operated on the PMU h/w directly like this; things would be a mess.
Thanks,
Wei
On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>> The perf_event_disable() eventually invokes the intel_pt_stop().
>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to other
>> modules. I don't think we have to use the perf_event_disable(). Also, the
>> perf_event_disable() requires extra codes.
>>
>> I went through the discussions. I agree with Sean's suggestion.
>> We should only put the logic in the KVM but all the MSR access details into the PT
>> driver.
>
> Even the driver itself doesn’t drive the save/restore of the MSRs, it is drived by perf.
It goes through perf_event, but is not driven by perf_event. The perf_event generic
code never knows when it should invoke each driver to save/restore
information. That has to be driven by another subsystem, e.g., the scheduler.
For this case, KVM should drive the save/restore, and the PT driver
eventually does all the MSR access details.
> 1. If we make KVM a user of perf, we should do this via perf_event_disable/enable_*.
> 2. If we make KVM an alternative to perf (i.e. have direct control over PMU HW),
> we can do this via driver interfaces like perf.
> Per my experience, we should go for 1. Probably need Peter's opinions on this.
>
For 1, the perf_event_disable/enable_* are not enough. They don't
save/restore MSRs. If we go this way, we have to introduce a new
generic interface to ask each driver to save/restore its MSRs when the
guest is entering/exiting. We'd better combine the new interface with
the existing perf_guest_get_msrs() of the core driver.
I think that's an ideal solution, but it requires big changes in the code.
2 is the current KVM implementation. See pt_save_msr()/pt_load_msr(). I
don't think it's the right way. We'd better fix it.
My suggestion is 3: KVM notifies the PT driver via an interface
provided by PT, and the PT driver saves/restores all the registers.
I think it's an acceptable solution with small code changes.
So I prefer 3.
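Option 3 could be sketched roughly as follows. This is an illustrative userspace model, not kernel code: the MSR array, the TraceEn bit handling, and the context layout are all invented stand-ins; only the two function names follow the intel_pt_stop_save()/intel_pt_start_load() naming suggested earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the PT MSR state; real code would RDMSR/WRMSR. */
enum { RTIT_CTL, RTIT_STATUS, RTIT_OUTPUT_BASE, RTIT_OUTPUT_MASK, NR_PT_MSRS };

static uint64_t pt_hw[NR_PT_MSRS];     /* models the hardware MSRs          */
static uint64_t host_ctx[NR_PT_MSRS];  /* host context kept by the PT driver */

#define RTIT_CTL_TRACEEN (1ULL << 0)

/* PT driver side: save the host configuration, then stop tracing so that
 * IA32_RTIT_CTL.TraceEn is 0 before VM entry. */
static void intel_pt_stop_save(void)
{
    for (int i = 0; i < NR_PT_MSRS; i++)
        host_ctx[i] = pt_hw[i];
    pt_hw[RTIT_CTL] &= ~RTIT_CTL_TRACEEN;
}

/* PT driver side: restore the host configuration, which also restores
 * TraceEn to whatever it was before the save. */
static void intel_pt_start_load(void)
{
    for (int i = 0; i < NR_PT_MSRS; i++)
        pt_hw[i] = host_ctx[i];
}
```

KVM would only decide *when* to call this pair around VM entry/exit; which MSRs are touched stays entirely inside the PT driver.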
Thanks,
Kan
>> But I prefer a more generic and straightforward function name, e.g.,
>> intel_pt_stop_save()/intel_pt_start_load(), in case other modules may want to
>> save/restore the PT information in their context switch later.
>>
>> Thanks,
>> Kan
>>
>>>
>>>> It seems perf_event_disable() is not used widely by other kernel
>>>> component.
>
> Because there are not lots of kernel users.
> You can check another user, watchdog_hld.c, perf_event_enable/disable are used there.
On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
>> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>>> The perf_event_disable() eventually invokes the intel_pt_stop().
>>>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
>>>> other modules. I don't think we have to use the perf_event_disable().
>>>> Also, the
>>>> perf_event_disable() requires extra codes.
>>>>
>>>> I went through the discussions. I agree with Sean's suggestion.
>>>> We should only put the logic in the KVM but all the MSR access
>>>> details into the PT driver.
>>>
>>> Even the driver itself doesn’t drive the save/restore of the MSRs, it is drived
>> by perf.
>>
>> It through perf_event, not driven by perf_event. The perf_event generic code
>> never knows when should invokes each driver to save/restore information. It
>> should be driven by the other subsystem e.g., scheduler.
>
> Yes. The cpu scheduler does this via the perf subsystem, though.
>
>>
>> For this case, KVM should drive the save/restore, and the PT driver eventually
>> does all the MSR access details.
>>
>>> 1. If we make KVM a user of perf, we should do this via
>> perf_event_disable/enable_*.
>>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>> over PMU HW), we can do this via driver interfaces like perf.
>>> Per my experience, we should go for 1. Probably need Peter's opinions on
>> this.
>>>
>>
>> For 1, the perf_event_disable/enable_* are not enough. They don't
>> save/restore MSRs.
>
> perf_event_disable will go through perf to call pt_event_stop which saves the related MSRs, right?
I don't think so. The pt_event_stop() doesn't save all the
MSR_IA32_RTIT_* MSRs.
> (if so, what large changes did you mean?)
>
>> If we go to this way, we have to introduce a new generic
>> interface to ask each driver to save/restore their MSRs when the guest is
>> entering/exiting. We'd better combine the new interface with the existing
>> perf_guest_get_msrs() of the core driver.
>> I think that's an ideal solution, but requires big changes in the code.
>>
>> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr(). I don't
>> think it's a right way. We'd better fix it.
>>
>> The suggestion should be 3. The KVM notify the PT driver via the interface
>> provided by PT. The PT driver save/restore all the registers.
>> I think it's an acceptable solution with small code changes.
>
> This looks like we just relocate the save/restore functions to the PT driver and KVM still directly call them - still not going through perf's management. Imagine every user operates on the pmu h/w directly like this, things would be a mess.
>
The pt_event_stop() and the proposed interface still manipulate the PT
event pt->handle.event. The event status is updated as well. It's still
under control of the perf_event.
While the current KVM implementation implicitly updates the MSRs without
updating the event status.
Also, KVM doesn't know the PT as well as the PT driver. It's better to
let the dedicated driver maintain the details. Otherwise, if we add more
MSRs later, we have to maintain both KVM and PT.
Thanks,
Kan
On Thursday, September 15, 2022 11:43 PM, Liang, Kan wrote:
> On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
> > On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
> >> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
> >>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
> >>>> The perf_event_disable() eventually invokes the intel_pt_stop().
> >>>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
> >>>> other modules. I don't think we have to use the perf_event_disable().
> >>>> Also, the
> >>>> perf_event_disable() requires extra codes.
> >>>>
> >>>> I went through the discussions. I agree with Sean's suggestion.
> >>>> We should only put the logic in the KVM but all the MSR access
> >>>> details into the PT driver.
> >>>
> >>> Even the driver itself doesn’t drive the save/restore of the MSRs,
> >>> it is drived
> >> by perf.
> >>
> >> It through perf_event, not driven by perf_event. The perf_event
> >> generic code never knows when should invokes each driver to
> >> save/restore information. It should be driven by the other subsystem e.g.,
> scheduler.
> >
> > Yes. The cpu scheduler does this via the perf subsystem, though.
> >
> >>
> >> For this case, KVM should drive the save/restore, and the PT driver
> >> eventually does all the MSR access details.
> >>
> >>> 1. If we make KVM a user of perf, we should do this via
> >> perf_event_disable/enable_*.
> >>> 2. If we make KVM an alternative to perf (i.e. have direct control
> >>> over PMU HW), we can do this via driver interfaces like perf.
> >>> Per my experience, we should go for 1. Probably need Peter's
> >>> opinions on
> >> this.
> >>>
> >>
> >> For 1, the perf_event_disable/enable_* are not enough. They don't
> >> save/restore MSRs.
> >
> > perf_event_disable will go through perf to call pt_event_stop which saves
> the related MSRs, right?
>
> I don't think so. The pt_event_stop() doesn't save all the
> MSR_IA32_RTIT_* MSRs.
Not all the MSRs are required to be saved. In general, pt_event_stop should
save all the MSRs required for event switching; otherwise the general usages
of PT would already be broken. To be more precise, the following MSRs are not saved by
pt_event_stop, but I don’t see that they need to be saved:
- MSR_IA32_RTIT_CR3_MATCH: I don’t see it being used by perf.
It seems KVM saves an MSR that's not even used by the host.
- Address range MSRs (MSR_IA32_RTIT_ADDR0_A etc.): those are provided by s/w and not updated by h/w,
so they're just written to the MSRs when the event gets scheduled in. There is no need to save them.
>
> > (if so, what large changes did you mean?)
> >
> >> If we go to this way, we have to introduce a new generic interface to
> >> ask each driver to save/restore their MSRs when the guest is
> >> entering/exiting. We'd better combine the new interface with the
> >> existing
> >> perf_guest_get_msrs() of the core driver.
> >> I think that's an ideal solution, but requires big changes in the code.
> >>
> >> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
> >> I don't think it's a right way. We'd better fix it.
> >>
> >> The suggestion should be 3. The KVM notify the PT driver via the
> >> interface provided by PT. The PT driver save/restore all the registers.
> >> I think it's an acceptable solution with small code changes.
> >
> > This looks like we just relocate the save/restore functions to the PT driver
> and KVM still directly call them - still not going through perf's management.
> Imagine every user operates on the pmu h/w directly like this, things would be
> a mess.
> >
>
>
> The pt_event_stop() and the proposed interface still manipulate the PT event
> pt->handle.event. The event status is updated as well. It's still under control of
> the perf_event.
Did you mean to handle the PT event in the proposed driver API? Event status is just
one of the things. There are other things if we want to make it complete,
e.g. event->oncpu = -1, and eventually it seems we would re-implement perf_event_disable_*.
Btw, Xiaoyao has made it work with perf_event_disable_local, and it doesn’t require that many changes.
If necessary, we can post the 2nd version out to double check.
Thanks,
Wei
On 2022-09-15 10:30 p.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 11:43 PM, Liang, Kan wrote:
>> On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
>>> On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
>>>> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>>>>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>>>>> The perf_event_disable() eventually invokes the intel_pt_stop().
>>>>>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
>>>>>> other modules. I don't think we have to use the perf_event_disable().
>>>>>> Also, the
>>>>>> perf_event_disable() requires extra codes.
>>>>>>
>>>>>> I went through the discussions. I agree with Sean's suggestion.
>>>>>> We should only put the logic in the KVM but all the MSR access
>>>>>> details into the PT driver.
>>>>>
>>>>> Even the driver itself doesn’t drive the save/restore of the MSRs,
>>>>> it is drived
>>>> by perf.
>>>>
>>>> It through perf_event, not driven by perf_event. The perf_event
>>>> generic code never knows when should invokes each driver to
>>>> save/restore information. It should be driven by the other subsystem e.g.,
>> scheduler.
>>>
>>> Yes. The cpu scheduler does this via the perf subsystem, though.
>>>
>>>>
>>>> For this case, KVM should drive the save/restore, and the PT driver
>>>> eventually does all the MSR access details.
>>>>
>>>>> 1. If we make KVM a user of perf, we should do this via
>>>> perf_event_disable/enable_*.
>>>>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>>>> over PMU HW), we can do this via driver interfaces like perf.
>>>>> Per my experience, we should go for 1. Probably need Peter's
>>>>> opinions on
>>>> this.
>>>>>
>>>>
>>>> For 1, the perf_event_disable/enable_* are not enough. They don't
>>>> save/restore MSRs.
>>>
>>> perf_event_disable will go through perf to call pt_event_stop which saves
>> the related MSRs, right?
>>
>> I don't think so. The pt_event_stop() doesn't save all the
>> MSR_IA32_RTIT_* MSRs.
>
> Not all the MSRs are required to be saved. In general, pt_event_stop should have
> saved all the MSRs required for an event switching. Otherwise the general usages
> of PT have been broken. To be more precise, the following MSRs are not saved by
> pt_event_stop, but I don’t see they are required to be saved:
>
> - MSR_IA32_RTIT_CR3_MATCH: I don’t see it is used by perf.
> Seems like KVM saved an MSR that's not even used by the host.
>
> - Address range MSRs (MSR_IA32_RTIT_ADDR0_A etc.): Those are provided by s/w and not updated by h/w.
> So they're just set to MSRs when event gets scheduled in. There is no need to save.
>
OK. I think you need a clean-up patch to fix them first.
>>
>>> (if so, what large changes did you mean?)
>>>
>>>> If we go to this way, we have to introduce a new generic interface to
>>>> ask each driver to save/restore their MSRs when the guest is
>>>> entering/exiting. We'd better combine the new interface with the
>>>> existing
>>>> perf_guest_get_msrs() of the core driver.
>>>> I think that's an ideal solution, but requires big changes in the code.
>>>>
>>>> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
>>>> I don't think it's a right way. We'd better fix it.
>>>>
>>>> The suggestion should be 3. The KVM notify the PT driver via the
>>>> interface provided by PT. The PT driver save/restore all the registers.
>>>> I think it's an acceptable solution with small code changes.
>>>
>>> This looks like we just relocate the save/restore functions to the PT driver
>> and KVM still directly call them - still not going through perf's management.
>> Imagine every user operates on the pmu h/w directly like this, things would be
>> a mess.
>>>
>>
>>
>> The pt_event_stop() and the proposed interface still manipulate the PT event
>> pt->handle.event. The event status is updated as well. It's still under control of
>> the perf_event.
>
> Did you mean to handle the PT event in the proposed driver API? Event status is just
> one of the things. There are other things if we want to make it complete for this,
> e.g. event->oncpu = -1, and eventually seems we will re-implement perf_event_disable_*.
>
As I understand it, perf always checks the status first. If it's a stopped
or inactive event, I don't think event->oncpu will be touched. That's
why I think the proposed driver API should be acceptable.
> Btw, Xiaoyao has made it work with perf_event_disable_local, and don’t have that many changes.
> If necessary, we can post the 2nd version out to double check.
>
I'm not worried about which way (either perf_event_disable_local() or the
proposed PT driver API) is chosen to stop the PT. If the existing
perf_event interfaces can meet your requirement, that's perfect.
My real concern is pt_save_msr()/pt_load_msr(). I don't think it's a
job for KVM. See atomic_switch_perf_msrs(). It is the perf core driver
rather than KVM that tells which MSRs should be saved/restored via the VMCS.
We should do the same thing for PT. (Actually, I think we already
encounter issues with the current KVM-dominated method: KVM
saves/restores unnecessary MSRs. Right?)
To do so, I think there are two ways.
- Since MSRs have to be switched for both the PT and core drivers, it sounds
reasonable to provide a new generic interface in perf_event. The new
interface tells KVM which MSRs should be saved/restored, and KVM can
then decide to save/restore via the VMCS or direct MSR access. I suspect this
way requires big changes, but it will benefit all the drivers which have
similar requirements.
- The proposed driver API. The MSRs are saved/restored in the PT driver.
Thanks,
Kan
On Friday, September 16, 2022 9:27 PM, Liang, Kan wrote:
> > Did you mean to handle the PT event in the proposed driver API? Event
> > status is just one of the things. There are other things if we want to
> > make it complete for this, e.g. event->oncpu = -1, and eventually seems we will
> re-implement perf_event_disable_*.
> >
>
> As my understand, perf always check the status first. If it's a stopped or
> inactivated event, I don't think event->oncpu will be touched. That's why I think
> the proposed driver API should be acceptable.
That's an implementation detail. We need to make it architecturally clean though.
>
> > Btw, Xiaoyao has made it work with perf_event_disable_local, and don’t have
> that many changes.
> > If necessary, we can post the 2nd version out to double check.
> >
>
> I'm not worry about which ways (either perf_event_disable_local() or the
> proposed PT driver API) are chosen to stop the PT. If the existing perf_event
> interfaces can meet your requirement, that's perfect.
>
> My real concern is the pt_save_msr()/pt_load_msr(). I don't think it's a job for
> KVM. See atomic_switch_perf_msrs(). It is the perf core driver rather than KVM
> that tells which MSRs should be saved/restored in VMCS.
> We should do the same thing for PT. (Actually, I think we already encounter
> issues with the current KVM-dominated method. KVM saves/restores
> unnecessary MSRs. Right?)
>
Right. It's on my plan to improve the current PT virtualization,
planned as the next step after this fix. The general rule is the same: make KVM a user
of perf, that is, we leave the save/restore work to be completely done by the
perf (driver) side, so we will eventually remove the KVM-side pt_save/load_msr.
To be more precise, it will work as below:
- we will create a guest event, like what we did for lbr virtualization
- on VMEnter:
-- perf_event_disable_local(host_event);
-- perf_event_enable_local(guest_event);
- on VMExit:
-- perf_event_disable_local(guest_event);
-- perf_event_enable_local(host_event);
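The entry/exit sequence above can be modeled with a toy event structure. Note that perf_event_enable_local() does not exist in perf at this point (exposing it was suggested earlier in the thread), so both helpers below are hypothetical stand-ins operating on an invented struct, not the real perf API:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented minimal event type; the real perf_event is far richer. */
struct toy_event { bool active; };

static void perf_event_disable_local(struct toy_event *e) { e->active = false; }
static void perf_event_enable_local(struct toy_event *e)  { e->active = true;  }

static struct toy_event host_event  = { .active = true  };  /* host PT tracing */
static struct toy_event guest_event = { .active = false };  /* guest PT event  */

static void pt_switch_to_guest(void)
{
    /* Host PT off via perf, so the PT PMI handler can't re-enable it
     * between here and VM entry; guest context takes over. */
    perf_event_disable_local(&host_event);
    perf_event_enable_local(&guest_event);
}

static void pt_switch_to_host(void)
{
    perf_event_disable_local(&guest_event);
    perf_event_enable_local(&host_event);
}
```

The point of the sketch is the ordering: the host event is disabled through perf's own state machine rather than by a bare WRMSR, which is what keeps the PMI handler from racing with VM entry.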
> To do so, I think there may be two ways.
> - Since MSRs have to be switched for both PT and core drivers, it sounds
> reasonable to provide a new generic interface in the perf_event. The new
> interface is to tell KVM which MSRs should be saved/restored. Then KVM can
> decide to save/restore via VMCS or direct MSR access. I suspect this way
> requires big change, but it will benefit all the drivers which have similar
> requirements.
> - The proposed driver API. The MSRs are saved/restored in the PT driver.
As shown above, no need for those. We can completely reuse the
perf side save/restore.
Thanks,
Wei
On 2022-09-19 9:46 a.m., Wang, Wei W wrote:
> On Friday, September 16, 2022 9:27 PM, Liang, Kan wrote:
>>> Did you mean to handle the PT event in the proposed driver API? Event
>>> status is just one of the things. There are other things if we want to
>>> make it complete for this, e.g. event->oncpu = -1, and eventually seems we will
>> re-implement perf_event_disable_*.
>>>
>>
>> As my understand, perf always check the status first. If it's a stopped or
>> inactivated event, I don't think event->oncpu will be touched. That's why I think
>> the proposed driver API should be acceptable.
>
> That's the implementation thing. We need to make it architecturally clean though.
>
>>
>>> Btw, Xiaoyao has made it work with perf_event_disable_local, and don’t have
>> that many changes.
>>> If necessary, we can post the 2nd version out to double check.
>>>
>>
>> I'm not worry about which ways (either perf_event_disable_local() or the
>> proposed PT driver API) are chosen to stop the PT. If the existing perf_event
>> interfaces can meet your requirement, that's perfect.
>>
>> My real concern is the pt_save_msr()/pt_load_msr(). I don't think it's a job for
>> KVM. See atomic_switch_perf_msrs(). It is the perf core driver rather than KVM
>> that tells which MSRs should be saved/restored in VMCS.
>> We should do the same thing for PT. (Actually, I think we already encounter
>> issues with the current KVM-dominated method. KVM saves/restores
>> unnecessary MSRs. Right?)
>>
>
> Right. It's on my plan to improve the current PT virtualization, and
> planed to be the next step after this fix. The general rule is the same: make KVM a user
> of perf, that is, we leave those save/restore work to be completely done by the
> perf (driver) side, so we will eventually remove the KVM side pt_save/load_msr.
> To be more precise, it will work as below:
> - we will create a guest event, like what we did for lbr virtualization
Another fake event? We would have to specially handle it in the perf code. I
don't think that's a clean way for perf.
> - on VMEnter:
> -- perf_disable_event_local(host_event);
> -- perf_enable_event_local(guest_event);
> - on VMExit:
> -- perf_disable_event_local(guest_event);
> -- perf_enable_event_local(host_event);
Why can't we use the same way as the perf core driver to switch the
MSRs via the VMCS?
You just need one generic function, perf_guest_get_msrs(), for both the PT
and core drivers. If you have to disable PT explicitly before VM entry, I
think you can do it in the PT-specific perf_guest_get_msrs().
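A rough userspace model of that pattern follows. The struct name mirrors the real perf_guest_switch_msr, but the PT callback, the autoload arrays, and the values are all invented for illustration; real guest PT state would come from the vCPU, not from masking the host value.

```c
#include <assert.h>
#include <stdint.h>

/* Modeled on perf's {msr, host, guest} triple. */
struct perf_guest_switch_msr {
    uint32_t msr;
    uint64_t host, guest;
};

#define MSR_IA32_RTIT_CTL 0x570
#define RTIT_CTL_TRACEEN  (1ULL << 0)

static uint64_t host_rtit_ctl = RTIT_CTL_TRACEEN | 0x10;  /* pretend host traces */

/* PT driver side: report RTIT_CTL with TraceEn clear as the guest value,
 * so the VM-entry MSR load leaves tracing off; the host value is
 * restored by the VM-exit MSR load. */
static int pt_guest_get_msrs(struct perf_guest_switch_msr *arr)
{
    arr[0].msr   = MSR_IA32_RTIT_CTL;
    arr[0].host  = host_rtit_ctl;
    arr[0].guest = host_rtit_ctl & ~RTIT_CTL_TRACEEN;
    return 1;
}

/* KVM side, in the spirit of atomic_switch_perf_msrs(): program whatever
 * the driver reports into the VMCS entry-load/exit-load lists. */
struct autoload { uint32_t msr; uint64_t val; };
static struct autoload vmcs_entry_load[8], vmcs_exit_load[8];

static int atomic_switch_pt_msrs(void)
{
    struct perf_guest_switch_msr msrs[8];
    int i, nr = pt_guest_get_msrs(msrs);

    for (i = 0; i < nr; i++) {
        vmcs_entry_load[i] = (struct autoload){ msrs[i].msr, msrs[i].guest };
        vmcs_exit_load[i]  = (struct autoload){ msrs[i].msr, msrs[i].host  };
    }
    return nr;
}
```

The design point is the same as for the core PMU: the driver enumerates what to switch, and KVM merely programs the result, never deciding on MSRs itself.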
Anyway, that's an improvement for the current code. I don't have a
problem if you prefer to separate the fix patch and the improvement patch.
Thanks,
Kan
>
>> To do so, I think there may be two ways.
>> - Since MSRs have to be switched for both PT and core drivers, it sounds
>> reasonable to provide a new generic interface in the perf_event. The new
>> interface is to tell KVM which MSRs should be saved/restored. Then KVM can
>> decide to save/restore via VMCS or direct MSR access. I suspect this way
>> requires big change, but it will benefit all the drivers which have similar
>> requirements.
>> - The proposed driver API. The MSRs are saved/restored in the PT driver.
>
> As shown above, no need for those. We can completely reuse the
> perf side save/restore.
>
> Thanks,
> Wei
On Monday, September 19, 2022 10:41 PM, Liang, Kan wrote:
> Another fake event? We have to specially handle it in the perf code. I don't
> think it's a clean way for perf.
We can check the patch later. I think it should be clean, like the LBR side.
>
> > - on VMEnter:
> > -- perf_disable_event_local(host_event);
> > -- perf_enable_event_local(guest_event);
> > - on VMExit:
> > -- perf_disable_event_local(guest_event);
> > -- perf_enable_event_local(host_event);
>
> Why we cannot use the same way as the perf core driver to switch the MSRs in
> the VMCS?
The current VMCS MSR-switching list isn't fast; it
should be the last resort, used only when really necessary.
>
> You just need one generic function, perf_guest_get_msrs(), for both PT and
> core driver. If you have to disable PT explicitly before VMCS, I think you can do
> it in the PT specific perf_guest_get_msrs().
The disable is done via the "Clear IA32_RTIT_CTL" VM-exit control.
It ensures PT is disabled in time on VM exit, so there is no need to go through perf_guest_get_msrs().
Thanks,
Wei
On 2022-09-19 11:22 a.m., Wang, Wei W wrote:
> On Monday, September 19, 2022 10:41 PM, Liang, Kan wrote:
>> Another fake event? We have to specially handle it in the perf code. I don't
>> think it's a clean way for perf.
>
> We can check the patch later. I think it should be clean, like the LBR side.
I doubt it. Perf already specially handles the fake LBR event in several
places, from the core code to the LBR code. It also occupies a reserved
bit. If there is another choice, I don't think we want to go that way.
>
>>
>>> - on VMEnter:
>>> -- perf_disable_event_local(host_event);
>>> -- perf_enable_event_local(guest_event);
>>> - on VMExit:
>>> -- perf_disable_event_local(guest_event);
>>> -- perf_enable_event_local(host_event);
>>
>> Why we cannot use the same way as the perf core driver to switch the MSRs in
>> the VMCS?
>
> The current MSR switching list from VMCS isn’t fast,
> should be the last resort when really necessary.
>
It's a documented way in the SDM. I believe there must be some reason
Intel introduced it. Since it's a documented (or recommended) way, I
think we'd better use it if possible.
Since both the PT and the core driver need to switch MSRs at VM
entry/exit, it's better to use the same way (function) to handle them.
Thanks,
Kan
>>
>> You just need one generic function, perf_guest_get_msrs(), for both PT and
>> core driver. If you have to disable PT explicitly before VMCS, I think you can do
>> it in the PT specific perf_guest_get_msrs().
>
> The disable is done via " Clear IA32_RTIT_CTL" VMExit control.
> It can ensure PT disabled in time on VMExit, so no need to go through perf_guest_get_msrs.
>