On 6/3/2024 11:09 pm, Jim Mattson wrote:
> On Wed, Mar 6, 2024 at 1:11 AM Like Xu <[email protected]> wrote:
>>
>> On 6/3/2024 7:22 am, Sean Christopherson wrote:
>>> +Mingwei
>>>
>>> On Thu, Aug 24, 2023, Dapeng Mi wrote:
>>>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>>>> index 7d9ba301c090..ffda2ecc3a22 100644
>>>> --- a/arch/x86/kvm/pmu.h
>>>> +++ b/arch/x86/kvm/pmu.h
>>>> @@ -12,7 +12,8 @@
>>>> MSR_IA32_MISC_ENABLE_BTS_UNAVAIL)
>>>>
>>>> /* retrieve the 4 bits for EN and PMI out of IA32_FIXED_CTR_CTRL */
>>>> -#define fixed_ctrl_field(ctrl_reg, idx) (((ctrl_reg) >> ((idx)*4)) & 0xf)
>>>> +#define fixed_ctrl_field(ctrl_reg, idx) \
>>>> + (((ctrl_reg) >> ((idx) * INTEL_FIXED_BITS_STRIDE)) & INTEL_FIXED_BITS_MASK)
>>>>
>>>> #define VMWARE_BACKDOOR_PMC_HOST_TSC 0x10000
>>>> #define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
>>>> @@ -165,7 +166,8 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
>>>>
>>>> if (pmc_is_fixed(pmc))
>>>> return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
>>>> - pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
>>>> + pmc->idx - INTEL_PMC_IDX_FIXED) &
>>>> + (INTEL_FIXED_0_KERNEL | INTEL_FIXED_0_USER);
>>>>
>>>> return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
>>>> }
>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>> index f2efa0bf7ae8..b0ac55891cb7 100644
>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>> @@ -548,8 +548,13 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>>> setup_fixed_pmc_eventsel(pmu);
>>>> }
>>>>
>>>> - for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>>> - pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4));
>>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>> + pmu->fixed_ctr_ctrl_mask &=
>>>> + ~intel_fixed_bits_by_idx(i,
>>>> + INTEL_FIXED_0_KERNEL |
>>>> + INTEL_FIXED_0_USER |
>>>> + INTEL_FIXED_0_ENABLE_PMI);
>>>> + }
>>>> counter_mask = ~(((1ull << pmu->nr_arch_gp_counters) - 1) |
>>>> (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED));
>>>> pmu->global_ctrl_mask = counter_mask;
>>>> @@ -595,7 +600,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>>> pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
>>>> for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>> pmu->fixed_ctr_ctrl_mask &=
>>>> - ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
>>>
>>> OMG, this might just win the award for most obfuscated PMU code in KVM, which is
>>> saying something. The fact that INTEL_PMC_IDX_FIXED happens to be 32, the same
>>> bit number as ICL_FIXED_0_ADAPTIVE, is 100% coincidence. Good riddance.
>>>
>>> Argh, and this goofy code helped introduce a real bug. reprogram_fixed_counters()
>>> doesn't account for the upper 32 bits of IA32_FIXED_CTR_CTRL.
>>>
>>> Wait, WTF? Nothing in KVM accounts for the upper bits. This can't possibly work.
>>>
>>> IIUC, because KVM _always_ sets precise_ip to a non-zero bit for PEBS events,
>>> perf will _always_ generate an adaptive record, even if the guest requested a
>>> basic record. Ugh, and KVM will always generate adaptive records even if the
>>> guest doesn't support them. This is all completely broken. It probably kinda
>>> sorta works because the Basic info is always stored in the record, and generating
>>> more info requires a non-zero MSR_PEBS_DATA_CFG, but ugh.
>>
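For anyone following the bit arithmetic above, here is a minimal sketch of the IA32_FIXED_CTR_CTRL layout the patch's macros describe. The macro values are my reading of arch/x86/include/asm/perf_event.h and should be treated as assumptions. The helpers fixed_ctrl_field_of(), adaptive_bit_for(), old_adaptive_expr() and fixed_ctrl_selfcheck() are hypothetical names used only for illustration; they are not KVM functions.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed from arch/x86/include/asm/perf_event.h. */
#define INTEL_PMC_IDX_FIXED       32
#define INTEL_FIXED_BITS_STRIDE   4
#define INTEL_FIXED_BITS_MASK     0xFULL
#define INTEL_FIXED_0_KERNEL      (1ULL << 0)
#define INTEL_FIXED_0_USER        (1ULL << 1)
#define INTEL_FIXED_0_ANYTHREAD   (1ULL << 2)
#define INTEL_FIXED_0_ENABLE_PMI  (1ULL << 3)
#define ICL_FIXED_0_ADAPTIVE      (1ULL << 32)

#define intel_fixed_bits_by_idx(idx, bits) \
	((uint64_t)(bits) << ((idx) * INTEL_FIXED_BITS_STRIDE))

/* Hypothetical helper: the 4 low control bits of fixed counter idx. */
static inline uint64_t fixed_ctrl_field_of(uint64_t ctrl_reg, unsigned int idx)
{
	return (ctrl_reg >> (idx * INTEL_FIXED_BITS_STRIDE)) & INTEL_FIXED_BITS_MASK;
}

/* Hypothetical helper: adaptive bit of fixed counter idx, i.e. bit 32 + 4 * idx. */
static inline uint64_t adaptive_bit_for(unsigned int idx)
{
	return intel_fixed_bits_by_idx(idx, ICL_FIXED_0_ADAPTIVE);
}

/* Hypothetical helper: the expression the patch removes. It yields the same
 * value only because INTEL_PMC_IDX_FIXED happens to equal 32. */
static inline uint64_t old_adaptive_expr(unsigned int idx)
{
	return 1ULL << (INTEL_PMC_IDX_FIXED + idx * 4);
}

/* Hypothetical self-check: the old 0xb literal is KERNEL | USER | ENABLE_PMI. */
static inline void fixed_ctrl_selfcheck(void)
{
	assert(0xbULL == (INTEL_FIXED_0_KERNEL | INTEL_FIXED_0_USER |
			  INTEL_FIXED_0_ENABLE_PMI));
}
```

Note that fixed_ctrl_field_of() only ever sees bits 3:0 of each field, so any code built on that accessor alone cannot observe the adaptive bits in the upper half of the register.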
>> Yep, it works at least on machines with both adaptive and pebs_full features.
>>
>> I remember one generation of Atom core (GOLDMONT?) that didn't have both
>> of the above PEBS sub-features, so we didn't set x86_pmu.pebs_ept on that platform.
>>
>> Mingwei or others are encouraged to construct use cases in KUT::pmu_pebs.flat
>> that violate guest-pebs rules (e.g., leak host state), as we all recognize that
>> testing is the right way to condemn legacy code, not just lengthy emails.
>>
>>>
>>> Oh great, and it gets worse. intel_pmu_disable_fixed() doesn't clear the upper
>>> bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set. Unless I'm misreading the code,
>>> intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE either,
>>> as it only modifies the bit when it wants to set ICL_FIXED_0_ADAPTIVE.
>>>
>>> *sigh*
>>>
>>> I'm _very_ tempted to disable KVM PEBS support for the current PMU, and make it
>>> available only when the so-called passthrough PMU is available[*]. Because I
>>> don't see how this can possibly be functionally correct, nor do I see a way
>>> to make it functionally correct without a rather large and invasive series.
>>
>> Considering that I've tried the idea myself, I have no inclination towards
>> "passthrough PMU", and I'd like to be able to take the time to review that
>> patchset while we all wait for a clear statement from that perf-core man,
>> who doesn't really care about virtualization and doesn't want to lose control
>> of global hardware resources.
>>
>> Before we actually get to that ideal state you want, we have to deal with
>> some intermediate state and face any users that rely on the current code.
>> You had urged merging a KVM document for vPMU; I'm not sure how far
>> along that part of the work is.
>>
>>>
>>> Ouch. And after chatting with Mingwei, who asked the very good question of
>>> "can this leak host state?", I am pretty sure that yes, this can leak host state.
>>
>> The Basic Info has a tsc field, I suspect it's the host-state-tsc.
>>
>>>
>>> When PERF_CAP_PEBS_BASELINE is enabled for the guest, i.e. when the guest has
>>> access to adaptive records, KVM gives the guest full access to MSR_PEBS_DATA_CFG
>>>
>>> pmu->pebs_data_cfg_mask = ~0xff00000full;
>>>
>>> which makes sense in a vacuum, because AFAICT the architecture doesn't allow
>>> exposing a subset of the four adaptive controls.
>>>
>>> GPRs and XMMs are always context switched and thus benign, but IIUC, Memory Info
>>> provides data that might not otherwise be available to the guest, e.g. if host
>>> userspace has disallowed equivalent events via KVM_SET_PMU_EVENT_FILTER.
>>
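To make the mask quoted above concrete, here is a small sketch of what ~0xff00000full leaves writable in MSR_PEBS_DATA_CFG. The PEBS_DATACFG_* bit positions are my reading of arch/x86/include/asm/perf_event.h and are assumptions here. The helpers pebs_data_cfg_write_allowed() and pebs_data_cfg_lbr_field() are hypothetical names for illustration, not KVM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions assumed from arch/x86/include/asm/perf_event.h. */
#define PEBS_DATACFG_MEMINFO    (1ULL << 0)
#define PEBS_DATACFG_GPRS       (1ULL << 1)
#define PEBS_DATACFG_XMMS       (1ULL << 2)
#define PEBS_DATACFG_LBRS       (1ULL << 3)
#define PEBS_DATACFG_LBR_SHIFT  24

/* Reserved-bit mask from the quoted line: only bits 3:0 and 31:24 are clear,
 * i.e. the four adaptive controls and the LBR entry-count field. */
static const uint64_t pebs_data_cfg_mask = ~0xff00000fULL;

/* Hypothetical helper: would a guest write pass a reserved-bit check? */
static inline bool pebs_data_cfg_write_allowed(uint64_t data)
{
	return (data & pebs_data_cfg_mask) == 0;
}

/* Hypothetical helper: extract the LBR entry-count field, bits 31:24. */
static inline unsigned int pebs_data_cfg_lbr_field(uint64_t data)
{
	return (unsigned int)((data >> PEBS_DATACFG_LBR_SHIFT) & 0xff);
}
```

With this mask a guest can request Memory Info and LBRs in the same way as GPRs and XMMs; the mask alone gives no way to expose only a subset of the four controls.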
>> Indeed, KVM_SET_PMU_EVENT_FILTER doesn't work in harmony with
>> guest-pebs, and I believe there is a big problem here, especially with the
>> lack of targeted testing.
>>
>> One reason for this is that we don't use this cockamamie API in our
>> large-scale production environments, and users of vPMU want to get real
>> runtime information about physical cpus, not just virtualised hardware
>> architecture interfaces.
>>
>>>
>>> And unless I'm missing something, LBRs are a full leak of host state. Nothing
>>> in the SDM suggests that PEBS records honor MSR intercepts, so unless KVM is
>>> also passing through LBRs, i.e. is context switching all LBR MSRs, the guest can
>>> use PEBS to read host LBRs at will.
>>
>> KVM is also passing through LBRs when the guest uses LBR, but not at the
>> granularity of vm-exit/entry. I'm not sure whether the LBR_EN bit is required
>> to get LBR information via PEBS, and I have not confirmed whether PEBS-LBR
>> can be enabled at the same time as independent LBR.
>>
>> I recall that PEBS-assist, per cpu-arch, would clean up this part of the
>> record when crossing root/non-root boundaries, or not generate a record at all.
>>
>> We're looking forward to the tests that will undermine this perception.
>>
>> There are some devilish details during the processing of vm-exit and
>> the generation of host/guest pebs, and those interested can delve into
>> the short description in this SDM section "20.9.5 EPT-Friendly PEBS".
>>
>>>
>>> Unless someone chimes in to point out how PEBS virtualization isn't a broken mess,
>>> I will post a patch to effectively disable PEBS virtualization.
>>
>> There are two factors that affect the availability of guest-pebs:
>>
>> 1. the technical need to use core-PMU in both host/guest worlds;
>> (I don't think Googlers are paying attention to this part of users' needs)
>
> Let me clear up any misperceptions you might have that Google alone is
> foisting the pass-through PMU on the world. The work so far has been a
> collaboration between Google and Intel. Now, AMD has joined the
> collaboration as well. Mingwei is taking the lead on the project, but
> Googlers are outnumbered by the x86 CPU vendors ten to one.
This is such great news.
>
> The pass-through PMU allows both the host and guest worlds to use the
> core PMU, more so than the existing vPMU implementation. I assume your
Can I further confirm that in any case, host/guest can use PMU resources,
such as some special, more accurate counters? Is there an end of story
for that static partitioning scheme?
> complaint is about the desire for host software to monitor guest
> behavior with core PMU events while the guest is running. Today,
> Google Cloud does this for fleet management, and losing this
> capability is not something we are looking forward to. However, the
> writing is on the wall: Coco is going to take this capability away
> from us anyway.
Coco pays a corresponding performance cost, and it's a paradox to hide
any performance trace of coco-guests from the host's point of view.
Thanks for the input, Jim. Let me try to help.
>
>> 2. guest-pebs is temporarily disabled when counters are cross-mapped,
>> which reduces the number of performance samples collected by the guest;
>>
>>>
>>> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>>> index 41a4533f9989..a2f827fa0ca1 100644
>>> --- a/arch/x86/kvm/vmx/capabilities.h
>>> +++ b/arch/x86/kvm/vmx/capabilities.h
>>> @@ -392,7 +392,7 @@ static inline bool vmx_pt_mode_is_host_guest(void)
>>>
>>> static inline bool vmx_pebs_supported(void)
>>> {
>>> - return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
>>> + return false;
>>
>> As you know, user-space VMM may disable guest-pebs by filtering out the
>> MSR_IA32_PERF_CAPABILITIES.PERF_CAP_PEBS_FORMAT or CPUID.PDCM.
>>
>> In the end, if our great KVM maintainers insist on doing this,
>> there is obviously nothing I can do about it.
>>
>> Hope you have a good day.
>>
>>> }
>>>
>>> static inline bool cpu_has_notify_vmexit(void)
>>>
>>