by Like Xu

[permalink] [raw]

Subject: Re: [PATCH] KVM: x86/pmu: Prevent any host user from enabling PEBS for profiling guest

On 5/12/2023 11:19 pm, Liang, Kan wrote:
>
>
> On 2023-12-05 2:24 a.m., Like Xu wrote:
>>
>>
>> On 4/12/2023 11:19 pm, Liang, Kan wrote:
>>>
>>>
>>> On 2023-12-04 3:32 a.m., Like Xu wrote:
>>>>
>>>>
>>>> On 1/12/2023 10:38 pm, Liang, Kan wrote:
>>>>>
>>>>>
>>>>> On 2023-11-30 10:59 p.m., Like Xu wrote:
>>>>>> On 30/11/2023 11:49 pm, Liang, Kan wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2023-11-30 2:29 a.m., Like Xu wrote:
>>>>>>>> On 29/11/2023 10:38 pm, Liang, Kan wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2023-11-29 4:50 a.m., Like Xu wrote:
>>>>>>>>>> From: Like Xu <[email protected]>
>>>>>>>>>>
>>>>>>>>>> Stop using PEBS counters on host to profiling guest. Limit the
>>>>>>>>>> range of
>>>>>>>>>> enabled PEBS counters to only those counters enabled from the
>>>>>>>>>> guest
>>>>>>>>>> PEBS
>>>>>>>>>> emulation perspective.
>>>>>>>>>>
>>>>>>>>>> If there is a perf-record agent on host that uses perf-tools
>>>>>>>>>> events
>>>>>>>>>> like
>>>>>>>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event
>>>>>>>>>> counter)
>>>>>>>>>> to capture guest performance events, then the guest will be
>>>>>>>>>> hanged.
>>>>>>>>>> This is
>>>>>>>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit
>>>>>>>>>> linear
>>>>>>>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA.
>>>>>>>>>>
>>>>>>>>>> Any perf user using PEBS counters to profile guest on host is, in
>>>>>>>>>> perf/core
>>>>>>>>>> implementation details, trying to set bits on
>>>>>>>>>> cpuc->intel_ctrl_guest_mask
>>>>>>>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation
>>>>>>>>>> behaviour.
>>>>>>>>>> But the subsequent PEBS memory write, regardless of whether guest
>>>>>>>>>> PEBS is
>>>>>>>>>> enabled, can overshoot guest entry and corrupt guest memory.
>>>>>>>>>>
>>>>>>>>>> Profiling guest via PEBS-DS buffer on host is not supported at
>>>>>>>>>> this
>>>>>>>>>> time.
>>>>>>>>>> Fix this by filtering the real configured value of
>>>>>>>>>> arr[pebs_enable].guest
>>>>>>>>>> with the emulated state of guest enabled PEBS counters, under the
>>>>>>>>>> condition
>>>>>>>>>> of none cross-mapped PEBS counters.
>>>>>>>>>
>>>>>>>>> So the counter will be silently disabled. The user never knows why
>>>>>>>>> nothing is sampled.
>>>>>>>>> Since we don't support the case, profiling guest via PEBS-DS
>>>>>>>>> buffer on
>>>>>>>>> host. Maybe we should error out when creating the event. For
>>>>>>>>> example
>>>>>>>>> (not tested),
>>>>>>>>
>>>>>>>> Test failed.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> diff --git a/arch/x86/events/intel/core.c
>>>>>>>>> b/arch/x86/events/intel/core.c
>>>>>>>>> index 3871267d3237..24b90c70737f 100644
>>>>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>>>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct
>>>>>>>>> perf_event
>>>>>>>>> *event)
>>>>>>>>>              if ((event->attr.config & INTEL_ARCH_EVENT_MASK) ==
>>>>>>>>> INTEL_FIXED_VLBR_EVENT)
>>>>>>>>>                  return -EINVAL;
>>>>>>>>>
>>>>>>>>> +        /* Profiling guest via PEBS-DS buffer on host is not
>>>>>>>>> supported. */
>>>>>>>>> +        if (event->attr.exclude_host)
>>>>>>>>> +            return -EINVAL;
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Guest PEBS emulation also sets this bit, a typical call stack looks
>>>>>>>> like:
>>>>>>>>
>>>>>>>>        intel_pmu_hw_config+0x441/0x4d0
>>>>>>>>        hsw_hw_config+0x12/0xa0
>>>>>>>>        x86_pmu_event_init+0x98/0x370
>>>>>>>>        perf_try_init_event+0x47/0x130
>>>>>>>>        perf_event_alloc+0x446/0xeb0
>>>>>>>>        perf_event_create_kernel_counter+0x38/0x190
>>>>>>>>        pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm]
>>>>>>>>        kvm_pmu_handle_event+0x1a6/0x310 [kvm]
>>>>>>>>        vcpu_enter_guest+0x1388/0x19b0 [kvm]
>>>>>>>>        vcpu_run+0x117/0x6c0 [kvm]
>>>>>>>>        kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm]
>>>>>>>>        kvm_vcpu_ioctl+0x301/0x6e0 [kvm]
>>>>>>>>
>>>>>>>
>>>>>>> Oh right, the event from the KVM guest is also exclude_host.
>>>>>>> So we should only error out with the non-KVM exclude_host PEBS event.
>>>>>>>
>>>>>>>> Alternatively, this path is taken when using PEBS-via-PT to profile
>>>>>>>> guests on host.
>>>>>>>
>>>>>>> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT.
>>>>>>>
>>>>>>> Seems we just need to distinguish a KVM event and a normal host
>>>>>>> event.
>>>>>>> I don't have a better way to do it except using
>>>>>>> event->overflow_handler_context, which is NULL for a normal host
>>>>>>> event.
>>>>>>
>>>>>> The assumption that event->overflow_handler_context == NULL is unsafe,
>>>>>> considering .overflow_handler_context hook has its acclaimed
>>>>>> generality.
>>>>>>
>>>>>
>>>>> Yes, I agree it's very hacky.
>>>>>
>>>>> What we need here is a way to distinguish the KVM guest request from
>>>>> the
>>>>> others. How about the PF_VCPU flag?
>>>>>
>>>>>        if (event->attr.exclude_host &&
>>>>>            !is_pebs_pt(event) &&
>>>>>            (!(event->attach_state & PERF_ATTACH_TASK) ||
>>>>> !(current->flags &
>>>>> PF_VCPU))
>>>>>                return -EINVAL;
>>>>
>>>> Unfortunately, the tests are not passed w/ the above diff.
>>>
>>> What's the exact failed perf command?
>>
>> [0] one vcpu thread running //usr/bin/yes
>> [1] tools/perf/perf kvm --host --guest \
>> --guestkallsyms="..." \ --guestvmlinux="..." \
>> record --raw-samples --kcore \
>> -e "cpu-cycles:GP" \
>> -p `pidof cloud-hypervisor`
>>
>>>
>>> Is it because that the KVM guest is mistakenly killed or the host
>>> command is not detected?
>>
>> Theoretically it should error_out, but instead of just printing
>> unsupported info,
>> this command even collects samples pointing to vmx_vmexit, which is also
>> not
>> expected, there should be no samples taken for "cpu-cycles:GP" event.
>
> If the event creation is rejected by the kernel, the perf tool will
> automatically remove the P and give another try.
> What's the perf tool output with -vvv?
> Do you see "decreasing precise_ip by one"?
>
> If the precise_ip is decreased to 0, the event is a non-pebs which can
> be created successfully. So you will still see the samples.

I'm guessing this happened and the end-users need '-vvv' to notice the details.

>
>>
>> The point at which PF_VCPU is set/cleared is not synchronised or even
>> controllable with the time of perf_event being added and enabled.
>>
>> Going the way of PF_VCPU will bring more complexity.
>
> Seems we need a new flag to distinguish the KVM created event.

I'd love to see you make this happen, and those perf_events callers
from ebpf would appreciate the flag.

>
>>
>>>
>>>> But it's good to know that there is PF_VCPU and more things to play
>>>> around with.
>>>>
>>>> The AMD IBS also takes the precise path, but it puts the recorded values
>>>> on group of MSRs instead of the linear address-based memory.
>>>
>>> The check is in the intel_pmu_hw_config(). So AMD isn't impacted.
>>>
>>>>
>>>> The root cause of this issue is the hardware limitation, just like what
>>>> we did
>>>> in the commit 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry"),
>>>> a similar fix should also belong in the context of directly configuring
>>>> VMX-switch related hardware configuration values.
>>>>
>>>
>>> The above fix is different than this case. IIRC, It's caused by a ucode
>>> issue. So once there is a guest, SW has to explicitly disable the PEBS.
>>> Otherwise, the guest crashes even the host doesn't intend to profile the
>>> guest.
>>>
>>> While the case in the patch is apparently a violation of the current
>>> rules. I think it's better to detect it and error out early.
>>>
>>> Thanks,
>>> Kan
>>
>> "The host does not intend to profile the guest" implies that at the time
>> 26a4f3c08de4 was introduced, we didn't have the capability to use PEBS to
>> profiling guest, and the error_out message was not prompted to the end-user
>> when vPEBS is not supported, and the fix to prevent guest crashes appears
>> in intel_guest_get_msrs();
>>
>> Now vPEBS is supported and not enabled, but at this time, we're moving
>> the same logic that fixes guest crashes from intel_guest_get_msrs() to
>> generic intel_pmu_hw_config(). This might not be a good move.
>> For me, it fixes the same issue in the same way (I should have fixed
>> it in c59a1f106f5c).
>
> The problem is that the current KVM addicts to silently manipulating the
> MSRs. It doesn't care if the end user is informed properly about the
> changes. I don't think it's a proper way.

I have to agree, KVM also has quirks that users are not aware of.

>
> Let's say if a user runs a system-wide PEBS event which tries to collect
> both host and guest, they should only get samples from the host, right?
> How do they know the reason why there is 0 samples from the guest?

The same thing happened at pmu-unfunctional ctx like enclave and SMM.

perf-report even has a parameter option that can change the denominator
of all collected samples (it thinks that's the whole story for executed
payload), while making the data appear distorted and conclusions absurd.

>
>>
>> If we do more implementation to make PEBS on the host profiling guest,
>> with shared memory and read/write ordering, we don't need to change
>> intel_pmu_hw_config() any more, any of the required change is not out
>> of the scope of intel_guest_get_msrs().
>
> The intel_pmu_hw_config() is to check whether the feature is supported.
> If we support the host profiling guest later, we should remove the check
> in the intel_pmu_hw_config().
>
>>
>> To move forward, this fix won't prevent you from exploring more further
>> fixes in the intel_pmu_hw_config() or elsewhere, could we agree on that ?
>
> I think we need a real fix (from error handling to MSR manipulation),
> not just a workaround.
>
> I know the current error handling of perf is not perfect. Some KVM
> failures may not be able to be explicitly passed to the end user, e.g.,
> if I recall correctly PEBS crossmapping. But it doesn't mean it's not
> important. We should try our best to notify the end user when it's possible.

It sounds like we need to add a little more detail to the header of the samples.

I've also noticed that some users don't use perf/tool but develop directly
based on syscall, but it's not that simple to abstract some generic
prompt messages since '-Einval' is not enough for wider cases.

>
> I think we can patch both intel_pmu_hw_config() and
> intel_guest_get_msrs() if necessary.

Thank you and thanks for your comments. At the very least, KVM need this
minor fix on intel_guest_get_msrs() to alleviate my guilt.

And let's see what we will get on som thread around intel_pmu_hw_config().

>
> Thanks,
> Kan
>>
>>>
>>>> I haven't find a better location than intel_guest_get_msrs().
>>>>
>>>>>
>>>>>
>>>>>> I understand your motivation very well, but this is not the right move
>>>>>> (based on my previous history of being sprayed by Peter). For
>>>>>> perf/core,
>>>>>> in-kernel perf_events should be treated equally, and the semantics of
>>>>>> kvm_pmu should only be accounted when a perf/core API is only used for
>>>>>> guest-only path. In this case for KVM perf_events,
>>>>>> intel_guest_get_msrs()
>>>>>> and x86_pmu_handle_guest_pebs() have this context.
>>>>>>
>>>>>>>
>>>>>>> diff --git a/arch/x86/events/intel/core.c
>>>>>>> b/arch/x86/events/intel/core.c
>>>>>>> index a968708ed1fb..c93a2aaff7c3 100644
>>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>>> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct
>>>>>>> perf_event
>>>>>>> *event)
>>>>>>>             if ((event->attr.config & INTEL_ARCH_EVENT_MASK) ==
>>>>>>> INTEL_FIXED_VLBR_EVENT)
>>>>>>>                 return -EINVAL;
>>>>>>>
>>>>>>> +        /*
>>>>>>> +         * Profiling guest via PEBS-DS buffer on host is not
>>>>>>> supported.
>>>>>>> +         * The event->overflow_handler_context is to distinguish
>>>>>>> a KVM
>>>>>>> +         * event and a normal host event.
>>>>>>> +         */
>>>>>>> +        if (event->attr.exclude_host &&
>>>>>>> +            !is_pebs_pt(event) &&
>>>>>>> +            !event->overflow_handler_context)
>>>>>>> +            return -EINVAL;
>>>>>>> +
>>>>>>>             if (!(event->attr.freq || (event->attr.wakeup_events &&
>>>>>>> !event->attr.watermark))) {
>>>>>>>                 event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
>>>>>>>                 if (!(event->attr.sample_type &
>>>>>>>
>>>>>>>>
>>>>>>>> The status of the guest can only be queried in the NMI handler and
>>>>>>>> the func
>>>>>>>> intel_guest_get_msrs() in the perf/core context, where it's easier
>>>>>>>> and more
>>>>>>>> centrally to review this part of changes that affects vPMU for
>>>>>>>> corner
>>>>>>>> cases.
>>>>>>>>
>>>>>>>> Maybe adding print info on the perf-tool side would help.
>>>>>>>>
>>>>>>>> For perf-tool users, it will get 0 number of sample for
>>>>>>>> "cpu-cycles:GP"
>>>>>>>> events,
>>>>>>>> just like other uncounted perf-tool events.
>>>>>>>
>>>>>>> perf-tool would never know such details, e.g., whether the platform
>>>>>>> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is
>>>>>>> because of an unsupported hardware or nothing sampled in the guest.
>>>>>>
>>>>>> It kind of looks like this use case to me:
>>>>>>
>>>>>>       perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload #
>>>>>> ICX
>>>>>>
>>>>>> # Total Lost Samples: 0
>>>>>> #
>>>>>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/'
>>>>>> # Event count (approx.): 0
>>>>>>
>>>>>> A end-user has to check if the event-umask combination is supported
>>>>>> or not,
>>>>>> or nothing sampled for the workload. Is there any room for
>>>>>> improvement in
>>>>>> perf-tool to reduce the pain of this part ? If any, the same thing
>>>>>> could
>>>>>> be applied
>>>>>> to cpu-cycles:GP, isn't it ?
>>>>>
>>>>> I don't think we can expect the end user knows such details. Most of
>>>>> them may even don't know what's PEBS-via-DS.
>>>>
>>>> Maybe the following generic reminder helps:
>>>>
>>>> # Total Lost Samples: 0
>>>> # Note: the event is not counted or unsupported.
>>>> #
>>>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/'
>>>> # Event count (approx.): 0
>>>>
>>>>>
>>>>> Thanks,
>>>>> Kan
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kan
>>>>>>>>
>>>>>>>>>              if (!(event->attr.freq ||
>>>>>>>>> (event->attr.wakeup_events &&
>>>>>>>>> !event->attr.watermark))) {
>>>>>>>>>                  event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
>>>>>>>>>                  if (!(event->attr.sample_type &
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kan
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cc: Peter Zijlstra (Intel) <[email protected]>
>>>>>>>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR
>>>>>>>>>> emulation for extended PEBS")
>>>>>>>>>> Signed-off-by: Like Xu <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>      arch/x86/events/intel/core.c | 8 +++++++-
>>>>>>>>>>      1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/x86/events/intel/core.c
>>>>>>>>>> b/arch/x86/events/intel/core.c
>>>>>>>>>> index a08f794a0e79..17afd504c35b 100644
>>>>>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>>>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr
>>>>>>>>>> *intel_guest_get_msrs(int *nr, void *data)
>>>>>>>>>>              .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask,
>>>>>>>>>>          };
>>>>>>>>>>      +    /* In any case, clear guest PEBS bits first. */
>>>>>>>>>> +    arr[global_ctrl].guest &= ~arr[pebs_enable].guest;
>>>>>>>>>> +
>>>>>>>>>>          if (arr[pebs_enable].host) {
>>>>>>>>>>              /* Disable guest PEBS if host PEBS is enabled. */
>>>>>>>>>>              arr[pebs_enable].guest = 0;
>>>>>>>>>>          } else {
>>>>>>>>>>              /* Disable guest PEBS thoroughly for cross-mapped
>>>>>>>>>> PEBS
>>>>>>>>>> counters. */
>>>>>>>>>>              arr[pebs_enable].guest &=
>>>>>>>>>> ~kvm_pmu->host_cross_mapped_mask;
>>>>>>>>>> -        arr[global_ctrl].guest &=
>>>>>>>>>> ~kvm_pmu->host_cross_mapped_mask;
>>>>>>>>>> +
>>>>>>>>>> +        /* Prevent any host user from enabling PEBS for profiling
>>>>>>>>>> guest. */
>>>>>>>>>> +        arr[pebs_enable].guest &= (kvm_pmu->pebs_enable &
>>>>>>>>>> kvm_pmu->global_ctrl);
>>>>>>>>>> +
>>>>>>>>>>              /* Set hw GLOBAL_CTRL bits for PEBS counter when
>>>>>>>>>> it runs
>>>>>>>>>> for guest */
>>>>>>>>>>              arr[global_ctrl].guest |= arr[pebs_enable].guest;
>>>>>>>>>>          }
>>>>>>>>>>
>>>>>>>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
>>>>