2023-09-27 14:45:37

by Peter Zijlstra

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Wed, Sep 27, 2023 at 11:31:18AM +0800, Dapeng Mi wrote:
> When a guest wants to use the PERF_METRICS MSR, a virtual metrics event needs
> to be created in the perf subsystem so that the guest can have exclusive
> ownership of the PERF_METRICS MSR.

Urgh, can someone please remind me how all that is supposed to work
again? The guest is just a task that wants the event. If the
host creates a CPU event, then that gets scheduled with higher priority
and the task loses out, no joy.

So you cannot guarantee the guest gets anything.

That is, I remember we've had this exact problem before, but I keep
forgetting how this all is supposed to work. I don't use this virt stuff
(and every time I try qemu arguments defeat me and I give up in
disgust).



2023-09-28 05:40:12

by Sean Christopherson

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

+Jim, David, and Mingwei

On Wed, Sep 27, 2023, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 11:31:18AM +0800, Dapeng Mi wrote:
> > When a guest wants to use the PERF_METRICS MSR, a virtual metrics event needs
> > to be created in the perf subsystem so that the guest can have exclusive
> > ownership of the PERF_METRICS MSR.
>
> Urgh, can someone please remind me how all that is supposed to work
> again? The guest is just a task that wants the event. If the
> host creates a CPU event, then that gets scheduled with higher priority
> and the task loses out, no joy.
>
> So you cannot guarantee the guest gets anything.
>
> That is, I remember we've had this exact problem before, but I keep
> forgetting how this all is supposed to work. I don't use this virt stuff
> (and every time I try qemu arguments defeat me and I give up in
> disgust).

I don't think it does work, at least not without a very, very carefully crafted
setup and a host userspace that knows it must not use certain aspects of perf.
E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
in hardware, KVM+perf simply disables the counter.

And for top-down slots, getting anything remotely accurate requires pinning vCPUs
1:1 with pCPUs and enumerating an accurate topology to the guest:

The count is distributed among unhalted logical processors (hyper-threads) who
share the same physical core, in processors that support Intel Hyper-Threading
Technology.

Jumping the gun a bit (we're in the *super* early stages of scraping together a
rough PoC), but I think we should effectively put KVM's current vPMU support into
maintenance-only mode, i.e. stop adding new features unless they are *very* simple
to enable, and instead pursue an implementation that (a) lets userspace (and/or
the kernel builder) completely disable host perf (or possibly just host perf usage
of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
has been turned off in the host.

I.e. keep KVM's existing best-effort vPMU support, e.g. for setups where the
platform owner is also the VM user (running a Windows VM on a Linux box, hosting
a Linux VM in ChromeOS, etc...). But for anything advanced and for hard guarantees,
e.g. cloud providers that want to expose fully featured vPMU to customers, force
the platform owner to choose between using perf (or again, perf with hardware PMU)
in the host, and exposing the hardware PMU to the guest.

Hardware vendors are pushing us in this direction whether we like it or not, e.g.
SNP and TDX want to disallow profiling the guest from the host, ARM has an
upcoming PMU model where (IIUC) it can't be virtualized without a passthrough
approach, Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,
and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
requires an absurd amount of complexity throughout the kernel and userspace.

Note, a similar idea was floated and rejected in the past[*], but that failed
proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
which I agree would create an awful ABI for the host. If we make the "knob" a
Kconfig or kernel param, i.e. require the platform owner to opt-out of using perf
no later than at boot time, then I think we can provide a sane ABI, keep the
implementation simple, all without breaking existing users that utilize perf in
the host to profile guests.

[*] https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5Q093K3rMUeAsF3vEL39MGV5Br9wEQQ@mail.gmail.com

2023-09-28 17:49:55

by Mi, Dapeng

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event


On 9/28/2023 1:27 AM, Sean Christopherson wrote:
> +Jim, David, and Mingwei
>
> On Wed, Sep 27, 2023, Peter Zijlstra wrote:
>> On Wed, Sep 27, 2023 at 11:31:18AM +0800, Dapeng Mi wrote:
>>> When a guest wants to use the PERF_METRICS MSR, a virtual metrics event needs
>>> to be created in the perf subsystem so that the guest can have exclusive
>>> ownership of the PERF_METRICS MSR.
>> Urgh, can someone please remind me how all that is supposed to work
>> again? The guest is just a task that wants the event. If the
>> host creates a CPU event, then that gets scheduled with higher priority
>> and the task loses out, no joy.


It looks like I used inaccurate wording in the comments. Yes, it's not
*exclusive* from the host's point of view. Currently the perf events created
by KVM are task-pinned events, so they can indeed be preempted by CPU-pinned
host events, which have higher priority. This is a long-standing issue for
the vPMU. We have had some internal discussions about it, but it seems there
is no good way to solve it thoroughly within the current vPMU framework.

But if there are no such CPU-pinned events with the highest priority on the
host, KVM's perf events can share the HW resources with other host events
via time-multiplexing.
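
For reference, a minimal sketch of how such a task-bound counter is requested
from inside the kernel (the function name and field values are illustrative,
not the actual vPMU code):

#include <linux/perf_event.h>
#include <linux/sched.h>

/*
 * Illustrative sketch only: roughly how KVM asks perf for a counter on
 * behalf of a guest PMC. Because the event is bound to the vCPU task
 * (cpu == -1, task != NULL), a CPU-pinned host event still wins
 * scheduling and this event gets multiplexed or dropped.
 */
static struct perf_event *sketch_guest_counter(struct task_struct *vcpu_task,
					       u64 raw_config)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_RAW,
		.size		= sizeof(attr),
		.config		= raw_config,
		.pinned		= 1,	/* pinned within the task context only */
		.exclude_host	= 1,	/* count only while the guest runs */
	};

	return perf_event_create_kernel_counter(&attr, -1, vcpu_task,
						NULL, NULL);
}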

>> So you cannot guarantee the guest gets anything.
>>
>> That is, I remember we've had this exact problem before, but I keep
>> forgetting how this all is supposed to work. I don't use this virt stuff
>> (and every time I try qemu arguments defeat me and I give up in
>> disgust).
> I don't think it does work, at least not without a very, very carefully crafted
> setup and a host userspace that knows it must not use certain aspects of perf.
> E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
> in hardware, KVM+perf simply disables the counter.
>
> And for top-down slots, getting anything remotely accurate requires pinning vCPUs
> 1:1 with pCPUs and enumerating an accurate topology to the guest:
>
> The count is distributed among unhalted logical processors (hyper-threads) who
> share the same physical core, in processors that support Intel Hyper-Threading
> Technology.
>
> Jumping the gun a bit (we're in the *super* early stages of scraping together a
> rough PoC), but I think we should effectively put KVM's current vPMU support into
> maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> to enable, and instead pursue an implementation that (a) lets userspace (and/or
> the kernel builder) completely disable host perf (or possibly just host perf usage
> of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> has been turned off in the host.
>
> I.e. keep KVM's existing best-effort vPMU support, e.g. for setups where the
> platform owner is also the VM user (running a Windows VM on a Linux box, hosting
> a Linux VM in ChromeOS, etc...). But for anything advanced and for hard guarantees,
> e.g. cloud providers that want to expose fully featured vPMU to customers, force
> the platform owner to choose between using perf (or again, perf with hardware PMU)
> in the host, and exposing the hardware PMU to the guest.
>
> Hardware vendors are pushing us in this direction whether we like it or not, e.g.
> SNP and TDX want to disallow profiling the guest from the host, ARM has an
> upcoming PMU model where (IIUC) it can't be virtualized without a passthrough
> approach, Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,
> and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
> requires an absurd amount of complexity throughout the kernel and userspace.
>
> Note, a similar idea was floated and rejected in the past[*], but that failed
> proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> which I agree would create an awful ABI for the host. If we make the "knob" a
> Kconfig or kernel param, i.e. require the platform owner to opt-out of using perf
> no later than at boot time, then I think we can provide a sane ABI, keep the
> implementation simple, all without breaking existing users that utilize perf in
> the host to profile guests.
>
> [*] https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5Q093K3rMUeAsF3vEL39MGV5Br9wEQQ@mail.gmail.com

2023-09-29 12:06:52

by Peter Zijlstra

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:

> I don't think it does work, at least not without a very, very carefully crafted
> setup and a host userspace that knows it must not use certain aspects of perf.
> E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
> in hardware, KVM+perf simply disables the counter.

I have distinct memories of there being patches to rewrite the PEBS
buffer, but I really can't remember what we ended up doing. Like I said,
I can't operate KVM in any meaningful way -- it's a monster :-(

> And for top-down slots, getting anything remotely accurate requires pinning vCPUs
> 1:1 with pCPUs and enumerating an accurate topology to the guest:
>
> The count is distributed among unhalted logical processors (hyper-threads) who
> share the same physical core, in processors that support Intel Hyper-Threading
> Technology.

So IIRC slots is per logical CPU, it counts the actual pipeline stages
going towards that logical CPU, this is required to make it work on SMT
at all -- even for native.

But it's been a long while since that was explained -- and because it
was a call, I can't very well read it back, god how I hate calls :-(

> Jumping the gun a bit (we're in the *super* early stages of scraping together a
> rough PoC), but I think we should effectively put KVM's current vPMU support into
> maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> to enable, and instead pursue an implementation that (a) lets userspace (and/or
> the kernel builder) completely disable host perf (or possibly just host perf usage
> of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> has been turned off in the host.

I don't think you need to go that far, host can use PMU just fine as
long as it doesn't overlap with a vCPU. Basically, if you force
perf_attr::exclude_guest on everything your vCPU can haz the full thing.
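
A minimal userspace sketch of what "force perf_attr::exclude_guest on
everything" amounts to (illustrative only, not from the patch series):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative only: a host-side cycles event that never counts while a
 * vCPU is in guest mode, so it cannot conflict with a pass-through
 * guest PMU.
 */
static int open_host_only_cycles(pid_t pid, int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_guest = 1;		/* the bit being forced here */

	return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
}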

> Hardware vendors are pushing us in this direction whether we like it or not, e.g.
> SNP and TDX want to disallow profiling the guest from the host,

Yeah, sekjoerity model etc.. bah.

> ARM has an upcoming PMU model where (IIUC) it can't be virtualized
> without a passthrough approach,

:-(

> Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,

Anybody's hybrid things are a clusterfuck, hybrid vs virt doesn't work
sanely on ARM either AFAIU.

I intensely dislike hybrid (and virt ofc), but alas we get to live with
that mess :/ And it's only going to get worse I fear..

At least (for now) AMD hybrid is committed to identical ISA, including
PMUs with their Zen4+Zen4c things. We'll have to wait and see how
that'll end up.

> and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
> requires an absurd amount of complexity throughout the kernel and userspace.

I'm not sure about top-down, the other two, for sure.

My main beef with top-down is the ludicrously bad hardware interface we
have on big cores, I like the atom interface a *ton* better.

> Note, a similar idea was floated and rejected in the past[*], but that failed
> proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> which I agree would create an awful ABI for the host. If we make the "knob" a
> Kconfig

Must not be Kconfig, distros would have no sane choice.

> or kernel param, i.e. require the platform owner to opt-out of using perf
> no later than at boot time, then I think we can provide a sane ABI, keep the
> implementation simple, all without breaking existing users that utilize perf in
> the host to profile guests.

It's a shit choice to have to make. At the same time I'm not sure I have
a better proposal.

It does mean a host cannot profile one guest and have pass-through on the
other. Eg. have a development and production guest on the same box. This
is pretty crap.

Making it a guest-boot-option would allow that, but then the host gets
complicated again. I think I can make it trivially work for per-task
events, simply error the creation of events without exclude_guest for
affected vCPU tasks. But the CPU events are tricky.


I will firmly reject anything that takes the PMU away from the host
entirely though.

Also, NMI watchdog needs a solution.. Ideally hardware grows a second
per-CPU timer we can program to NMI.


2023-09-30 03:52:10

by Ravi Bangoria

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On 29-Sep-23 5:23 PM, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:
>
>> I don't think it does work, at least not without a very, very carefully crafted
>> setup and a host userspace that knows it must not use certain aspects of perf.
>> E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
>> in hardware, KVM+perf simply disables the counter.
>
> I have distinct memories of there being patches to rewrite the PEBS
> buffer, but I really can't remember what we ended up doing. Like I said,
> I can't operate KVM in any meaningful way -- it's a monster :-(
>
>> And for top-down slots, getting anything remotely accurate requires pinning vCPUs
>> 1:1 with pCPUs and enumerating an accurate topology to the guest:
>>
>> The count is distributed among unhalted logical processors (hyper-threads) who
>> share the same physical core, in processors that support Intel Hyper-Threading
>> Technology.
>
> So IIRC slots is per logical CPU, it counts the actual pipeline stages
> going towards that logical CPU, this is required to make it work on SMT
> at all -- even for native.
>
> But it's been a long while since that was explained -- and because it
> was a call, I can't very well read it back, god how I hate calls :-(
>
>> Jumping the gun a bit (we're in the *super* early stages of scraping together a
>> rough PoC), but I think we should effectively put KVM's current vPMU support into
>> maintenance-only mode, i.e. stop adding new features unless they are *very* simple
>> to enable, and instead pursue an implementation that (a) lets userspace (and/or
>> the kernel builder) completely disable host perf (or possibly just host perf usage
>> of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
>> has been turned off in the host.
>
> I don't think you need to go that far, host can use PMU just fine as
> long as it doesn't overlap with a vCPU. Basically, if you force
> perf_attr::exclude_guest on everything your vCPU can haz the full thing.
>
>> Hardware vendors are pushing us in this direction whether we like it or not, e.g.
>> SNP and TDX want to disallow profiling the guest from the host,
>
> Yeah, sekjoerity model etc.. bah.
>
>> ARM has an upcoming PMU model where (IIUC) it can't be virtualized
>> without a passthrough approach,
>
> :-(
>
>> Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,
>
> Anybody's hybrid things are a clusterfuck, hybrid vs virt doesn't work
> sanely on ARM either AFAIU.
>
> I intensely dislike hybrid (and virt ofc), but alas we get to live with
> that mess :/ And it's only going to get worse I fear..
>
> At least (for now) AMD hybrid is committed to identical ISA, including
> PMUs with their Zen4+Zen4c things. We'll have to wait and see how
> that'll end up.
>
>> and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
>> requires an absurd amount of complexity throughout the kernel and userspace.
>
> I'm not sure about top-down, the other two, for sure.
>
> My main beef with top-down is the ludicrously bad hardware interface we
> have on big cores, I like the atom interface a *ton* better.
>
>> Note, a similar idea was floated and rejected in the past[*], but that failed
>> proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
>> which I agree would create an awful ABI for the host. If we make the "knob" a
>> Kconfig
>
> Must not be Kconfig, distros would have no sane choice.
>
>> or kernel param, i.e. require the platform owner to opt-out of using perf
>> no later than at boot time, then I think we can provide a sane ABI, keep the
>> implementation simple, all without breaking existing users that utilize perf in
>> the host to profile guests.
>
> It's a shit choice to have to make. At the same time I'm not sure I have
> a better proposal.

How about keying off a PMU-specific KVM module parameter? Something
like what Manali has proposed for AMD VIBS? Please see solution 1.1:

https://lore.kernel.org/r/[email protected]
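
A sketch of what such a module parameter could look like (the name
"enable_passthrough_pmu" is made up for illustration; Manali's series
defines its own knob):

#include <linux/module.h>
#include <linux/moduleparam.h>

/*
 * Hypothetical sketch: the platform owner decides at module load time
 * whether the hardware PMU is handed to guests.
 */
static bool enable_passthrough_pmu;
module_param(enable_passthrough_pmu, bool, 0444);
MODULE_PARM_DESC(enable_passthrough_pmu,
		 "Reserve the hardware PMU for guest use (illustrative only)");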

> It does mean a host cannot profile one guest and have pass-through on the
> other. Eg. have a development and production guest on the same box. This
> is pretty crap.
>
> Making it a guest-boot-option would allow that, but then the host gets
> complicated again. I think I can make it trivially work for per-task
> events, simply error the creation of events without exclude_guest for
> affected vCPU tasks. But the CPU events are tricky.
>
>
> I will firmly reject anything that takes the PMU away from the host
> entirely though.
>
> Also, NMI watchdog needs a solution.. Ideally hardware grows a second
> per-CPU timer we can program to NMI.

Thanks,
Ravi

2023-09-30 20:41:02

by Sean Christopherson

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Fri, Sep 29, 2023, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:
> > Jumping the gun a bit (we're in the *super* early stages of scraping together a
> > rough PoC), but I think we should effectively put KVM's current vPMU support into
> > maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> > to enable, and instead pursue an implementation that (a) lets userspace (and/or
> > the kernel builder) completely disable host perf (or possibly just host perf usage
> > of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> > has been turned off in the host.
>
> I don't think you need to go that far, host can use PMU just fine as
> long as it doesn't overlap with a vCPU. Basically, if you force
> perf_attr::exclude_guest on everything your vCPU can haz the full thing.

Complexity aside, my understanding is that the overhead of trapping and emulating
all of the guest counter and MSR accesses results in unacceptably degraded performance
for the guest. And we haven't even gotten to things like arch LBRs where context
switching MSRs between the guest and host is going to be quite costly.

> > Note, a similar idea was floated and rejected in the past[*], but that failed
> > proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> > which I agree would create an awful ABI for the host. If we make the "knob" a
> > Kconfig
>
> Must not be Kconfig, distros would have no sane choice.

Or not only a Kconfig? E.g. similar to how the kernel has
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS and nopku.
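
A sketch of what the boot-time opt-out could look like (the parameter name
"nohostpmu" is made up for illustration, by analogy with "nopku"):

#include <linux/cache.h>
#include <linux/init.h>
#include <linux/types.h>

/* Hypothetical boot-time knob; set once during early boot. */
static bool host_pmu_disabled __ro_after_init;

static int __init parse_nohostpmu(char *str)
{
	host_pmu_disabled = true;
	return 0;
}
early_param("nohostpmu", parse_nohostpmu);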

> > or kernel param, i.e. require the platform owner to opt-out of using perf
> > no later than at boot time, then I think we can provide a sane ABI, keep the
> > implementation simple, all without breaking existing users that utilize perf in
> > the host to profile guests.
>
> It's a shit choice to have to make. At the same time I'm not sure I have
> a better proposal.
>
> It does mean a host cannot profile one guest and have pass-through on the
> other. Eg. have a development and production guest on the same box. This
> is pretty crap.
>
> Making it a guest-boot-option would allow that, but then the host gets
> complicated again. I think I can make it trivially work for per-task
> events, simply error the creation of events without exclude_guest for
> affected vCPU tasks. But the CPU events are tricky.
>
>
> I will firmly reject anything that takes the PMU away from the host
> entirely though.

Why? What is so wrong with supporting use cases where the platform owner *wants*
to give up host PMU and NMI watchdog functionality? If disabling host PMU usage
were complex, highly invasive, and/or difficult to maintain, then I can understand
the pushback.

But if we simply allow hiding hardware PMU support, then isn't the cost to perf
just a few lines in init_hw_perf_events()? And if we put a stake in the ground
and say that exposing "advanced" PMU features to KVM guests requires a passthrough
PMU, i.e. the PMU to be hidden from the host, that will significantly reduce our
maintenance and complexity.
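
Roughly the shape of those "few lines" (a sketch only, assuming a hypothetical
host_pmu_disabled flag set from a boot parameter as sketched earlier):

#include <linux/init.h>
#include <linux/printk.h>

static bool host_pmu_disabled;	/* hypothetical opt-out knob */

static int __init init_hw_perf_events(void)
{
	if (host_pmu_disabled) {
		pr_info("x86/pmu: PMU reserved for guest pass-through, host PMU disabled\n");
		return 0;	/* never register x86_pmu with perf */
	}

	/* ... existing CPU detection and PMU registration, unchanged ... */
	return 0;
}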

The kernel allows disabling almost literally every other feature that is even
remotely optional; I don't understand why the hardware PMU is special.

2023-10-02 14:49:00

by Peter Zijlstra

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Fri, Sep 29, 2023 at 08:50:07PM +0530, Ravi Bangoria wrote:
> On 29-Sep-23 5:23 PM, Peter Zijlstra wrote:

> > I don't think you need to go that far, host can use PMU just fine as
> > long as it doesn't overlap with a vCPU. Basically, if you force
> > perf_attr::exclude_guest on everything your vCPU can haz the full thing.

^ this..

> How about keying off based on PMU specific KVM module parameter? Something
> like what Manali has proposed for AMD VIBS? Please see solution 1.1:
>
> https://lore.kernel.org/r/[email protected]

So I hadn't read that, but isn't that more or less what I proposed
above?

2023-10-02 20:08:00

by Peter Zijlstra

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:

> > I will firmly reject anything that takes the PMU away from the host
> > entirely though.
>
> Why? What is so wrong with supporting use cases where the platform owner *wants*
> to give up host PMU and NMI watchdog functionality? If disabling host PMU usage
> were complex, highly invasive, and/or difficult to maintain, then I can understand
> the pushback.

Because it sucks.

You're forcing people to choose between no host PMU and a slow guest PMU.
And that's simply not a sane choice for most people -- worse, it's not a
choice based in technical reality.

It's a choice out of laziness; disabling the host PMU is not a requirement
for pass-through.

Like I wrote, all we need to do is ensure vCPU tasks will never have a
perf-event scheduled that covers guest mode. Currently this would be
achievable by having event creation for both:

- CPU events without attr::exclude_guest=1, and
- task events for vCPU task of interest without attr::exclude_guest=1

error with -EBUSY or something.

This ensures there are no events active for those vCPU tasks at VMENTER
time and you can haz pass-through.
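
A very rough sketch of that creation-time check (the helper names are made up
for illustration; KVM would have to provide them):

#include <linux/errno.h>
#include <linux/perf_event.h>
#include <linux/sched.h>

/* Hypothetical KVM-provided queries, names invented for this sketch. */
bool vcpu_task_has_passthrough_pmu(struct task_struct *task);
bool cpu_runs_passthrough_vcpu(int cpu);

/*
 * Refuse any new event that could be scheduled while an affected vCPU
 * is in guest mode, unless the event explicitly excludes guest mode.
 */
static int reject_guest_covering_event(struct perf_event_attr *attr,
				       struct task_struct *task, int cpu)
{
	if (attr->exclude_guest)
		return 0;

	/* task event targeting a vCPU task that owns the PMU in guest mode */
	if (task && vcpu_task_has_passthrough_pmu(task))
		return -EBUSY;

	/* CPU event that could run while such a vCPU is on this CPU */
	if (!task && cpu_runs_passthrough_vcpu(cpu))
		return -EBUSY;

	return 0;
}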

2023-10-03 06:37:01

by Ravi Bangoria

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On 02-Oct-23 5:59 PM, Peter Zijlstra wrote:
> On Fri, Sep 29, 2023 at 08:50:07PM +0530, Ravi Bangoria wrote:
>> On 29-Sep-23 5:23 PM, Peter Zijlstra wrote:
>
>>> I don't think you need to go that far, host can use PMU just fine as
>>> long as it doesn't overlap with a vCPU. Basically, if you force
>>> perf_attr::exclude_guest on everything your vCPU can haz the full thing.
>
> ^ this..
>
>> How about keying off based on PMU specific KVM module parameter? Something
>> like what Manali has proposed for AMD VIBS? Please see solution 1.1:
>>
>> https://lore.kernel.org/r/[email protected]
>
> So I hadn't read that, but isn't that more or less what I proposed
> above?

Yes, it's pretty much the same as:

"Like I wrote, all we need to do is ensure vCPU tasks will never have a
perf-event scheduled that covers guest mode. Currently this would be
achievable by having event creation for both:

- CPU events without attr::exclude_guest=1, and
- task events for vCPU task of interest without attr::exclude_guest=1

error with -EBUSY or something."

Thanks,
Ravi