Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37;
Date:   Mon, 2 Oct 2023 08:56:50 -0700
In-Reply-To: <ZRrF38RGllA04R8o@gmail.com>
Mime-Version: 1.0
References: <20230927033124.1226509-1-dapeng1.mi@linux.intel.com>
 <20230927033124.1226509-8-dapeng1.mi@linux.intel.com> <20230927113312.GD21810@noisy.programming.kicks-ass.net>
 <ZRRl6y1GL-7RM63x@google.com> <20230929115344.GE6282@noisy.programming.kicks-ass.net>
 <ZRbxb15Opa2_AusF@google.com> <20231002115718.GB13957@noisy.programming.kicks-ass.net>
 <ZRrF38RGllA04R8o@gmail.com>
Message-ID: <ZRroQg6flyGBtZTG@google.com>
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
From:   Sean Christopherson <seanjc@google.com>
To:     Ingo Molnar <mingo@kernel.org>
Cc:     Peter Zijlstra <peterz@infradead.org>,
        Dapeng Mi <dapeng1.mi@linux.intel.com>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Kan Liang <kan.liang@linux.intel.com>,
        Like Xu <likexu@tencent.com>,
        Mark Rutland <mark.rutland@arm.com>,
        Alexander Shishkin <alexander.shishkin@linux.intel.com>,
        Jiri Olsa <jolsa@kernel.org>,
        Namhyung Kim <namhyung@kernel.org>,
        Ian Rogers <irogers@google.com>,
        Adrian Hunter <adrian.hunter@intel.com>, kvm@vger.kernel.org,
        linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
        Zhenyu Wang <zhenyuw@linux.intel.com>,
        Zhang Xiong <xiong.y.zhang@intel.com>,
        Lv Zhiyuan <zhiyuan.lv@intel.com>,
        Yang Weijiang <weijiang.yang@intel.com>,
        Dapeng Mi <dapeng1.mi@intel.com>,
        Jim Mattson <jmattson@google.com>,
        David Dunn <daviddunn@google.com>,
        Mingwei Zhang <mizhang@google.com>,
        Thomas Gleixner <tglx@linutronix.de>
Content-Type: text/plain; charset="us-ascii"
Precedence: bulk

On Mon, Oct 02, 2023, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:
> > 
> > > > I will firmly reject anything that takes the PMU away from the host
> > > > entirely through.
> > > 
> > > Why?  What is so wrong with supporting use cases where the platform owner *wants*
> > > to give up host PMU and NMI watchdog functionality?  If disabling host PMU usage
> > > were complex, highly invasive, and/or difficult to maintain, then I can understand
> > > the pushback.  
> > 
> > Because it sucks.
>
> > You're forcing people to choose between no host PMU or a slow guest PMU.

Nowhere did I say that we wouldn't take patches to improve the existing vPMU
support.  But that's largely a moot point because I don't think it's possible to
improve the current approach to the point where it would provide a performant,
functional guest PMU.

> > And that's simply not a sane choice for most people --

It's better than the status quo, which is that no one gets to choose, everyone
gets a slow guest PMU.

> > worse it's not a choice based in technical reality.

The technical reality is that context switching the PMU between host and guest
requires reading and writing far too many MSRs for KVM to be able to context
switch at every VM-Enter and every VM-Exit.  And PMIs skidding past VM-Exit adds
another layer of complexity to deal with.

> > It's a choice out of lazyness, disabling host PMU is not a requirement
> > for pass-through.

The requirement isn't passthrough access, the requirements are that the guest's
PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
guest doesn't have a meaningful impact on guest performance.

> Not just a choice of laziness, but it will clearly be forced upon users
> by external entities:
> 
>    "Pass ownership of the PMU to the guest and have no host PMU, or you
>     won't have sane guest PMU support at all. If you disagree, please open
>     a support ticket, which we'll ignore."

We don't have sane guest PMU support today.  In the 12+ years since commit
f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests"), KVM has
never provided anything remotely close to a sane vPMU.  It *mostly* works if host
perf is quiesced, but that "good enough" approach doesn't suffice for any form of
PMU usage that requires a high level of accuracy and precision.

> The host OS shouldn't offer facilities that severely limit its own capabilities,
> when there's a better solution. We don't give the FPU to apps exclusively either,
> it would be insanely stupid for a platform to do that.

The FPU can be effeciently context switched, guest state remains resident in
hardware so long as the vCPU task is scheduled in (ignoring infrequrent FPU usage
from IRQ context), and guest usage of the FPU doesn't require trap-and-emulate
behavior in KVM.

As David said, ceding the hardware PMU for all of kvm_arch_vcpu_ioctl_run()
(module the vCPU task being scheduled out) is likely a viable alternative.

 : But it does mean that when entering the KVM run loop, the host perf system 
 : needs to context switch away the host PMU state and allow KVM to load the guest
 : PMU state.  And much like the FPU situation, the portion of the host kernel
 : that runs between the context switch to the KVM thread and VMENTER to the guest
 : cannot use the PMU.

If y'all are willing to let KVM redefined exclude_guest to be KVM's outer run
loop, then I'm all for exploring that option.  But that idea got shot down over
a year ago[*].  Or at least, that was my reading of things.  Maybe it was just a
misunderstanding because we didn't do a good job of defining the behavior.

I am completely ok with either approach, but I am not ok with being nak'd on both.
Because unless there's a magical third option lurking, those two options are the
only ways for KVM to provide a vPMU that meets the requirements for slice-of-hardware
use cases.

[*] https://lore.kernel.org/all/YgPCm1WIt9dHuoEo@hirez.programming.kicks-ass.net