References: <20230927033124.1226509-1-dapeng1.mi@linux.intel.com>
 <20230927033124.1226509-8-dapeng1.mi@linux.intel.com>
 <20230927113312.GD21810@noisy.programming.kicks-ass.net>
 <20230929115344.GE6282@noisy.programming.kicks-ass.net>
 <20231002115718.GB13957@noisy.programming.kicks-ass.net>
 <20231002204017.GB27267@noisy.programming.kicks-ass.net>
From: Mingwei Zhang
Date: Tue, 3 Oct 2023 15:02:52 -0700
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
To: Sean Christopherson
Cc: Peter Zijlstra, Ingo Molnar, Dapeng Mi, Paolo Bonzini, Arnaldo Carvalho de Melo, Kan Liang, Like Xu, Mark
 Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
 Adrian Hunter, kvm@vger.kernel.org, linux-perf-users@vger.kernel.org,
 linux-kernel@vger.kernel.org, Zhenyu Wang, Zhang Xiong, Lv Zhiyuan,
 Yang Weijiang, Dapeng Mi, Jim Mattson, David Dunn, Thomas Gleixner

On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson wrote:
>
> On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 08:56:50AM -0700, Sean Christopherson wrote:
> > > > worse it's not a choice based in technical reality.
> > >
> > > The technical reality is that context switching the PMU between host and guest
> > > requires reading and writing far too many MSRs for KVM to be able to context
> > > switch at every VM-Enter and every VM-Exit. And PMIs skidding past VM-Exit adds
> > > another layer of complexity to deal with.
> >
> > I'm not sure what you're suggesting here. It will have to save/restore
> > all those MSRs anyway. Suppose it switches between vCPUs.
>
> The "when" is what's important. If KVM took a literal interpretation of
> "exclude guest" for pass-through MSRs, then KVM would context switch all those
> MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> reschedule IRQ to schedule in a different task (or vCPU).
> The overhead to save
> all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> <1500 cycles, and "fastpath" exits are something like half that. Switching all
> the MSRs is likely 1000+ cycles, if not double that.

Hi Sean,

Sorry, I have no intention to interrupt the conversation, but this is
slightly confusing to me.

I remember that when doing AMX, we added a gigantic 8KB buffer to the
FPU context switch. That works well in Linux today. Why can't we do
the same for the PMU, i.e., context switch all the counters, selectors,
and global state there?

On the VM boundary, all we need to touch is global ctrl, right? We stop
all counters when we exit from the guest and restore the guest value of
global ctrl when entering it. But the actual PMU context switch should
be deferred to roughly the same point where we switch the FPU (xsave
state), i.e., when switching task_struct and/or returning to userspace.
Please kindly correct me if this is flawed.

Ah, I think I understand what you are saying... So, "If KVM took a
literal interpretation of 'exclude guest' for pass-through MSRs...":
perf_event.attr.exclude_guest might need a different meaning if we have
a pass-through PMU for KVM. exclude_guest=1 does not mean the counters
are restored at the VM-Exit boundary, which would be a disaster if we
did that.

Thanks.
-Mingwei

> FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> vCPU is pinned 1:1 with a host pCPU. I suspect it's a similar story for the other
> CSPs that are trying to provide accurate PMUs to guests. If a vCPU is scheduled
> out, then yes, a bunch of context switching will need to happen. But for the
> types of VMs that are the target audience, their vCPUs will rarely be scheduled
> out.
>
> > > > It's a choice out of laziness, disabling host PMU is not a requirement
> > > > for pass-through.
> > >
> > > The requirement isn't passthrough access, the requirements are that the guest's
> > > PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
> > > guest doesn't have a meaningful impact on guest performance.
> >
> > Given you don't think that trapping MSR accesses is viable, what else
> > besides pass-through did you have in mind?
>
> Sorry, I didn't mean to imply that we don't want pass-through of MSRs. What I was
> trying to say is that *just* passthrough MSRs doesn't solve the problem, because
> again I thought the whole "context switch PMU state less often" approach had been
> firmly nak'd.
>
> > > > Not just a choice of laziness, but it will clearly be forced upon users
> > > > by external entities:
> > > >
> > > >   "Pass ownership of the PMU to the guest and have no host PMU, or you
> > > >    won't have sane guest PMU support at all. If you disagree, please open
> > > >    a support ticket, which we'll ignore."
> > >
> > > We don't have sane guest PMU support today.
> >
> > Because KVM is too damn hard to use, rebooting a machine is *sooo* much
> > easier -- and I'm really not kidding here.
> >
> > Anyway, you want pass-through, but that doesn't mean the host cannot use
> > the PMU when the vCPU thread is not running.
> >
> > > If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
> > > loop, then I'm all for exploring that option. But that idea got shot down over
> > > a year ago[*].
> >
> > I never saw that idea in that thread. You virt people keep talking like
> > I know how KVM works -- I'm not joking when I say I have no clue about
> > virt.
> >
> > Sometimes I get a little clue after y'all keep bashing me over the head,
> > but it quickly erases itself.
> >
> > > Or at least, that was my reading of things. Maybe it was just a
> > > misunderstanding because we didn't do a good job of defining the behavior.
> >
> > This might be the case.
> > I don't particularly care where the guest
> > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > PMU is usable again etc..
>
> Well drat, that there would have saved a wee bit of frustration. Better late
> than never though, that's for sure.
>
> Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> out or KVM exits to userspace, would mean that host perf events won't be active
> for potentially large swaths of non-KVM code. Any function calls or event/exception
> handlers that occur within the context of ioctl(KVM_RUN) would run with host
> perf events disabled.
>
> Are you ok with that approach? Assuming we don't completely botch things, the
> interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
>
> > Re-reading parts of that linked thread, I see mention of
> > PT_MODE_HOST_GUEST -- see I knew we had something there, but I can never
> > remember all that nonsense. Worst part is that I can't find the relevant
> > perf code when I grep for that string :/
>
> The PT stuff is actually an example of what we don't want, at least not exactly.
> The concept of a hard switch between guest and host is ok, but as-is, KVM's PT
> code does a big pile of MSR reads and writes on every VM-Enter and VM-Exit.
>
> > Anyway, what I don't like is KVM silently changing all events to
> > ::exclude_guest=1. I would like all (pre-existing) ::exclude_guest=0
> > events to hard error when they run into a vCPU with pass-through on
> > (PERF_EVENT_STATE_ERROR). I would like event-creation to error out on
> > ::exclude_guest=0 events when a vCPU with pass-through exists -- with
> > minimal scope (this probably means all CPU events, but only relevant
> > vCPU events).
>
> Agreed, I am definitely against KVM silently doing anything. And the more that
> is surfaced to the user, the better.
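The event-creation policy Peter describes above -- hard-erroring pre-existing ::exclude_guest=0 events when a pass-through vCPU appears, and refusing new ones while it exists -- can be sketched as a toy userspace model. All names below (toy_event, toy_passthrough_enable, etc.) are invented for illustration; this is not the perf core implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: real perf events live in kernel perf_event structs. */
enum toy_event_state { TOY_ACTIVE, TOY_ERROR };

struct toy_event {
	bool exclude_guest;
	enum toy_event_state state;
};

static int nr_passthrough_vcpus;

/* Event creation: refuse exclude_guest=0 while pass-through is active. */
static int toy_event_create(struct toy_event *ev, bool exclude_guest)
{
	if (!exclude_guest && nr_passthrough_vcpus > 0)
		return -1; /* the kernel would return -EBUSY or similar */
	ev->exclude_guest = exclude_guest;
	ev->state = TOY_ACTIVE;
	return 0;
}

/* Pass-through enable: hard-error all existing exclude_guest=0 events. */
static void toy_passthrough_enable(struct toy_event *evs, size_t n)
{
	nr_passthrough_vcpus++;
	for (size_t i = 0; i < n; i++)
		if (!evs[i].exclude_guest)
			evs[i].state = TOY_ERROR;
}
```

The point of the model is the ordering: events created before pass-through get demoted to an error state rather than silently rewritten, and creation afterwards fails loudly, so nothing happens behind the user's back.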
>
> > It also means ::exclude_guest should actually work -- it often does not
> > today -- the IBS thing for example totally ignores it.
>
> Is that already in-tree, or are you talking about Manali's proposed series to
> support virtualizing IBS?
>
> > Anyway, none of this means host cannot use PMU because virt muck wants
> > it.
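To make the cost argument from earlier in the thread concrete, here is a back-of-the-envelope model of the two switching policies being debated: a full PMU MSR switch on every VM-Exit/VM-Enter roundtrip, versus touching only global ctrl at the VM boundary and deferring the full switch to vCPU sched-out/sched-in, as with FPU/AMX state. The cycle numbers are assumptions loosely based on the rough figures quoted above, not measurements:

```c
#include <assert.h>

/* Assumed costs, for illustration only. */
#define FULL_SWITCH_CYCLES   1000 /* save or restore the whole PMU MSR set */
#define GLOBAL_CTRL_CYCLES     50 /* one WRMSR to stop/restart counters */

/* Policy 1: full PMU context switch on every VM-Exit and VM-Enter. */
static long eager_cost(long exits)
{
	return exits * 2 * FULL_SWITCH_CYCLES;
}

/*
 * Policy 2: only global ctrl is toggled at the VM boundary; the full
 * MSR switch happens once per vCPU sched-out and sched-in.
 */
static long deferred_cost(long exits, long sched_switches)
{
	return exits * 2 * GLOBAL_CTRL_CYCLES +
	       sched_switches * 2 * FULL_SWITCH_CYCLES;
}
```

Under these assumptions, for a pinned slice-of-hardware vCPU that takes many exits but is rarely scheduled out, the deferred policy is cheaper by roughly the ratio of the full-switch cost to the global-ctrl cost, which is the crux of the pass-through argument.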