References: <20230927033124.1226509-1-dapeng1.mi@linux.intel.com> <20230927033124.1226509-8-dapeng1.mi@linux.intel.com> <20230927113312.GD21810@noisy.programming.kicks-ass.net> <20230929115344.GE6282@noisy.programming.kicks-ass.net>
From: Jim Mattson
Date: Fri, 29 Sep 2023 20:29:31 -0700
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
To: Sean Christopherson
Cc: Peter Zijlstra, Dapeng Mi, Paolo Bonzini, Arnaldo Carvalho de Melo, Kan Liang, Like Xu, Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, kvm@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org, Zhenyu Wang, Zhang Xiong, Lv Zhiyuan, Yang Weijiang, Dapeng Mi, David Dunn, Mingwei Zhang, Thomas Gleixner, Ingo Molnar

On Fri, Sep 29, 2023 at 8:46 AM Sean Christopherson wrote:
>
> On Fri, Sep 29, 2023, Peter Zijlstra wrote:
> > On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:
> > > Jumping the gun a bit (we're in the *super* early stages of scraping together a
> > > rough PoC), but I think we should effectively put KVM's current vPMU support into
> > > maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> > > to enable, and instead pursue an implementation that (a) lets userspace (and/or
> > > the kernel builder) completely disable host perf (or possibly just host perf usage
> > > of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> > > has been turned off in the host.
> >
> > I don't think you need to go that far, host can use PMU just fine as
> > long as it doesn't overlap with a vCPU. Basically, if you force
> > perf_attr::exclude_guest on everything your vCPU can haz the full thing.
>
> Complexity aside, my understanding is that the overhead of trapping and emulating
> all of the guest counter and MSR accesses results in unacceptably degraded functionality
> for the guest. And we haven't even gotten to things like arch LBRs where context
> switching MSRs between the guest and host is going to be quite costly.

Trapping and emulating all of the PMU MSR accesses is ludicrously slow,
especially when the guest is multiplexing events. Also, the current
scheme of implicitly tying together usage mode and priority means that
KVM's "task pinned" perf_events always lose to someone else's "CPU
pinned" perf_events. Even if those "CPU pinned" perf events are tagged
"exclude_guest," the counters they occupy are not available for KVM's
"exclude_host" events, because host perf won't multiplex a counter
between an "exclude_host" event and an "exclude_guest" event, even
though the two events don't overlap. Frankly, we wouldn't want it to,
because that would introduce egregious overheads at VM-entry and
VM-exit. What we would need is a mechanism for allocating KVM's "task
pinned" perf_events at the highest priority, so that they always win.

For things to work well in the "vPMU as a client of host perf" world,
we need to have the following at a minimum:

1) Guaranteed identity mapping of guest PMCs to host PMCs, so that we
   don't have to intercept accesses to IA32_PERF_GLOBAL_CTRL.
2) Exclusive ownership of the PMU MSRs while in the KVM_RUN loop, so
   that we don't have to switch any PMU MSRs on VM-entry/VM-exit (with
   the exception of IA32_PERF_GLOBAL_CTRL, which has guest and host
   fields in the VMCS).

There are other issues with the current implementation, like the
ridiculous overhead of bumping a counter in software to account for an
emulated instruction. That should just be a RDMSR, an increment, a
WRMSR, and the conditional synthesis of a guest PMI on overflow.
Instead, we have to pause a perf_event and reprogram it before
continuing.
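To make the "RDMSR, increment, WRMSR, conditional PMI" path concrete, here is a minimal user-space C model of it. It is a sketch under stated assumptions, not KVM code: `struct vpmu_model`, `model_rdmsr_pmc()`, `model_wrmsr_pmc()` and `account_emulated_instr()` are hypothetical names, the guest-owned MSRs are plain struct fields, and "synthesizing a PMI" is reduced to setting a flag. It also skips the checks a real implementation would need (whether the counter is enabled and programmed to count the emulated instruction's event).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NUM_GP_COUNTERS 8
/* IA32_PERFEVTSELx bit 20 (INT): raise a PMI when the counter overflows. */
#define EVENTSEL_INT (1ULL << 20)

/* Hypothetical per-vCPU PMU state; each field stands in for a guest-owned MSR. */
struct vpmu_model {
	unsigned int counter_width;               /* e.g. 48, per CPUID.0AH:EAX[23:16] */
	uint64_t pmc[MODEL_NUM_GP_COUNTERS];      /* IA32_PMCx */
	uint64_t eventsel[MODEL_NUM_GP_COUNTERS]; /* IA32_PERFEVTSELx */
	uint64_t global_status;                   /* IA32_PERF_GLOBAL_STATUS */
	bool pmi_pending;                         /* stand-in for "inject a guest PMI" */
};

/* Mask of the bits a width-limited hardware counter actually holds. */
static uint64_t counter_mask(const struct vpmu_model *v)
{
	return v->counter_width >= 64 ? ~0ULL : (1ULL << v->counter_width) - 1;
}

/* Stand-ins for RDMSR/WRMSR on a counter the guest owns outright. */
static uint64_t model_rdmsr_pmc(const struct vpmu_model *v, int idx)
{
	return v->pmc[idx];
}

static void model_wrmsr_pmc(struct vpmu_model *v, int idx, uint64_t val)
{
	v->pmc[idx] = val & counter_mask(v);
}

/*
 * Account one emulated instruction on counter idx. Assumes (without
 * checking) that the counter is enabled and counting the relevant event.
 */
static void account_emulated_instr(struct vpmu_model *v, int idx)
{
	assert(idx >= 0 && idx < MODEL_NUM_GP_COUNTERS);

	/* RDMSR + increment, truncated to the counter width. */
	uint64_t next = (model_rdmsr_pmc(v, idx) + 1) & counter_mask(v);
	model_wrmsr_pmc(v, idx, next); /* WRMSR */

	if (next == 0) { /* wrapped from all-ones: overflow */
		v->global_status |= 1ULL << idx;
		if (v->eventsel[idx] & EVENTSEL_INT)
			v->pmi_pending = true; /* conditional PMI synthesis */
	}
}
```

For example, with a 48-bit counter starting at 2^48 - 2 and INT set, the first call leaves it at 2^48 - 1 with no PMI; the second wraps it to 0, sets bit 0 of the modeled global status and marks a PMI pending. Nothing here touches a perf_event, which is the contrast with pausing and reprogramming one.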
Putting a high-level abstraction between the guest PMU and the host PMU
does not yield the most efficient implementation.

> > > Note, a similar idea was floated and rejected in the past[*], but that failed
> > > proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> > > which I agree would create an awful ABI for the host. If we make the "knob" a
> > > Kconfig
> >
> > Must not be Kconfig, distros would have no sane choice.
>
> Or not only a Kconfig? E.g. similar to how the kernel has
> CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS and nopku.
>
> > > or kernel param, i.e. require the platform owner to opt-out of using perf
> > > no later than at boot time, then I think we can provide a sane ABI, keep the
> > > implementation simple, all without breaking existing users that utilize perf in
> > > the host to profile guests.
> >
> > It's a shit choice to have to make. At the same time I'm not sure I have
> > a better proposal.
> >
> > It does mean a host cannot profile one guest and have pass-through on the
> > other. Eg. have a development and production guest on the same box. This
> > is pretty crap.
> >
> > Making it a guest-boot-option would allow that, but then the host gets
> > complicated again. I think I can make it trivially work for per-task
> > events, simply error the creation of events without exclude_guest for
> > affected vCPU tasks. But the CPU events are tricky.
> >
> > I will firmly reject anything that takes the PMU away from the host
> > entirely though.
>
> Why? What is so wrong with supporting use cases where the platform owner *wants*
> to give up host PMU and NMI watchdog functionality? If disabling host PMU usage
> were complex, highly invasive, and/or difficult to maintain, then I can understand
> the pushback.
>
> But if we simply allow hiding hardware PMU support, then isn't the cost to perf
> just a few lines in init_hw_perf_events()?
> And if we put a stake in the ground and say that exposing "advanced" PMU
> features to KVM guests requires a passthrough PMU, i.e. the PMU to be hidden
> from the host, that will significantly reduce our maintenance and complexity.
>
> The kernel allows disabling almost literally every other feature that is even
> remotely optional, I don't understand why the hardware PMU is special.
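Peter's per-task idea from earlier in the thread (fail the creation of events without exclude_guest for affected vCPU tasks) reduces to a small predicate. The sketch below is user-space C against the real UAPI `struct perf_event_attr` from `<linux/perf_event.h>`; `init_host_only_cycles()`, `check_event_for_task()`, the `target_is_passthrough_vcpu` flag and the choice of `-EINVAL` are hypothetical illustrations, not existing perf code. It does not attempt the CPU-event case that Peter calls tricky.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical helper: a cycles event that counts in the host, never in guest mode. */
static void init_host_only_cycles(struct perf_event_attr *attr)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->exclude_guest = 1;
}

/*
 * Hypothetical per-task policy: if the target task is a vCPU whose guest
 * owns the PMU, reject any event that would also count while the guest runs.
 */
static int check_event_for_task(const struct perf_event_attr *attr,
				bool target_is_passthrough_vcpu)
{
	if (target_is_passthrough_vcpu && !attr->exclude_guest)
		return -EINVAL; /* errno value is an assumption */
	return 0;
}
```

A host-only event built this way would pass the check for a passthrough vCPU task, while the same event with exclude_guest cleared would be refused; events targeting ordinary tasks are unaffected.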