From: Jim Mattson
Date: Thu, 10 Feb 2022 11:16:52 -0800
Subject: Re: [PATCH kvm/queue v2 2/3] perf: x86/core: Add interface to query perfmon_event_map[] directly
To: "Liang, Kan"
Cc: David Dunn, Dave Hansen, Peter Zijlstra, Like Xu, Paolo Bonzini,
    Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Like Xu,
    Stephane Eranian
References: <20220117085307.93030-1-likexu@tencent.com>
    <20220117085307.93030-3-likexu@tencent.com>
    <20220202144308.GB20638@worktop.programming.kicks-ass.net>
    <69c0fc41-a5bd-fea9-43f6-4724368baf66@intel.com>
    <67a731dd-53ba-0eb8-377f-9707e5c9be1b@intel.com>
    <7b5012d8-6ae1-7cde-a381-e82685dfed4f@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu,
Feb 10, 2022 at 10:30 AM Liang, Kan wrote:
>
> On 2/10/2022 11:34 AM, Jim Mattson wrote:
> > On Thu, Feb 10, 2022 at 7:34 AM Liang, Kan wrote:
> >>
> >> On 2/9/2022 2:24 PM, David Dunn wrote:
> >>> Dave,
> >>>
> >>> In my opinion, the right policy depends on what the host owner and
> >>> guest owner are trying to achieve.
> >>>
> >>> If the PMU is being used to locate places where performance could
> >>> be improved in the system, there are two sub-scenarios:
> >>> - The host and guest are owned by the same entity that is
> >>> optimizing the overall system. In this case, the guest doesn't need
> >>> PMU access, and better information is provided by profiling the
> >>> entire system from the host.
> >>> - The host and guest are owned by different entities. In this
> >>> case, profiling from the host can identify perf issues in the
> >>> guest. But what action can be taken? The host entity must
> >>> communicate issues back to the guest owner through some sort of
> >>> out-of-band information channel. On the other hand, preempting the
> >>> host PMU to give the guest a fully functional PMU serves this use
> >>> case well.
> >>>
> >>> TDX and SGX (outside of debug mode) strongly assume different
> >>> entities. And Intel is doing this to reduce insight of the host
> >>> into guest operations. So in my opinion, preemption makes sense.
> >>>
> >>> There are also scenarios where the host owner is trying to identify
> >>> system-wide impacts of guest actions. For example, detecting memory
> >>> bandwidth consumption or split locks. In this case, host control
> >>> without preemption is necessary.
> >>>
> >>> To address these various scenarios, it seems like the host needs to
> >>> be able to have policy control on whether it is willing to have the
> >>> PMU preempted by the guest.
> >>>
> >>> But I don't see what scenario is well served by the current
> >>> situation in KVM. Currently the guest will either be told it has no
> >>> PMU (which is fine) or that it has full control of a PMU. If the
> >>> guest is told it has full control of the PMU, it actually doesn't.
> >>> But instead of losing counters on well-defined events (from the
> >>> guest perspective), they simply stop counting depending on what the
> >>> host is doing with the PMU.
> >>
> >> For the current perf subsystem, a PMU should be shared among
> >> different users via the multiplexing mechanism if the resource is
> >> limited. No one has full control of a PMU for its lifetime. A user
> >> can only have the PMU for its given period. I think the user can
> >> understand how long it runs via total_time_enabled and
> >> total_time_running.
> >
> > For most clients, yes. For KVM, no. KVM currently tosses
> > total_time_enabled and total_time_running in the bitbucket. It could
> > extrapolate, but that would result in loss of precision. Some guest
> > uses of the PMU would not be able to cope (e.g.
> > https://github.com/rr-debugger/rr).
> >
> >> For a guest, it should rely on the host to tell whether the PMU
> >> resource is available. But unfortunately, I don't think we have such
> >> a notification mechanism in KVM. The guest has the wrong impression
> >> that it can have full control of the PMU.
> >
> > That is the only impression that the architectural specification
> > allows the guest to have. On Intel, we can mask off individual fixed
> > counters, and we can reduce the number of GP counters, but AMD offers
> > us no such freedom. Whatever resources we advertise to the guest must
> > be available for its use whenever it wants. Otherwise, PMU
> > virtualization is simply broken.
> >
> >> In my opinion, we should add the notification mechanism in KVM. When
> >> the PMU resource is limited, the guest can know whether it's
> >> multiplexing, or it can choose to reschedule the event.
> >
> > That sounds like a paravirtual perf mechanism, rather than PMU
> > virtualization. Are you suggesting that we not try to virtualize the
> > PMU? Unfortunately, PMU virtualization is what we have customers
> > clamoring for. No one is interested in a paravirtual perf mechanism.
> > For example, when will VTune in the guest know how to use your
> > proposed paravirtual interface?
>
> OK. If KVM cannot notify the guest, maybe the guest can query the
> usage of counters before using one. There is an IA32_PERF_GLOBAL_INUSE
> MSR, introduced with Arch Perfmon v4, which provides an "InUse" bit
> for each counter. But it cannot guarantee that the counter can always
> be owned by the guest unless the host treats the guest as a super-user
> and agrees not to touch its counter. This would only work on Intel
> platforms.

Simple question: do all existing guests (Windows and Linux are my
primary interest) query that MSR today? If not, then this proposal is
DOA.

> >> But it seems the notification mechanism may not work for the TDX
> >> case?
> >>>
> >>> On the other hand, if we flip it around, the semantics are clearer.
> >>> A guest will be told it has no PMU (which is fine) or that it has
> >>> full control of the PMU. If the guest is told that it has full
> >>> control of the PMU, it does. And the host (which is the thing that
> >>> granted the full PMU to the guest) knows that events inside the
> >>> guest are not being measured. This results in all entities seeing
> >>> something that can be reasoned about from their perspective.
> >>>
> >>
> >> I assume that this is for the TDX case (where the notification
> >> mechanism doesn't work). The host still controls all the PMU
> >> resources. The TDX guest is treated as a super-user who can 'own' a
> >> PMU. The admin on the host can configure/change the PMUs owned by
> >> the TDX guest. Personally, I think it makes sense. But please keep
> >> in mind that the counters are not identical. There are some special
> >> events that can only run on a specific counter. If the special
> >> counter is assigned to TDX, other entities can never run those
> >> events. We should let other entities know if that happens. Or we
> >> should never let non-host entities own the special counter.
> >
> > Right; the counters are not fungible. Ideally, when the guest
> > requests a particular counter, that is the counter it gets. If it is
> > given a different counter, the counter it is given must provide the
> > same behavior as the requested counter for the event in question.
>
> Ideally, yes, but sometimes KVM/the host may not know whether another
> counter can replace the requested one, because KVM/the host cannot
> retrieve the event-constraint information from the guest.

In that case, don't do it. When the guest asks for a specific counter,
give the guest that counter. This isn't rocket science.

> For example, we have the Precise Distribution (PDist) feature enabled
> only for GP counter 0 on SPR. Perf uses precise_level 3 (a SW
> variable) to indicate the feature. KVM/the host never knows whether
> the guest applies the PDist feature.
>
> I have a patch that forces the perf scheduler to start from the
> regular counters, which may mitigate the issue but cannot fix it. (I
> will post the patch separately.)
>
> Or we should never let the guest own the special counters. Although
> the guest then has to lose some special events, I guess the host may
> be more willing to let the guest own a regular counter.
>
> Thanks,
> Kan
>
> >
> >> Thanks,
> >> Kan
> >>
> >>> Thanks,
> >>>
> >>> Dave Dunn
> >>>
> >>> On Wed, Feb 9, 2022 at 10:57 AM Dave Hansen wrote:
> >>>
> >>>>> I was referring to gaps in the collection of data that the host
> >>>>> perf subsystem doesn't know about if ATTRIBUTES.PERFMON is set
> >>>>> for a TDX guest. This can potentially be a problem if someone is
> >>>>> trying to measure events per unit of time.
> >>>>
> >>>> Ahh, that makes sense.
> >>>>
> >>>> Does SGX cause a problem for these people? It can create some of
> >>>> the same collection gaps:
> >>>>
> >>>>     performance monitoring activities are suppressed when entering
> >>>>     an opt-out (of performance monitoring) enclave.