From: Mingwei Zhang
Date: Thu, 25 Apr 2024 13:16:40 -0700
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
To: "Liang, Kan"
Cc: "Mi, Dapeng", Sean Christopherson, maobibo, Xiong Zhang, pbonzini@redhat.com, peterz@infradead.org, kan.liang@intel.com, zhenyuw@linux.intel.com, jmattson@google.com, kvm@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org, zhiyuan.lv@intel.com, eranian@google.com, irogers@google.com, samantha.alt@intel.com, like.xu.linux@gmail.com, chao.gao@intel.com
In-Reply-To: <6af2da05-cb47-46f7-b129-08463bc9469b@linux.intel.com>
References: <1ec7a21c-71d0-4f3e-9fa3-3de8ca0f7315@linux.intel.com> <5279eabc-ca46-ee1b-b80d-9a511ba90a36@loongson.cn> <7834a811-4764-42aa-8198-55c4556d947b@linux.intel.com> <6af2da05-cb47-46f7-b129-08463bc9469b@linux.intel.com>

On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan wrote:
>
> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
> > On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng wrote:
> >>
> >> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
> >>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
> >>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
> >>>>>>> Maybe (just maybe) it is possible to do the PMU context switch at the vcpu
> >>>>>>> boundary normally, but do it at the VM Enter/Exit boundary when the host is
> >>>>>>> profiling the KVM kernel module. So, dynamically adjusting the PMU context
> >>>>>>> switch location could be an option.
> >>>>>> If there are two VMs with the PMU enabled in both, but the host PMU is not
> >>>>>> in use, the PMU context switch should be done in the vcpu thread sched-out path.
> >>>>>>
> >>>>>> If the host PMU is also used, we can choose whether the PMU switch should be
> >>>>>> done in the VM-exit path or the vcpu thread sched-out path.
> >>>>>>
> >>>>> The host PMU is always enabled, i.e., Linux currently does not support the
> >>>>> KVM PMU running standalone. I guess what you mean is that there are no
> >>>>> active perf_events on the host side. Allowing the PMU context switch to
> >>>>> drift from the VM-enter/exit boundary to the vcpu loop boundary by checking
> >>>>> host-side events might be a good option. We can keep discussing it, but I
> >>>>> won't propose that in v2.
> >>>> I doubt this deferring is really doable. It still makes the host lose most
> >>>> of its capability to profile KVM. Per my understanding, most of the KVM
> >>>> overhead happens in the vcpu loop, more precisely in VM-exit handling. We
> >>>> have no idea when the host wants to create a perf event to profile KVM; it
> >>>> could be at any time.
> >>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
> >>> state actually needs to be loaded, i.e. only when there are relevant host events.
> >>>
> >>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
> >>> KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
> >>> event exists (or comes along), then KVM context switches the PMU at VM-Enter/VM-Exit,
> >>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
> >>> for the guest while host perf events are active.
> >>
> >> I see. So KVM needs to provide a callback that is invoked from the IPI
> >> handler. The callback has to switch the PMU state before perf actually
> >> enables the host event and touches the PMU MSRs. And only perf events with
> >> the exclude_guest attribute are allowed to be created on the host. Thanks.
> >
> > Do we really need a KVM callback? I think that is one option.
> >
> > Immediately after VMEXIT, KVM will check whether there are "host perf
> > events". If so, do the PMU context switch immediately. Otherwise, keep
> > deferring the context switch to the end of the vPMU loop.
> >
> > Detecting whether there are "host perf events" would be interesting. The
> > "host perf events" refer to the perf_events on the host that are active,
> > assigned HW counters, and saved when context switching to the guest PMU. I
> > think getting those events could be done by fetching the bitmaps in cpuc.
>
> The cpuc is an arch-specific structure. I don't think it can be accessed
> from generic code. You would probably have to implement arch-specific
> functions to fetch the bitmaps. It probably isn't worth it.
>
> You may check the pinned_groups and flexible_groups to understand whether
> there are host perf events that may be scheduled at VM-exit. But that will
> not tell you the idx of the counters, which can only be known when the
> host event is really scheduled.
>
> > I have to look into the details. But at the time of VMEXIT, KVM should
> > already have that information, so it can immediately decide whether to do
> > the PMU context switch or not.
> >
> > Oh, but when control is executing within the run loop and host-level
> > profiling starts, say 'perf record -a ...', it will generate an IPI to all
> > CPUs. Maybe that's when we need a callback, so the KVM guest PMU context
> > gets preempted for the host-level profiling. Gah..
> >
> > Hmm, not a fan of that. That means the host can poke the guest PMU
> > context at any time and cause higher overhead. But I admit it is much
> > better than the current approach.
> >
> > The only thing is that any command like 'perf record/stat -a' fired from
> > some dark corner of the host can preempt the guest PMUs of _all_ running
> > VMs. So, to alleviate that, maybe a module parameter that disables this
> > "preemption" is possible? That should fit scenarios where we don't want
> > the guest PMU to be preempted outside of the vCPU loop.
> >
>
> It should not happen. In the current implementation, perf rejects the
> creation of any !exclude_guest system-wide event if a guest with the vPMU
> is running.
> However, it's possible to create an exclude_guest system-wide event at
> any time. KVM cannot use the information from VM-entry to decide whether
> there will be active perf events at VM-exit.

Hmm, why not? If there is any exclude_guest system-wide event,
perf_guest_enter() can return something to tell KVM "hey, some active
host events are swapped out; they were originally in counters #2 and
#3". At the time perf_guest_enter() returns, KVM can ack that and keep
it in its pmu data structure.

Now, when context switching back to the host right at VMEXIT, KVM will
check this data and see whether the host perf context has anything
active (of course, they are all exclude_guest events). If not, it defers
the context switch to the vcpu boundary. Otherwise, it does the proper
PMU context switch, respecting the occupied counter positions on the
host side, i.e., avoiding doubling the work on the KVM side.

Kan, any suggestions on the above approach? I totally understand that
there might be some difficulty, since the perf subsystem works in
several layers and fetching the low-level mapping is obviously
arch-specific work. If that is difficult, we can split the work into two
phases: 1) phase #1, just ask perf to tell KVM whether there are active
exclude_guest events swapped out; 2) phase #2, ask perf to report their
(low-level) counter indices.

Thanks.
-Mingwei

>
> perf_guest_exit() will reload the host state. It's impossible to save
> the guest state after that. We may need a KVM callback, so perf can tell
> KVM whether to save the guest state before perf reloads the host state.
>
> Thanks,
> Kan
> >>
> >>
> >>>
> >>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
> >
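
To make the two-phase proposal above concrete, the following is a minimal,
self-contained C model of the decision logic it implies. It is only a sketch
under assumed names: struct guest_enter_info, model_perf_guest_enter(), and
model_switch_pmu_at_vmexit() are hypothetical and are not the actual perf or
KVM interfaces from the RFC series; in the real kernel this logic would be
split between perf core and KVM arch code.

/*
 * Hypothetical model of the "phase #1"/"phase #2" idea above: at guest
 * entry, perf reports whether any active exclude_guest host events were
 * swapped out and (phase #2) which HW counters they occupied.  At
 * VM-exit, KVM uses that snapshot to decide whether to restore the host
 * PMU context immediately or defer the switch to the vcpu-loop boundary.
 * None of the names below are real kernel symbols.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct guest_enter_info {
	bool     host_events_swapped;   /* any active exclude_guest host events? */
	uint64_t host_counter_mask;     /* phase #2: HW counters they occupied   */
};

/* Stand-in for the information perf could hand to KVM on guest entry. */
static struct guest_enter_info model_perf_guest_enter(uint64_t active_host_counters)
{
	struct guest_enter_info info = {
		.host_events_swapped = active_host_counters != 0,
		.host_counter_mask   = active_host_counters,
	};
	return info;
}

/*
 * Decision KVM would make right after VM-exit: switch the PMU context now
 * (host profiling is waiting) or keep the guest PMU loaded until the vcpu
 * thread is scheduled out.
 */
static bool model_switch_pmu_at_vmexit(const struct guest_enter_info *info)
{
	return info->host_events_swapped;
}

int main(void)
{
	/* Case 1: no exclude_guest host events -> defer to the vcpu boundary. */
	struct guest_enter_info idle = model_perf_guest_enter(0);
	printf("no host events:   switch at VM-exit? %s\n",
	       model_switch_pmu_at_vmexit(&idle) ? "yes" : "no (defer)");

	/* Case 2: host events held counters #2 and #3 -> switch immediately. */
	struct guest_enter_info busy = model_perf_guest_enter((1ULL << 2) | (1ULL << 3));
	printf("host events busy: switch at VM-exit? %s (counter mask 0x%llx)\n",
	       model_switch_pmu_at_vmexit(&busy) ? "yes" : "no (defer)",
	       (unsigned long long)busy.host_counter_mask);
	return 0;
}

The trade-off the thread converges on is visible here: the cost of the
per-VM-exit switch is only paid while host_events_swapped is true, i.e.,
while exclude_guest host events actually exist.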