From: Mingwei Zhang
Date: Thu, 25 Apr 2024 20:12:01 -0700
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
To: "Mi, Dapeng"
Cc: Sean Christopherson, Kan Liang, maobibo, Xiong Zhang, pbonzini@redhat.com, peterz@infradead.org, kan.liang@intel.com, zhenyuw@linux.intel.com, jmattson@google.com, kvm@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org, zhiyuan.lv@intel.com, eranian@google.com, irogers@google.com, samantha.alt@intel.com, like.xu.linux@gmail.com, chao.gao@intel.com
In-Reply-To: <77913327-2115-42b5-850a-04ef0581faa7@linux.intel.com>
References: <7834a811-4764-42aa-8198-55c4556d947b@linux.intel.com> <6af2da05-cb47-46f7-b129-08463bc9469b@linux.intel.com> <42acf1fc-1603-4ac5-8a09-edae2d85963d@linux.intel.com> <77913327-2115-42b5-850a-04ef0581faa7@linux.intel.com>

On Thu, Apr 25, 2024 at 6:46 PM Mi, Dapeng wrote:
>
> On 4/26/2024 5:46 AM, Sean Christopherson wrote:
> > On Thu, Apr 25, 2024, Kan Liang wrote:
> >> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> >>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan wrote:
> >>>> It should not happen. In the current implementation, perf rejects all
> >>>> !exclude_guest system-wide event creation while a guest with a vPMU
> >>>> is running.
> >>>> However, it's possible to create an exclude_guest system-wide event at
> >>>> any time. KVM cannot use the information from the VM-entry to decide if
> >>>> there will be active perf events at the VM-exit.
> >>> Hmm, why not? If there is any exclude_guest system-wide event,
> >>> perf_guest_enter() can return something to tell KVM "hey, some active
> >>> host events are swapped out; they were originally in counters #2 and
> >>> #3".
> >>> If so, at the time when perf_guest_enter() returns, KVM will ack
> >>> that and keep it in its pmu data structure.
> >> I think it's possible that someone creates a !exclude_guest event after
> > I assume you mean an exclude_guest=1 event? Because perf should be in a
> > state where it rejects exclude_guest=0 events.
>
> Yes, it should be an exclude_guest=1 event; creating a perf event
> without the exclude_guest attribute will be blocked in the v2 patches
> we are working on.
>
> >> the perf_guest_enter(). The stale information is saved in KVM. Perf
> >> will schedule the event in the next perf_guest_exit(). KVM will not
> >> know it.
> > Ya, the creation of an event on a CPU that currently has guest PMU state
> > loaded is what I had in mind when I suggested a callback in my sketch:
> >
> >  : D. Add a perf callback that is invoked from IRQ context when perf wants to
> >  :    configure a new PMU-based event, *before* actually programming the MSRs,
> >  :    and have KVM's callback put the guest PMU state
>
> When the host creates a perf event with the exclude_guest attribute to
> profile KVM/VMM user space, the vCPU process could be at one of three
> places:
>
> 1. in guest state (non-root mode)
>
> 2. inside the vcpu-loop
>
> 3. outside the vcpu-loop
>
> Since the PMU state has already been switched to the host state in case
> 3, we only need to care about cases 1 and 2.
>
> When the host creates such an event, an IPI is triggered to eventually
> enable the perf event, as the following code shows:
>
> 	event_function_call(event, __perf_event_enable, NULL);
>
> For case 1, a vm-exit is triggered, KVM processes the vm-exit, and then
> the IPI irq handler runs, i.e., __perf_event_enable() enables the perf
> event.
>
> For case 2, the IPI irq handler preempts the vcpu-loop and calls
> __perf_event_enable() to enable the perf event.
>
> So IMO KVM just needs to provide a callback to switch the guest/host PMU
> state, and __perf_event_enable() calls this callback before really
> touching the PMU MSRs.

OK, in this case, do we still need KVM to query perf for active
exclude_guest events? Yes, because there is an ordering issue. The above
covers the case where host-level perf profiling starts while a VM is
already running: an IPI invokes the callback, preempting the guest, and
KVM switches the context from guest to host.

What if it is the other way around, i.e., host-level profiling runs
first and then the VM starts? In that case, just before entering the
vcpu loop, KVM should check whether there is an active host event and
record that in a pmu data structure. If there is none, do the context
switch early (saving KVM a huge number of unnecessary PMU context
switches later); otherwise, keep the host PMU context until vm-enter.
At vm-exit, do the check again using the recorded state: if there is an
active host event, switch to the host PMU context; otherwise defer the
switch until exiting the vcpu loop.

Of course, in the meantime, if any perf profiling starts and causes the
IPI, the irq handler calls the callback and preempts the guest PMU
context. If that happens, the PMU context switch at the vcpu-loop exit
boundary is skipped, since it has already been done. And since the irq
can arrive at any time, the PMU context switch at all four locations
needs to check the state flag (and skip the switch if it is not needed).
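In rough pseudo-C, the idea looks something like the sketch below. All
of the names here (struct kvm_pmu_ctx_state, vcpu_pmu_state(),
perf_has_active_exclude_guest_events(), kvm_pmu_load_guest_ctx(),
kvm_pmu_put_guest_ctx()) are invented for illustration; they are not
existing kernel or patch-series APIs:

/* Hypothetical state, for illustration only. */
struct kvm_pmu_ctx_state {
	bool guest_ctx_loaded;     /* analogous to TIF_NEED_FPU_LOAD */
	bool host_ctx_active;      /* phase #1: just a boolean */
	u64  host_counter_bitmap;  /* phase #2: occupied host counters */
};

/* Point 1: before entering the vcpu loop. */
static void kvm_pmu_vcpu_loop_enter(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu_ctx_state *s = vcpu_pmu_state(vcpu);

	s->host_ctx_active = perf_has_active_exclude_guest_events();
	if (!s->host_ctx_active)
		kvm_pmu_load_guest_ctx(vcpu);  /* sets guest_ctx_loaded */
	/* otherwise defer the switch to the vm-enter boundary */
}

/* Point at vm-exit. */
static void kvm_pmu_vm_exit(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu_ctx_state *s = vcpu_pmu_state(vcpu);

	/* The perf callback below may have put the guest context already. */
	if (s->guest_ctx_loaded && s->host_ctx_active)
		kvm_pmu_put_guest_ctx(vcpu);   /* clears guest_ctx_loaded */
	/* otherwise defer the switch to the vcpu-loop exit boundary */
}

/* Callback invoked from __perf_event_enable(), before it writes PMU MSRs. */
static void kvm_pmu_host_event_callback(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu_ctx_state *s = vcpu_pmu_state(vcpu);

	s->host_ctx_active = true;
	if (s->guest_ctx_loaded)
		kvm_pmu_put_guest_ctx(vcpu);
}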
So this requires vcpu->pmu to carry two pieces of state information:
1) a flag similar to TIF_NEED_FPU_LOAD, and 2) host perf context info
(in phase #1 just a boolean; in phase #2 a bitmap of occupied counters).

This is a non-trivial optimization of the PMU context switch. I am
thinking about splitting it into the following phases:

1) lazy PMU context switch, i.e., wait until the guest touches a PMU MSR
for the first time.

2) fast PMU context switch on the KVM side, i.e., KVM checks the event
selector values (enable/disable) and selectively switches PMU state
(reducing the number of MSR reads/writes).

3) dynamic PMU context-switch boundary, i.e., KVM dynamically chooses
the PMU context switch boundary depending on the existing active
host-level events.

3.1) more accurate dynamic PMU context switch, i.e., KVM checks the
host-level counter positions and further reduces the number of MSR
accesses.

4) guest PMU context preemption, i.e., any new host-level perf profiling
can immediately preempt the guest PMU in the vcpu loop (instead of
waiting for the next PMU context switch in KVM).

Thanks.
-Mingwei

> >
> > It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common
> > chunk of kernel code swapping out the guest state (kernel_fpu_begin()),
> > it's a callback into KVM.
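For concreteness, a similarly hypothetical sketch of the phase-1 lazy
PMU context switch mentioned above, reusing the invented names from the
earlier sketch (again, none of this is actual patch-series code): the
guest PMU state is loaded only on the guest's first PMU MSR access, in
the same spirit as the TIF_NEED_FPU_LOAD deferral of the FPU restore.
kvm_pmu_clear_msr_intercepts() and kvm_pmu_emulate_msr_access() are
likewise made-up helpers:

static int kvm_pmu_handle_msr_access(struct kvm_vcpu *vcpu, u32 msr)
{
	struct kvm_pmu_ctx_state *s = vcpu_pmu_state(vcpu);

	if (!s->guest_ctx_loaded) {
		/* First PMU MSR access since vm-enter: load guest state. */
		kvm_pmu_load_guest_ctx(vcpu);
		/*
		 * Stop intercepting guest PMU MSR accesses until the
		 * guest context is put again, so subsequent accesses
		 * run at native speed with no vm-exits.
		 */
		kvm_pmu_clear_msr_intercepts(vcpu);
	}

	/* Complete the access that triggered this intercept. */
	return kvm_pmu_emulate_msr_access(vcpu, msr);
}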