Date: Wed, 4 Oct 2023 14:50:46 -0700
References: <20231002115718.GB13957@noisy.programming.kicks-ass.net>
            <20231002204017.GB27267@noisy.programming.kicks-ass.net>
            <20231003081616.GE27267@noisy.programming.kicks-ass.net>
            <20231004112152.GA5947@noisy.programming.kicks-ass.net>
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
From: Sean Christopherson
To: Mingwei Zhang
Cc: Peter Zijlstra, Ingo Molnar, Dapeng Mi, Paolo Bonzini, Arnaldo Carvalho de Melo,
    Kan Liang, Like Xu, Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
    Ian Rogers, Adrian Hunter, kvm@vger.kernel.org, linux-perf-users@vger.kernel.org,
    linux-kernel@vger.kernel.org, Zhenyu Wang, Zhang Xiong, Lv Zhiyuan, Yang Weijiang,
    Dapeng Mi, Jim Mattson, David Dunn, Thomas Gleixner

On Wed, Oct 04, 2023, Mingwei Zhang wrote:
> On Wed, Oct 4, 2023 at 4:22 AM Peter Zijlstra wrote:
> > > > Or are you talking about the whole of vcpu_run() ? That seems like a
> > > > massive amount of code, and doesn't look like anything I'd call a
> > > > fast-path. Also, much of that loop has preemption enabled...
> > >
> > > The whole of vcpu_run().  And yes, much of it runs with preemption enabled.  KVM
> > > uses preempt notifiers to context switch state if the vCPU task is scheduled
> > > out/in, we'd use those hooks to swap PMU state.
> > >
> > > Jumping back to the exception analogy, not all exits are equal.  For "simple" exits
> > > that KVM can handle internally, the roundtrip is <1500.  The exit_fastpath loop is
> > > roughly half that.
> > >
> > > But for exits that are more complex, e.g. if the guest hits the equivalent of a
> > > page fault, the cost of handling the page fault can vary significantly.  It might
> > > be <1500, but it might also be 10x that if handling the page fault requires
> > > faulting in a new page in the host.
> > >
> > > We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> > > because doing too much work with IRQs disabled can cause latency problems for the
> > > host.  This isn't much of a concern for slice-of-hardware setups, but would be
> > > quite problematic for other use cases.
> > >
> > > And except for obviously slow paths (from the guest's perspective), extra latency
> > > on any exit can be problematic.  E.g. even if we got to the point where KVM handles
> > > 99% of exits in the fastpath (may or may not be feasible), a not-fastpath exit at
> > > an inopportune time could throw off the guest's profiling results, introduce
> > > unacceptable jitter, etc.
> >
> > I'm confused... the PMU must not be running after vm-exit. It must not
> > be able to profile the host. So what jitter are you talking about?
> >
> > Even if we persist the MSR contents, the PMU itself must be disabled on
> > vm-exit and enabled on vm-enter. If not by hardware then by software
> > poking at the global ctrl msr.
> >
> > I also don't buy the latency argument, we already do full and complete
> > PMU rewrites with IRQs disabled in the context switch path. And as
> > mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
> > save/restore.
> >
> > I would much prefer we keep the PMU swizzle inside the IRQ disabled
> > region of vcpu_enter_guest(). That's already a ton better than you have
> > today.

...

> Peter, that latency argument in pass-through implementation is
> something that we hope you could buy. This should be relatively easy
> to prove. I can provide some data if you need.

You and Peter are talking about two different latencies.  Or rather, about how
the latency impacts two different things.

Peter is talking about the latency impact on the host, specifically how much
work is done with IRQs disabled.

You are talking about the latency impact on the guest, i.e. how much guest
performance is affected if KVM swaps MSRs on every exit.

Peter is contending that swapping PMU MSRs with IRQs disabled isn't a big deal,
because the kernel already does as much during a context switch.  I agree, *if*
we're talking about only adding the PMU MSRs.

You (and I) are contending that the latency impact on the guest will be too high
if KVM swaps in the inner VM-Exit loop.  This is not analogous to host context
switches, as VM-Exits can occur at a much higher frequency than context switches,
and can be triggered by events that have nothing to do with the guest.

There's some confusion here though because of what I said earlier:

  We don't want to get too aggressive with moving stuff into the exit_fastpath
  loop, because doing too much work with IRQs disabled can cause latency problems
  for the host.

By "stuff" I wasn't talking about PMU MSRs, I was referring to all exit handling
that KVM *could* move into the IRQs disabled section in order to mitigate the
concerns that we have about the latency impacts on the guest.  E.g. if most exits
are handled in the IRQs disabled section, then KVM could handle most exits without
swapping PMU state and thus limit the impact on guest performance, and not cause
too much host perf "downtime" that I mentioned in the other thread[*].

However, my concern is that handling most exits with IRQs disabled would result
in KVM doing too much work with IRQs disabled, i.e. would impact the host latency
that Peter is talking about.  And I'm more than a bit terrified of calling into
code that doesn't expect to be called with IRQs disabled.

Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
register swapping?

  A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled

  B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
     are enabled, *if and only if* the current CPU has one or more perf events
     that want to use the hardware PMU

  C. Put guest PMU state at vcpu_put()

  D. Add a perf callback that is invoked from IRQ context when perf wants to
     configure a new PMU-based event, *before* actually programming the MSRs,
     and have KVM's callback put the guest PMU state

If there are host perf events that want to use the PMU, then KVM will swap fairly
aggressively and the "downtime" of the host perf events will be limited to the
small window around VM-Enter/VM-Exit.

If there are no such host events, KVM will swap on the first entry to the guest,
and keep the guest PMU loaded until the vCPU is put.
The perf callback in (D) would allow perf to program system-wide events on all
CPUs without clobbering guest PMU state.

I think that would make everyone happy.  As long as our hosts don't create perf
events, then we get the "swap as little as possible" behavior without significantly
impacting the host's ability to utilize perf.  If our host screws up and creates
perf events on CPUs that are running vCPUs, then the degraded vCPU performance is
on us.

Rough sketch below, minus the perf callback or any of the actual swapping logic
(an illustrative sketch of the callback side follows after the diff).

[*] https://lore.kernel.org/all/ZR3Ohk50rSofAnSL@google.com

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..86699d310224 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,32 @@ struct kvm_pmu_ops {
 
 void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
 
+static inline void kvm_load_guest_pmu(struct kvm_vcpu *vcpu)
+{
+        struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+        lockdep_assert_irqs_disabled();
+
+        if (pmu->guest_state_loaded)
+                return;
+
+
+        pmu->guest_state_loaded = true;
+}
+
+static inline void kvm_put_guest_pmu(struct kvm_vcpu *vcpu)
+{
+        struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+        lockdep_assert_irqs_disabled();
+
+        if (!pmu->guest_state_loaded)
+                return;
+
+
+        pmu->guest_state_loaded = false;
+}
+
 static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
 {
         /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e645f5b1e2c..93a8f268c37b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4903,8 +4903,20 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+        unsigned long flags;
         int idx;
 
+        /*
+         * This can get false positives, but not false negatives, i.e. KVM will
+         * never fail to put the PMU, but may unnecessarily disable IRQs to
+         * safely check if the PMU is still loaded.
+         */
+        if (kvm_is_guest_pmu_loaded(vcpu)) {
+                local_irq_save(flags);
+                kvm_put_guest_pmu(vcpu);
+                local_irq_restore(flags);
+        }
+
         if (vcpu->preempted) {
                 if (!vcpu->arch.guest_state_protected)
                         vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
@@ -10759,6 +10771,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
                 set_debugreg(0, 7);
         }
 
+        kvm_load_guest_pmu(vcpu);
+
         guest_timing_enter_irqoff();
 
         for (;;) {
@@ -10810,6 +10824,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
         if (hw_breakpoint_active())
                 hw_breakpoint_restore();
 
+        if (perf_has_hw_events())
+                kvm_put_guest_pmu(vcpu);
+
         vcpu->arch.last_vmentry_cpu = vcpu->cpu;
         vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 
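
And, purely as an illustration of the KVM side of the callback in (D), something
along these lines.  The hook itself is hypothetical: it assumes perf grows a new
guest callback (e.g. a new member in struct perf_guest_info_callbacks, registered
alongside KVM's existing callbacks) that perf invokes with IRQs disabled *before*
programming a new hardware event on the current CPU.  kvm_put_guest_pmu() is the
stub from the sketch above, and perf_has_hw_events() in the diff is likewise a
new perf helper that would need to be added.

/*
 * Hypothetical hook: called by perf, with IRQs disabled, right before perf
 * programs a new PMU-based event on this CPU, so that KVM can put the guest
 * PMU state before perf touches the hardware.  The guest state is reloaded
 * by kvm_load_guest_pmu() on the next entry to the guest.
 */
static void kvm_perf_wants_hw_pmu(void)
{
        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

        /* No vCPU is loaded on this CPU, so there's no guest PMU state to put. */
        if (!vcpu)
                return;

        kvm_put_guest_pmu(vcpu);
}

That would cover host events created while a vCPU is running, i.e. while the
guest PMU state is still loaded, and complements the exit-time
perf_has_hw_events() check in the diff above.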