Date: Tue, 3 Oct 2023 08:23:26 -0700
In-Reply-To: <20231003081616.GE27267@noisy.programming.kicks-ass.net>
References: <20230927113312.GD21810@noisy.programming.kicks-ass.net>
 <20230929115344.GE6282@noisy.programming.kicks-ass.net>
 <20231002115718.GB13957@noisy.programming.kicks-ass.net>
 <20231002204017.GB27267@noisy.programming.kicks-ass.net>
 <20231003081616.GE27267@noisy.programming.kicks-ass.net>
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
From: Sean Christopherson
To: Peter Zijlstra
Cc: Ingo Molnar, Dapeng Mi, Paolo Bonzini, Arnaldo Carvalho de Melo, Kan Liang,
 Like Xu, Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
 Adrian Hunter, kvm@vger.kernel.org, linux-perf-users@vger.kernel.org,
 linux-kernel@vger.kernel.org, Zhenyu Wang, Zhang Xiong, Lv Zhiyuan,
 Yang Weijiang, Dapeng Mi, Jim Mattson, David Dunn, Mingwei Zhang,
 Thomas Gleixner

On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> > On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> > > I'm not sure what you're suggesting here. It will have to save/restore
> > > all those MSRs anyway. Suppose it switches between vCPUs.
> >
> > The "when" is what's important. If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> > <1500 cycles, and "fastpath" exits are something like half that. Switching all
> > the MSRs is likely 1000+ cycles, if not double that.
>
> See, you're the virt-nerd and I'm sure you know what you're talking
> about, but I have no clue :-) I didn't know there were different levels
> of vm-exit.

An exit is essentially a fancy exception/event. The hardware transition from
guest=>host is the exception itself (VM-Exit), and the transition back to the
guest is analogous to the IRET (VM-Enter). In between, software will do some
amount of work, and the amount of work that is done can vary quite
significantly depending on what caused the exit.

> > FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> > vCPU is pinned 1:1 with a host pCPU.
>
> I've been given to understand that vm-exit is a bad word in this
> scenario, any exit is a fail. They get MWAIT and all the other crap and
> more or less pretend to be real hardware.
>
> So why do you care about those MSRs so much? That should 'never' happen
> in this scenario.

It's not feasible to completely avoid exits, as current/upcoming hardware
doesn't (yet) virtualize a few important things. Off the top of my head, the
two most relevant flows are:

 - APIC_LVTPC entry and PMU counters. If a PMU counter overflows, the NMI
   that is generated will trigger a hardware-level NMI and cause an exit.
   And sadly, the guest's NMI handler (assuming the guest is also using NMIs
   for PMIs) will trigger another exit when it clears the mask bit in its
   LVTPC entry.

 - Timer related IRQs, both in the guest and host. These are the biggest
   source of exits on modern hardware. Neither AMD nor Intel provide a
   virtual APIC timer, and so KVM must trap and emulate writes to
   TSC_DEADLINE (or to APIC_TMICT), and the subsequent IRQ will also cause
   an exit.

The cumulative cost of all exits is important, but the latency of each
individual exit is even more critical, especially for PMU related stuff.
E.g. if the guest is trying to use perf/PMU to profile a workload, adding a
few thousand cycles to each exit will introduce too much noise into the
results.

> > > > Or at least, that was my reading of things. Maybe it was just a
> > > > misunderstanding because we didn't do a good job of defining the behavior.
> > >
> > > This might be the case. I don't particularly care where the guest
> > > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > > PMU is usable again etc..
> >
> > Well drat, that there would have saved a wee bit of frustration. Better late
> > than never though, that's for sure.
> >
> > Just to double confirm: keeping guest PMU state loaded until the vCPU is
> > scheduled out or KVM exits to userspace, would mean that host perf events
> > won't be active for potentially large swaths of non-KVM code. Any function
> > calls or event/exception handlers that occur within the context of
> > ioctl(KVM_RUN) would run with host perf events disabled.
>
> Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> sounds like a ton more.
>
> /me frobs around the kvm code some...
>
> Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> seems to run with IRQs disabled, so at most you can trigger a #PF or
> something, which will then trip an exception fixup because you can't run
> #PF with IRQs disabled etc..
>
> That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> reasonable solution from where I'm sitting. That also more or less
> matches the FPU state save/restore AFAICT.
>
> Or are you talking about the whole of vcpu_run() ? That seems like a
> massive amount of code, and doesn't look like anything I'd call a
> fast-path. Also, much of that loop has preemption enabled...

The whole of vcpu_run(). And yes, much of it runs with preemption enabled.
KVM uses preempt notifiers to context switch state if the vCPU task is
scheduled out/in; we'd use those hooks to swap PMU state.
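Very roughly, something like the below is what I have in mind (hand-wavy
sketch only, not a patch: the kvm_sched_{in,out}() preempt notifier hooks in
virt/kvm/kvm_main.c are heavily simplified here, and kvm_pmu_load_guest_context()
and kvm_pmu_put_guest_context() are invented placeholder names for the state
swap, they don't exist today):

  static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
  {
  	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

  	kvm_arch_vcpu_load(vcpu, cpu);

  	/* Reload guest PMU state; host perf events stay quiesced. */
  	kvm_pmu_load_guest_context(vcpu);
  }

  static void kvm_sched_out(struct preempt_notifier *pn,
  			    struct task_struct *next)
  {
  	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

  	/* Give the PMU back to the host before the vCPU task is switched out. */
  	kvm_pmu_put_guest_context(vcpu);

  	kvm_arch_vcpu_put(vcpu);
  }

The key point being that the swap would happen only when the vCPU task is
actually scheduled out/in (or KVM exits to userspace), not on every VM-Exit.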
Jumping back to the exception analogy, not all exits are equal. For "simple"
exits that KVM can handle internally, the roundtrip is <1500 cycles. The
exit_fastpath loop is roughly half that. But for exits that are more complex,
e.g. if the guest hits the equivalent of a page fault, the cost of handling
the page fault can vary significantly. It might be <1500 cycles, but it might
also be 10x that if handling the page fault requires faulting in a new page
in the host.

We don't want to get too aggressive with moving stuff into the exit_fastpath
loop, because doing too much work with IRQs disabled can cause latency
problems for the host. This isn't much of a concern for slice-of-hardware
setups, but would be quite problematic for other use cases.

And except for obviously slow paths (from the guest's perspective), extra
latency on any exit can be problematic. E.g. even if we got to the point
where KVM handles 99% of exits in the fastpath (which may or may not be
feasible), a not-fastpath exit at an inopportune time could throw off the
guest's profiling results, introduce unacceptable jitter, etc.

> > Are you ok with that approach? Assuming we don't completely botch things, the
> > interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
>
> Since you steal the whole PMU, can't you re-route the PMI to something
> that's virt friendly too?

Hmm, actually, we probably could. It would require modifying the host's
APIC_LVTPC entry when context switching the PMU, e.g. to replace the NMI with
a dedicated IRQ vector. As gross as that sounds, it might actually be cleaner
overall than deciphering whether an NMI belongs to the host or guest, and it
would almost certainly yield lower latency for guest PMIs.
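E.g. something like this in the hypothetical context switch helpers from the
earlier sketch (again completely untested and purely illustrative;
KVM_GUEST_PMI_VECTOR doesn't exist, and the actual save/restore of the PMU
MSRs is elided):

  static void kvm_pmu_load_guest_context(struct kvm_vcpu *vcpu)
  {
  	/* ... load guest PMU MSRs, counters, global ctrl, etc. ... */

  	/* Deliver PMIs on a dedicated vector instead of as NMIs. */
  	apic_write(APIC_LVTPC, KVM_GUEST_PMI_VECTOR);
  }

  static void kvm_pmu_put_guest_context(struct kvm_vcpu *vcpu)
  {
  	/* ... save guest PMU state and restore the host's ... */

  	/* Hand the PMU back to perf; host PMIs are NMIs again. */
  	apic_write(APIC_LVTPC, APIC_DM_NMI);
  }

The choice of vector and the interaction with perf's existing NMI handler
registration would obviously need more thought.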