From: Jim Mattson
Date: Tue, 3 Oct 2023 11:21:46 -0700
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
To: Sean Christopherson
Cc: Peter Zijlstra, Ingo Molnar, Dapeng Mi, Paolo Bonzini,
 Arnaldo Carvalho de Melo, Kan Liang, Like Xu, Mark Rutland,
 Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
 kvm@vger.kernel.org, linux-perf-users@vger.kernel.org,
 linux-kernel@vger.kernel.org, Zhenyu Wang, Zhang Xiong, Lv Zhiyuan,
 Yang Weijiang, Dapeng Mi, David Dunn, Mingwei Zhang, Thomas Gleixner

On Tue, Oct 3, 2023 at 8:23 AM Sean Christopherson wrote:
>
> On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> > > On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> > > > I'm not sure what you're suggesting here. It will have to save/restore
> > > > all those MSRs anyway. Suppose it switches between vCPUs.
> > >
> > > The "when" is what's important.  If KVM took a literal interpretation of
> > > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > > reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
> > > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > > VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
> > > <1500 cycles, and "fastpath" exits are something like half that.  Switching all
> > > the MSRs is likely 1000+ cycles, if not double that.
> >
> > See, you're the virt-nerd and I'm sure you know what you're talking
> > about, but I have no clue :-) I didn't know there were different levels
> > of vm-exit.
>
> An exit is essentially a fancy exception/event.  The hardware transition from
> guest=>host is the exception itself (VM-Exit), and the transition back to guest
> is analogous to the IRET (VM-Enter).
>
> In between, software will do some amount of work, and the amount of work that is
> done can vary quite significantly depending on what caused the exit.
>
> > > FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> > > vCPU is pinned 1:1 with a host pCPU.
> >
> > I've been given to understand that vm-exit is a bad word in this
> > scenario, any exit is a fail.  They get MWAIT and all the other crap and
> > more or less pretend to be real hardware.
> >
> > So why do you care about those MSRs so much?  That should 'never' happen
> > in this scenario.
>
> It's not feasible to completely avoid exits, as current/upcoming hardware doesn't
> (yet) virtualize a few important things.  Off the top of my head, the two most
> relevant flows are:
>
>  - APIC_LVTPC entry and PMU counters.  If a PMU counter overflows, the NMI that
>    is generated will trigger a hardware level NMI and cause an exit.  And sadly,
>    the guest's NMI handler (assuming the guest is also using NMIs for PMIs) will
>    trigger another exit when it clears the mask bit in its LVTPC entry.

In addition, when the guest PMI handler writes to IA32_PERF_GLOBAL_CTRL
to disable all counters (and again later to re-enable the counters), KVM
has to intercept that as well, with today's implementation. Similarly,
on each guest timer tick, when guest perf is multiplexing PMCs, KVM has
to intercept writes to IA32_PERF_GLOBAL_CTRL. Furthermore, in some
cases, Linux perf seems to double-disable counters, using both the
individual enable bits in each PerfEvtSel, as well as the bits in
PERF_GLOBAL_CTRL.
KVM has to intercept writes to the PerfEvtSels as well.

Off-topic, but I'd like to request that Linux perf *only* use the enable
bits in IA32_PERF_GLOBAL_CTRL on architectures where that is supported.
Just leave the enable bits set in the PerfEvtSels, to avoid unnecessary
VM-exits. :)

>  - Timer related IRQs, both in the guest and host.  These are the biggest source
>    of exits on modern hardware.  Neither AMD nor Intel provide a virtual APIC
>    timer, and so KVM must trap and emulate writes to TSC_DEADLINE (or to APIC_TMICT),
>    and the subsequent IRQ will also cause an exit.
>
> The cumulative cost of all exits is important, but the latency of each individual
> exit is even more critical, especially for PMU related stuff.  E.g. if the guest
> is trying to use perf/PMU to profile a workload, adding a few thousand cycles to
> each exit will introduce too much noise into the results.
>
> > > > > Or at least, that was my reading of things.  Maybe it was just a
> > > > > misunderstanding because we didn't do a good job of defining the behavior.
> > > >
> > > > This might be the case. I don't particularly care where the guest
> > > > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > > > PMU is usable again etc..
> > >
> > > Well drat, that there would have saved a wee bit of frustration.  Better late
> > > than never though, that's for sure.
> > >
> > > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > > out or KVM exits to userspace, would mean that host perf events won't be active
> > > for potentially large swaths of non-KVM code.  Any function calls or event/exception
> > > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > > perf events disabled.
> >
> > Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> > sounds like a ton more.
> >
> > /me frobs around the kvm code some...
> >
> > Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> > seems to run with IRQs disabled, so at most you can trigger a #PF or
> > something, which will then trip an exception fixup because you can't run
> > #PF with IRQs disabled etc..
> >
> > That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> > coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> > reasonable solution from where I'm sitting. That also more or less
> > matches the FPU state save/restore AFAICT.
> >
> > Or are you talking about the whole of vcpu_run() ? That seems like a
> > massive amount of code, and doesn't look like anything I'd call a
> > fast-path. Also, much of that loop has preemption enabled...
>
> The whole of vcpu_run().  And yes, much of it runs with preemption enabled.  KVM
> uses preempt notifiers to context switch state if the vCPU task is scheduled
> out/in, we'd use those hooks to swap PMU state.
>
> Jumping back to the exception analogy, not all exits are equal.  For "simple" exits
> that KVM can handle internally, the roundtrip is <1500.  The exit_fastpath loop is
> roughly half that.
>
> But for exits that are more complex, e.g. if the guest hits the equivalent of a
> page fault, the cost of handling the page fault can vary significantly.  It might
> be <1500, but it might also be 10x that if handling the page fault requires faulting
> in a new page in the host.
>
> We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> because doing too much work with IRQs disabled can cause latency problems for the
> host.  This isn't much of a concern for slice-of-hardware setups, but would be
> quite problematic for other use cases.
>
> And except for obviously slow paths (from the guest's perspective), extra latency
> on any exit can be problematic.  E.g.
> even if we got to the point where KVM handles
> 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> inopportune time could throw off the guest's profiling results, introduce unacceptable
> jitter, etc.
>
> > > Are you ok with that approach?  Assuming we don't completely botch things, the
> > > interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
> >
> > Since you steal the whole PMU, can't you re-route the PMI to something
> > that's virt friendly too?
>
> Hmm, actually, we probably could.  It would require modifying the host's APIC_LVTPC
> entry when context switching the PMU, e.g. to replace the NMI with a dedicated IRQ
> vector.  As gross as that sounds, it might actually be cleaner overall than
> deciphering whether an NMI belongs to the host or guest, and it would almost
> certainly yield lower latency for guest PMIs.

Ugh. Can't KVM just install its own NMI handler?

Either way, it's possible for late PMIs to arrive in the wrong context.