Date: Fri, 15 Dec 2023 14:01:14 -0800
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
From: Sean Christopherson
To: Joel Fernandes
Cc: Vineeth Remanan Pillai, Ben Segall, Borislav Petkov,
    Daniel Bristot de Oliveira, Dave Hansen, Dietmar Eggemann,
    H. Peter Anvin, Ingo Molnar, Juri Lelli, Mel Gorman, Paolo Bonzini,
    Andy Lutomirski, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
    Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li,
    Suleiman Souhlal, Masami Hiramatsu, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, x86@kernel.org, Tejun Heo, Josh Don,
    Barret Rhoden, David Vernet
References: <20231214024727.3503870-1-vineeth@bitbyteword.org>

On Fri, Dec 15, 2023, Joel Fernandes wrote:
> On Fri, Dec 15, 2023 at 11:38 AM Sean Christopherson wrote:
> >
> > On Fri, Dec 15, 2023, Joel Fernandes wrote:
> > > Hi Sean,
> > > Nice to see your quick response to the RFC, thanks. I wanted to
> > > clarify some points below:
> > >
> > > On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson wrote:
> > > >
> > > > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson wrote:
> > > > > Now when I think about it, the implementation seems to
> > > > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > > > - guest scheduler communicates the priority requirements of the workload
> > > > > - kvm applies the priority to the vcpu task.
> > > >
> > > > Why?  Tasks are tasks, why does KVM need to get involved?  E.g. if
> > > > the problem is that userspace doesn't have the right knobs to adjust
> > > > the priority of a task quickly and efficiently, then wouldn't it be
> > > > better to solve that problem in a generic way?
> > >
> > > No, it is not only about tasks. We are boosting anything RT or above,
> > > such as softirq, irq, etc. as well.
> >
> > I was talking about the host side of things.  A vCPU is a task, full
> > stop.  KVM *may* have some information that is useful to the scheduler,
> > but KVM does not *need* to initiate adjustments to a vCPU's priority.
>
> Sorry, I thought you were referring to guest tasks. You are right, KVM
> does not *need* to change priority. But a vCPU is a container of tasks
> whose priority dynamically changes. Still, I see your point of view,
> and that's also why we offer the capability to be selectively enabled
> or disabled per-guest by the VMM (Vineeth will make it default off and
> opt-in in the next series).
>
> > > Could you please see the other patches?
> >
> > I already have, see my comments about boosting vCPUs that have received
> > NMIs and IRQs not necessarily being desirable.
>
> Ah, I was not on CC for that email. Seeing it now. I think I don't
> fully buy that argument; hard IRQs are always high priority IMHO.

They most definitely are not, and there are undoubtedly tiers of priority,
e.g. tiers are part and parcel of the APIC architecture.  I agree that *most*
IRQs are high-ish priority, but that is not the same as saying *all* IRQs are
high priority.  It only takes one example to disprove the latter, and I can
think of several off the top of my head.
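For reference, fixed-interrupt priority on x86 falls straight out of the
vector number.  A rough sketch of the architectural rule (mine, not code from
this series):

	/*
	 * Rough sketch, not from the patches: for fixed interrupts, the
	 * local APIC takes the priority class from bits 7:4 of the vector,
	 * so e.g. vector 0x22 is architecturally outranked by vector 0xf2,
	 * and classes at or below the current processor priority are held
	 * pending.
	 */
	static inline unsigned int apic_priority_class(unsigned int vector)
	{
		return vector >> 4;
	}

So "it's a hard IRQ" by itself says very little about urgency.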
Nested virtualization is the easy one to demonstrate.

On AMD, which doesn't have an equivalent to the VMX preemption timer, KVM uses
a self-IPI to wrest control back from the guest immediately after VMRUN in
some situations (mostly to inject events into L2 while honoring the
architectural priority of events).  If the guest is running a nested workload,
the resulting IRQ in L1 is not at all interesting or high priority, as the L2
workload hasn't suddenly become high priority just because KVM wants to
inject an event.
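Roughly, the self-IPI trick amounts to this (a paraphrased sketch with an
illustrative name, not a verbatim quote of KVM's code):

	/*
	 * Paraphrased sketch: with host IRQs disabled around VMRUN, a
	 * reschedule IPI sent to the current pCPU stays pending across
	 * the entry and forces a VM-Exit as soon as the guest resumes.
	 */
	static void force_immediate_exit(struct kvm_vcpu *vcpu)
	{
		smp_send_reschedule(vcpu->cpu);
	}

The resulting host IRQ exists purely for KVM's own bookkeeping; boosting the
vCPU because of it would be boosting on noise.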
Anyways, I didn't mean to start a debate over the priority of handling IRQs
and NMIs, quite the opposite actually.  The point I'm trying to make is that
under no circumstance do I want KVM to be making decisions about whether or
not such things are high priority.  I have no objection to KVM making
information available to whatever entity is making the actual decisions; it's
having policy in KVM that I am staunchly opposed to.

> If an hrtimer expires on a CPU running a low priority workload, that
> hrtimer might itself wake up a high priority thread. If we don't boost
> the hrtimer interrupt handler, then that will delay the wakeup as
> well. It is always a chain of events and it has to be boosted from the
> first event. If a system does not wish to give an interrupt a high
> priority, then the typical way is to use threaded IRQs and lower the
> priority of the thread. That will give the interrupt handler lower
> priority and the guest is free to do that. We had many POCs before
> where we don't boost at all for interrupts and they all fall apart.
> This is the only POC that works without any issues as far as we know
> (we've been trying to do this for a long time :P).

In *your* environment.  The fact that it took multiple months to get a
stable, functional set of patches for one use case is *exactly* why I am
pushing back on this.  Change any number of things about the setup and odds
are good that the result would need different tuning.  E.g. the ratio of
vCPUs to pCPUs, the number of VMs, the number of vCPUs per VM, whether or not
the host kernel is preemptible, whether or not the guest kernel is
preemptible, the tick rate of the host and guest kernels, the workload of the
VM, the affinity of tasks within the VM, and so on and so forth.

It's a catch-22 of sorts.  Anything that is generic enough to land upstream
is likely going to be too coarse grained to be universally applicable.

> Regarding perf, I similarly disagree. I think a PMU event is super
> important (for example, some versions of the kernel watchdog that depend
> on the PMU fail without it). But if some VM does not want this to be
> prioritized, they could just not opt in for the feature IMO. I can see
> your point of view that not all VMs may want this behavior though.

Or a VM may want it conditionally, e.g. only for select tasks.

> > > At the moment, the only ABI is a shared memory structure and a custom
> > > MSR. This is no different from the existing steal time accounting
> > > where a shared structure is similarly shared between host and guest,
> > > we could perhaps augment that structure with other fields instead of
> > > adding a new one?
> >
> > I'm not concerned about the number of structures/fields, it's the
> > amount/type of information and the behavior of KVM that is problematic.
> > E.g. boosting the priority of a vCPU that has a pending NMI is dubious.
>
> I think NMIs have to be treated as high priority; the perf profiling
> interrupt, for instance, works well on x86 (unlike ARM) because it can
> interrupt any context (other than NMI and possibly the machine check
> ones). On ARM, on the other hand, because the perf interrupt is a
> regular non-NMI interrupt, you cannot profile hardirq and IRQ-disabled
> regions (this could have changed since the pseudo-NMI features). So
> making the NMI a higher priority than IRQ is not dubious AFAICS; it is
> a requirement in many cases IMHO.

Again, many, but not all.  A large part of KVM's success is that KVM has very
few "opinions" of its own.  Outside of the MMU and a few paravirt paths, KVM
mostly just emulates/virtualizes hardware according to the desires of
userspace.  This has allowed a fairly large variety of use cases to spring up
with relatively few changes to KVM.

What I want to avoid is (a) adding something that only works for one use case
and (b) turning KVM into a scheduler of any kind.

> > Which illustrates one of the points I'm trying to make.
> > Upstream will never accept anything that's wildly complex or specific
> > because such a thing is unlikely to be maintainable.
>
> TBH, it is not that complex though.

Yet.  Your use case is happy with relatively simple, coarse-grained hooks.
Use cases that want to squeeze out maximum performance, e.g. because shaving
N% off the runtime saves $$$, are likely willing to take on far more
complexity, or may just want to make decisions at a slightly different
granularity.

> But let us know which parts, if any, can be further simplified (I saw your
> suggestions for next steps in the reply to Vineeth, those look good to me
> and we'll plan accordingly).

It's not a matter of simplifying things, it's a matter of KVM (a) not
defining policy of any kind and (b) not defining a guest/host ABI.

> > > We have to intervene *before* the scheduler takes the vCPU thread off
> > > the CPU.
> >
> > If the host scheduler is directly involved in the paravirt shenanigans,
> > then there is no need to hook KVM's VM-Exit path because the scheduler
> > already has the information needed to make an informed decision.
>
> Just to clarify, we're not making any "decisions" in the VM exit path,

Yes, you are.

> we're just giving the scheduler enough information (via the
> setscheduler call). The scheduler may just as well "decide" it wants
> to still preempt the vCPU thread -- that's OK, in fact it's required
> sometimes. We're just arming it with more information, specifically that
> this is an important thread. We can find another way to pass this
> information along (BPF etc.) but I just wanted to mention that KVM is
> not really replacing the functionality or decision-making of the
> scheduler even with this POC.

Yes, it is.  kvm_vcpu_kick_boost() *directly* adjusts the priority of the
task.  KVM is not just passing a message, KVM is defining a scheduling policy
of "boost vCPUs with pending IRQs, NMIs, SMIs, and PV unhalt events".
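Concretely, the boost boils down to KVM itself flipping the vCPU task's
scheduling policy, i.e. making a scheduling decision (hypothetical helper and
priority value below, paraphrasing the PoC's effect rather than quoting it):

	/*
	 * Hypothetical paraphrase of what "boosting" amounts to: KVM, not
	 * userspace and not the scheduler, promotes the vCPU task to RT.
	 */
	static void vcpu_sched_boost(struct task_struct *vcpu_task)
	{
		struct sched_param param = { .sched_priority = 1 };

		sched_setscheduler_nocheck(vcpu_task, SCHED_FIFO, &param);
	}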
The VM-Exit path also makes those same decisions by boosting a vCPU if the
guest has requested a boost *or* the vCPU has a pending event (but oddly, not
pending NMIs, SMIs, or PV unhalt events):

	bool pending_event = kvm_cpu_has_pending_timer(vcpu) ||
			     kvm_cpu_has_interrupt(vcpu);
	/*
	 * vcpu needs a boost if
	 * - A lazy boost request active or a pending latency sensitive event, and
	 * - Preemption disabled duration on this vcpu has not crossed the threshold.
	 */
	return ((schedinfo.boost_req == VCPU_REQ_BOOST || pending_event) &&
		!kvm_vcpu_exceeds_preempt_disabled_duration(&vcpu->arch));

Which, by the by, is suboptimal.  Detecting pending events isn't free, so you
ideally want to check for pending events if and only if the guest hasn't
requested a boost.
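E.g. something like this untested reshuffle of the above does the cheap check
first and touches the pending-event logic only when the guest hasn't already
asked for a boost:

	/* Untested sketch: short-circuit on the guest's explicit request. */
	if (kvm_vcpu_exceeds_preempt_disabled_duration(&vcpu->arch))
		return false;

	if (schedinfo.boost_req == VCPU_REQ_BOOST)
		return true;

	return kvm_cpu_has_pending_timer(vcpu) || kvm_cpu_has_interrupt(vcpu);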
> > > Similarly, in the case of an interrupt injected into the guest, we have
> > > to boost the vCPU before the "vCPU run" stage -- anything later might
> > > be too late.
> >
> > Except that this RFC doesn't actually do this.  KVM's relevant function
> > names suck and aren't helping you, but these patches boost vCPUs when
> > events are *pended*, not when they are actually injected.
>
> We are doing the injection bit in:
> https://lore.kernel.org/all/20231214024727.3503870-5-vineeth@bitbyteword.org/
>
> For instance, in:
>
> kvm_set_msi ->
> kvm_irq_delivery_to_apic ->
> kvm_apic_set_irq ->
> __apic_accept_irq ->
> kvm_vcpu_kick_boost();
>
> The patch is a bit out of order because patch 4 depends on 3. Patch 3
> does what you're referring to, which is checking for pending events.
>
> Did we miss something? If there is some path that we are missing, your
> help is much appreciated as you're likely much more versed with this
> code than me. But doing the boosting at injection time is what has
> made all the difference (for instance with cyclictest latencies).

That accepts an IRQ into the vCPU's local APIC, it does not *inject* the IRQ
into the vCPU proper.  The actual injection is done by
kvm_check_and_inject_events().  A pending IRQ is _usually_ delivered/injected
fairly quickly, but not always.

E.g. if the guest has IRQs disabled (RFLAGS.IF=0), KVM can't immediately
inject the IRQ (without violating the x86 architecture).  In that case, KVM
will twiddle VMCS/VMCB knobs to detect an IRQ window, i.e. to cause a VM-Exit
when IRQs are no longer blocked in the guest.

Your PoC actually (mostly) handles this (see above) by keeping the vCPU
boosted after EXIT_REASON_INTERRUPT_WINDOW (because the IRQ will obviously
still be pending).

> > Boosting the priority of vCPUs at semi-arbitrary points is going to be
> > much more difficult for KVM to "get right".  E.g. why boost the priority
> > of a vCPU that has a pending IRQ, but not a vCPU that is running with
> > IRQs disabled?
>
> I was actually wanting to boost preempted vCPU threads that were
> preempted in IRQ-disabled regions as well. In fact, that is on our
> TODO; we just haven't done it yet, as we notice that usually IRQs are
> disabled while preemption was already disabled, and just boosting
> preempt-disabled gets us most of the way there.
>
> > The potential for endless twiddling to try and tune KVM's de facto
> > scheduling logic so that it's universally "correct" is what terrifies me.
>
> Yes, we can certainly look into BPF to make it a bit more generic for
> our usecase (while getting enough information from the kernel).
>
> By the way, one other usecase for this patch series is RCU. I am one
> of the RCU maintainers and I am looking into improving RCU in VMs. For
> instance, boosting preempted RCU readers to RT (CONFIG_RCU_BOOST) does
> not usually work because the vCPU thread on the host is not RT.
> Similarly, other RCU threads which have RT priority don't get to run,
> causing issues.
>
> Thanks!
>
> - Joel