Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
From: Vineeth Remanan Pillai
Date: Thu, 14 Dec 2023 14:25:00 -0500
To: Sean Christopherson
Cc: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira, Dave Hansen,
    Dietmar Eggemann, H. Peter Anvin, Ingo Molnar, Juri Lelli, Mel Gorman,
    Paolo Bonzini, Andy Lutomirski, Peter Zijlstra, Steven Rostedt,
    Thomas Gleixner, Valentin Schneider, Vincent Guittot, Vitaly Kuznetsov,
    Wanpeng Li, Suleiman Souhlal, Masami Hiramatsu, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, x86@kernel.org, Tejun Heo, Josh Don,
    Barret Rhoden, David Vernet, Joel Fernandes
References: <20231214024727.3503870-1-vineeth@bitbyteword.org>

On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson wrote:
>
> +sched_ext folks
>
> On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
> > Double scheduling is a concern with virtualization hosts, where the
> > host schedules vcpus without knowing what is run by the vcpu, and the
> > guest schedules tasks without knowing where the vcpu is physically
> > running. This causes issues related to latency, power consumption,
> > resource utilization, etc. An ideal solution would be a cooperative
> > scheduling framework where the guest and host share scheduling-related
> > information and make educated scheduling decisions to optimally handle
> > the workloads. As a first step, we are taking a stab at reducing
> > latencies for latency-sensitive workloads in the guest.
> >
> > This series of patches aims to implement a framework for dynamically
> > managing the priority of vcpu threads based on the needs of the
> > workload running on the vcpu. Latency-sensitive workloads (nmi, irq,
> > softirq, critical sections, RT tasks, etc.) will get a boost from the
> > host so as to minimize latency.
> >
> > The host can proactively boost the vcpu threads when it has enough
> > information about what is going to run on the vcpu - for example, when
> > injecting interrupts. For the remaining cases, the guest can request a
> > boost if the vcpu is not already boosted, and can subsequently request
> > an unboost after the latency-sensitive workload completes.
> >
> > A shared memory region is used to communicate the scheduling
> > information. The guest shares its needs for priority boosting and the
> > host shares the boosting status of the vcpu. The guest sets a flag
> > when it needs a boost and continues running; the host reads the flag
> > on the next VMEXIT and boosts the vcpu thread. Unboosting is done
> > synchronously so that host workloads can compete fairly with the guest
> > when the guest is not running any latency-sensitive workload.
>
> Big thumbs down on my end. Nothing in this RFC explains why this should
> be done in KVM. In general, I am very opposed to putting policy of any
> kind into KVM, and this puts a _lot_ of unmaintainable policy into KVM
> by deciding when to start/stop boosting a vCPU.
>
I am sorry for not clearly explaining the goal. The intent was not to
have scheduling policies implemented in kvm, but to have a mechanism for
the guest and host schedulers to communicate, so that guest workloads
get fair treatment from the host scheduler while competing with host
workloads. Now that I think about it, the current implementation does
suggest that we are putting policies in kvm. Ideally, the goal is:

- The guest scheduler communicates the priority requirements of the
  workload.
- kvm applies that priority to the vcpu task.
- Now that the vcpu is appropriately prioritized, the host scheduler
  can make the right choice of picking the next best task.

We have an exception of proactive boosting for interrupts/nmis. I don't
expect these proactive boosting cases to grow, and I think they too
should be controlled by the guest: the guest can say in which scenarios
it would like to be proactively boosted. That would make kvm just a
medium to communicate the scheduler requirements from guest to host,
not a place that houses any policies. What do you think?
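To make the above concrete, here is a rough sketch of the kind of
shared-memory interface I have in mind. All names and fields below are
invented for illustration; this is not the exact layout or API in the
series:

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/barrier.h>

/*
 * Hypothetical per-vcpu region shared between guest and host,
 * carrying scheduling hints in both directions.
 */
struct vcpu_sched_hint {
	/* Written by the guest: what the current workload needs. */
	u32 boost_requested;	/* guest wants a priority boost */
	u32 guest_prio;		/* requested priority, e.g. an RT priority */
	/* Written by the host: what kvm actually applied. */
	u32 boosted;		/* vcpu task is currently boosted */
};

/*
 * Guest side: called from a latency-sensitive path (irq entry, RT task
 * wakeup, ...). It only records the request and keeps running; the
 * host picks the request up on the next VMEXIT.
 */
static inline void vcpu_request_boost(struct vcpu_sched_hint *hint, u32 prio)
{
	if (READ_ONCE(hint->boosted))
		return;		/* already boosted, nothing to do */

	WRITE_ONCE(hint->guest_prio, prio);
	/* Publish the priority before raising the request flag. */
	smp_wmb();
	WRITE_ONCE(hint->boost_requested, 1);
}

Note that there is no policy here: the guest decides *when* it needs a
boost, and all the shared structure does is carry that decision to the
host.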
> Concretely, boosting vCPUs for most events is far too coarse grained.
> E.g. boosting a vCPU that is running a low priority workload just
> because the vCPU triggered an NMI due to PMU counter overflow doesn't
> make sense. Ditto for if a guest's hrtimer expires on a vCPU running a
> low priority workload.
>
> And as evidenced by patch 8/8, boosting vCPUs based on when an event
> is _pending_ is not maintainable. As hardware virtualizes more and
> more functionality, KVM's visibility into the guest effectively
> decreases, e.g. Intel and AMD both support IPI virtualization.
>
> Boosting the target of a PV spinlock kick is similarly flawed. In that
> case, KVM only gets involved _after_ there is a problem, i.e. after a
> lock is contended so heavily that a vCPU stops spinning and instead
> decides to HLT. It's not hard to imagine scenarios where a guest would
> want to communicate to the host that it's acquiring a spinlock for a
> latency sensitive path and so shouldn't be scheduled out. And of
> course that's predicated on the assumption that all vCPUs are subject
> to CPU overcommit.
>
> Initiating a boost from the host is also flawed in the sense that it
> relies on the guest to be on the same page as to when it should stop
> boosting. E.g. if KVM boosts a vCPU because an IRQ is pending, but the
> guest doesn't want to boost IRQs on that vCPU and thus doesn't stop
> boosting at the end of the IRQ handler, then the vCPU could end up
> being boosted long after it's done with the IRQ.
>
> Throw nested virtualization into the mix and all of this becomes nigh
> impossible to sort out in KVM. E.g. if an L1 vCPU is running an L2
> vCPU, i.e. a nested guest, and L2 is spamming interrupts for whatever
> reason, KVM will end up repeatedly boosting the L1 vCPU regardless of
> the priority of the L2 workload.
>
> For things that aren't clearly in KVM's domain, I don't think we
> should implement KVM-specific functionality until every other option
> has been tried (and failed). I don't see any reason why KVM needs to
> get involved in scheduling, beyond maybe providing *input* regarding
> event injection, emphasis on *input* because KVM providing information
> to userspace or some other entity is wildly different than KVM making
> scheduling decisions based on that information.
>
Agreed with all the points above; it doesn't make sense to have
policies in kvm. But if kvm can act as a medium to communicate
scheduling requirements between guest and host without making any
decisions itself, would that be more reasonable?
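On the host side, kvm's role would then be limited to relaying the
guest's request to the scheduler. Again purely illustrative, reusing
the hypothetical struct from the earlier sketch; the function name and
the choice of sched_setscheduler() as the mechanism are assumptions for
the sketch, not what the series does:

#include <linux/sched.h>
#include <linux/minmax.h>
#include <uapi/linux/sched/types.h>

/*
 * Host side: called on VMEXIT. kvm does not decide anything here; it
 * only applies the priority the guest asked for, and the host
 * scheduler remains the one picking the next task to run.
 */
static void vcpu_apply_sched_hint(struct task_struct *vcpu_task,
				  struct vcpu_sched_hint *hint)
{
	struct sched_param param = { .sched_priority = 0 };

	if (READ_ONCE(hint->boost_requested) && !READ_ONCE(hint->boosted)) {
		/* Clamp the guest's request to a valid RT priority. */
		param.sched_priority = min_t(u32, READ_ONCE(hint->guest_prio),
					     MAX_RT_PRIO - 1);
		sched_setscheduler(vcpu_task, SCHED_FIFO, &param);
		WRITE_ONCE(hint->boosted, 1);
	} else if (!READ_ONCE(hint->boost_requested) &&
		   READ_ONCE(hint->boosted)) {
		/* Unboost back to CFS so host tasks compete fairly. */
		sched_setscheduler(vcpu_task, SCHED_NORMAL, &param);
		WRITE_ONCE(hint->boosted, 0);
	}
}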
> Pushing the scheduling policies to host userspace would allow for far
> more control and flexibility. E.g. a heavily paravirtualized
> environment where host userspace knows *exactly* what workloads are
> being run could have wildly different policies than an environment
> where the guest is a fairly vanilla Linux VM that has received a small
> amount of enlightenment.
>
> Lastly, if the concern/argument is that userspace doesn't have the
> right knobs to (quickly) boost vCPU tasks, then the proposed sched_ext
> functionality seems tailor made for the problems you are trying to
> solve.
>
> https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
>
You are right, sched_ext is a good fit for implementing the policies.
In our case we would also need a communication mechanism, and hence we
thought kvm would work best as a medium between the guest and the host.
The policies could live in the guest; the guest would communicate its
priority requirements (based on its policy) to the host via kvm, and
the host scheduler would then take action based on that.

Please let me know.

Thanks,
Vineeth