X-Mailing-List: linux-kernel@vger.kernel.org
References: <20231214024727.3503870-1-vineeth@bitbyteword.org>
From: Joel Fernandes
Date: Fri, 15 Dec 2023 10:20:03 -0500
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
To: Sean Christopherson
Cc: Vineeth Remanan Pillai, Ben Segall, Borislav Petkov, Daniel
    Bristot de Oliveira, Dave Hansen, Dietmar Eggemann, "H. Peter Anvin",
    Ingo Molnar, Juri Lelli, Mel Gorman, Paolo Bonzini, Andy Lutomirski,
    Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Valentin Schneider,
    Vincent Guittot, Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal,
    Masami Hiramatsu, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    x86@kernel.org, Tejun Heo, Josh Don, Barret Rhoden, David Vernet

Hi Sean,

Nice to see your quick response to the RFC, thanks. I wanted to
clarify some points below:

On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson wrote:
>
> On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson wrote:
> > Now when I think about it, the implementation seems to
> > suggest that we are putting policies in kvm. Ideally, the goal is:
> > - guest scheduler communicates the priority requirements of the workload
> > - kvm applies the priority to the vcpu task.
>
> Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> is that userspace doesn't have the right knobs to adjust the priority of a task
> quickly and efficiently, then wouldn't it be better to solve that problem in a
> generic way?

No, it is not only about tasks. We are boosting anything RT or above
such as softirq, irq etc as well. Could you please see the other
patches? Also, Vineeth please make this clear in the next revision.

> > > Pushing the scheduling policies to host userspace would allow for far more control
> > > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > > knows *exactly* what workloads are being run could have wildly different policies
> > > than an environment where the guest is a fairly vanilla Linux VM that has received
> > > a small amount of enlightment.
> > >
> > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > tailor made for the problems you are trying to solve.
> > >
> > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > >
> > You are right, sched_ext is a good choice to have policies
> > implemented. In our case, we would need a communication mechanism as
> > well and hence we thought kvm would work best to be a medium between
> > the guest and the host.
>
> Making KVM be the medium may be convenient and the quickest way to get a PoC
> out the door, but effectively making KVM a middle-man is going to be a huge net
> negative in the long term. Userspace can communicate with the guest just as
> easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> define a relatively rigid guest/host ABI.

At the moment, the only ABI is a shared memory structure and a custom
MSR. This is no different from the existing steal time accounting,
where a structure is similarly shared between host and guest. Could we
perhaps augment that structure with other fields instead of adding a
new one? On the ABI point, we have deliberately tried to keep it
simple (for example, a few months ago we had hypercalls and we went to
great lengths to eliminate those).

> If instead the contract is between host userspace and the guest, the ABI can be
> much more fluid, e.g. if you (or any setup) can control at least some amount of
> code that runs in the guest

I see your point of view. One way to achieve this is to have a BPF
program implement the boosting part in the VMEXIT path. KVM then just
calls a hook. Would that alleviate some of your concerns?

> then the contract between the guest and host doesn't
> even need to be formally defined, it could simply be a matter of bundling host
> and guest code appropriately.
>
> If you want to land support for a given contract in upstream repositories, e.g.
> to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> it's largely just a matter of writing code, and LKML provides a centralized channel
> for getting buyin from all parties. But defining an ABI that's independent of the
> kernel is absolutely doable, e.g. see the many virtio specs.
>
> I'm not saying KVM can't help, e.g. if there is information that is known only
> to KVM, but the vast majority of the contract doesn't need to be defined by KVM.

The key to making the patch work is the VMEXIT path, which is only
available to KVM. If we do anything later, then it might be too late.
We have to intervene *before* the scheduler takes the vCPU thread off
the CPU. Similarly, in the case of an interrupt injected into the
guest, we have to boost the vCPU before the "vCPU run" stage --
anything later might be too late.

Also, you mentioned something about the tick path in the other email.
We have no control over the host tick preempting the vCPU thread. The
guest *will VMEXIT* on the host tick. On ChromeOS, we run multiple VMs
and overcommitting is very common, especially on devices with a
smaller number of CPUs.

Just to clarify, this isn't a "quick POC". We have been working on
this for many months and it was hard to get working correctly and
handle all corner cases. We are finally at a point where it just works
(TM) and is roughly half the code size of when we initially started.

thanks,

 - Joel