Date: Thu, 14 Dec 2023 08:38:47 -0800
In-Reply-To: <20231214024727.3503870-1-vineeth@bitbyteword.org>
References: <20231214024727.3503870-1-vineeth@bitbyteword.org>
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
From: Sean Christopherson
To: "Vineeth Pillai (Google)"
Cc: Ben Segall, Borislav Petkov, Daniel Bristot de Oliveira, Dave Hansen,
    Dietmar Eggemann, "H. Peter Anvin", Ingo Molnar, Juri Lelli,
    Mel Gorman, Paolo Bonzini, Andy Lutomirski, Peter Zijlstra,
    Steven Rostedt, Thomas Gleixner, Valentin Schneider, Vincent Guittot,
    Vitaly Kuznetsov, Wanpeng Li, Suleiman Souhlal, Masami Hiramatsu,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    Tejun Heo, Josh Don, Barret Rhoden, David Vernet

+sched_ext folks

On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
> Double scheduling is a concern with virtualization hosts where the host
> schedules vcpus without knowing what's run by the vcpu, and the guest
> schedules tasks without knowing where the vcpu is physically running.
> This causes issues related to latencies, power consumption, resource
> utilization, etc. An ideal solution would be a cooperative scheduling
> framework where the guest and host share scheduling-related information
> and make educated scheduling decisions to optimally handle the
> workloads. As a first step, we are taking a stab at reducing latencies
> for latency-sensitive workloads in the guest.
>
> This series of patches aims to implement a framework for dynamically
> managing the priority of vcpu threads based on the needs of the workload
> running on the vcpu. Latency-sensitive workloads (nmi, irq, softirq,
> critical sections, RT tasks, etc.) will get a boost from the host so as
> to minimize the latency.
>
> The host can proactively boost the vcpu threads when it has enough
> information about what is going to run on the vcpu, e.g. when injecting
> interrupts. For the remaining cases, the guest can request a boost if
> the vcpu is not already boosted, and can subsequently request an
> unboost after the latency-sensitive workload completes.
>
> A shared memory region is used to communicate the scheduling
> information. The guest shares its needs for priority boosting and the
> host shares the boosting status of the vcpu. The guest sets a flag when
> it needs a boost and continues running; the host reads the flag on the
> next VMEXIT and boosts the vcpu thread. Unboosting is done synchronously
> so that host workloads can compete fairly with guests when the guest is
> not running any latency-sensitive workload.
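For reference, the handshake described in the cover letter boils down to
something like the sketch below. All struct, field, and function names
here are illustrative guesses for exposition, not the actual ABI from
the series.

/*
 * Hypothetical per-vCPU region shared between guest and host; the real
 * series may lay this out differently.
 */
#include <stdatomic.h>
#include <stdbool.h>

struct vcpu_sched_shared {
	/*
	 * Guest -> host: "boost me, I'm about to run something latency
	 * sensitive (nmi/irq/softirq/RT task/critical section)."
	 */
	atomic_bool boost_requested;
	/* Host -> guest: current boost status of this vCPU thread. */
	atomic_bool boosted;
};

/* Guest side: set the flag and keep running; no exit is forced. */
static void guest_request_boost(struct vcpu_sched_shared *shared)
{
	if (!atomic_load(&shared->boosted))
		atomic_store(&shared->boost_requested, true);
}

/*
 * Guest side: unboosting is synchronous (e.g. via a hypercall) so host
 * workloads can compete fairly once the latency-sensitive work is done.
 */
static void guest_request_unboost(struct vcpu_sched_shared *shared)
{
	atomic_store(&shared->boost_requested, false);
	/* hypercall_unboost();  <- hypothetical synchronous exit */
}

/* Host side, on the next VMEXIT: honor a pending boost request. */
static void host_handle_vmexit(struct vcpu_sched_shared *shared)
{
	if (atomic_load(&shared->boost_requested) &&
	    !atomic_load(&shared->boosted)) {
		/* e.g. move the vCPU task to an RT scheduling policy */
		atomic_store(&shared->boosted, true);
	}
}

Note the asymmetry the cover letter calls out: boosts are picked up
lazily on the next VMEXIT, while unboosts exit synchronously.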
Big thumbs down on my end. Nothing in this RFC explains why this should
be done in KVM. In general, I am very opposed to putting policy of any
kind into KVM, and this puts a _lot_ of unmaintainable policy into KVM
by deciding when to start/stop boosting a vCPU.

Concretely, boosting vCPUs for most events is far too coarse grained.
E.g. boosting a vCPU that is running a low priority workload just
because the vCPU triggered an NMI due to PMU counter overflow doesn't
make sense, and the same goes for a guest hrtimer expiring on a vCPU
running a low priority workload.

And as evidenced by patch 8/8, boosting vCPUs based on when an event is
_pending_ is not maintainable. As hardware virtualizes more and more
functionality, KVM's visibility into the guest effectively decreases,
e.g. Intel and AMD both support IPI virtualization.

Boosting the target of a PV spinlock kick is similarly flawed. In that
case, KVM only gets involved _after_ there is a problem, i.e. after a
lock is contended so heavily that a vCPU stops spinning and instead
decides to HLT. It's not hard to imagine scenarios where a guest would
want to communicate to the host that it's acquiring a spinlock for a
latency sensitive path and so shouldn't be scheduled out. And of course
that's predicated on the assumption that all vCPUs are subject to CPU
overcommit.

Initiating a boost from the host is also flawed in the sense that it
relies on the guest to be on the same page as to when it should stop
boosting. E.g. if KVM boosts a vCPU because an IRQ is pending, but the
guest doesn't want to boost IRQs on that vCPU and thus doesn't stop
boosting at the end of the IRQ handler, then the vCPU could end up being
boosted long after it's done with the IRQ.

Throw nested virtualization into the mix and all of this becomes nigh
impossible to sort out in KVM. E.g. if an L1 vCPU is running an L2 vCPU,
i.e. a nested guest, and L2 is spamming interrupts for whatever reason,
KVM will end up repeatedly boosting the L1 vCPU regardless of the
priority of the L2 workload.

For things that aren't clearly in KVM's domain, I don't think we should
implement KVM-specific functionality until every other option has been
tried (and failed). I don't see any reason why KVM needs to get involved
in scheduling, beyond maybe providing *input* regarding event injection;
emphasis on *input*, because KVM providing information to userspace or
some other entity is wildly different from KVM making scheduling
decisions based on that information.

Pushing the scheduling policies to host userspace would allow for far
more control and flexibility. E.g. a heavily paravirtualized environment
where host userspace knows *exactly* what workloads are being run could
have wildly different policies than an environment where the guest is a
fairly vanilla Linux VM that has received a small amount of
enlightenment.

Lastly, if the concern/argument is that userspace doesn't have the right
knobs to (quickly) boost vCPU tasks, then the proposed sched_ext
functionality seems tailor-made for the problems you are trying to
solve.

https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
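To make that last point concrete: a userspace VMM that knows its vCPU
threads' TIDs can already boost and unboost them directly, with no KVM
changes. A minimal sketch under those assumptions (on Linux,
sched_setscheduler() applies to a single thread, so targeting a vCPU
thread works); error handling is trimmed and the policy/priority choice
is purely illustrative:

#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Boost: move the vCPU thread to an RT policy; unboost: back to CFS. */
static int set_vcpu_boost(pid_t vcpu_tid, bool boost)
{
	struct sched_param param = { .sched_priority = boost ? 1 : 0 };
	int policy = boost ? SCHED_FIFO : SCHED_OTHER;

	if (sched_setscheduler(vcpu_tid, policy, &param)) {
		perror("sched_setscheduler");
		return -1;
	}
	return 0;
}

How the guest's hints reach that code, whether via shared memory polled
by the VMM, a helper daemon, or a sched_ext scheduler, is then a
userspace policy decision rather than KVM's.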