Date: Fri, 17 Nov 2023 13:37:59 +0100
From: Peter Zijlstra
To: Tobias Huschle
Cc: Linux Kernel, kvm@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, mst@redhat.com, jasowang@redhat.com,
	wuyun.abel@bytedance.com
Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
Message-ID: <20231117123759.GP8262@noisy.programming.kicks-ass.net>
References: <20231117092318.GJ8262@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 17, 2023 at 01:24:21PM +0100, Tobias Huschle wrote:
> On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:
> > kworkers are typically not in cgroups and are part of the root cgroup,
> > but what's a vhost and where does it live?
>
> The qemu instances of the two KVM guests are placed into cgroups.
> The vhosts run within the context of these qemu instances (4 threads per guest).
> So they are also put into those cgroups.
>
> I'll answer the other questions you brought up as well, but I guess that one
> is most critical:
>
> >
> > After confirming both tasks are indeed in the same cgroup of course,
> > because if they're not, vruntime will be meaningless to compare and we
> > should look elsewhere.
>
> In that case we probably have to go with elsewhere ... which is good to know.

Ah, so if this is a cgroup issue, it might be worth trying this patch
that we have in tip/sched/urgent.

I'll try and read the rest of the email a little later, gotta run
errands first.

---

commit eab03c23c2a162085b13200d7942fc5a00b5ccc8
Author: Abel Wu
Date:   Tue Nov 7 17:05:07 2023 +0800

    sched/eevdf: Fix vruntime adjustment on reweight

    vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
    it gets re-weighted, and the calculations can be simplified based
    on the fact that re-weight won't change the w-average of all the
    entities. Please check the proofs in comments.

    But adjusting vruntime can also cause position change in RB-tree
    hence require re-queue to fix up which might be costly. This might
    be avoided by deferring adjustment to the time the entity actually
    leaves tree (dequeue/pick), but that will negatively affect task
    selection and probably not good enough either.

    Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
    Signed-off-by: Abel Wu
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2048138ce54b..025d90925bf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			   unsigned long weight)
+{
+	unsigned long old_weight = se->load.weight;
+	u64 avruntime = avg_vruntime(cfs_rq);
+	s64 vlag, vslice;
+
+	/*
+	 * VRUNTIME
+	 * ========
+	 *
+	 * COROLLARY #1: The virtual runtime of the entity needs to be
+	 * adjusted if re-weight at !0-lag point.
+	 *
+	 * Proof: For contradiction assume this is not true, so we can
+	 * re-weight without changing vruntime at !0-lag point.
+	 *
+	 *             Weight	VRuntime   Avg-VRuntime
+	 *     before    w	  v            V
+	 *      after    w'	  v'           V'
+	 *
+	 * Since lag needs to be preserved through re-weight:
+	 *
+	 *	lag = (V - v)*w = (V'- v')*w', where v = v'
+	 *	==>	V' = (V - v)*w/w' + v		(1)
+	 *
+	 * Let W be the total weight of the entities before reweight,
+	 * since V' is the new weighted average of entities:
+	 *
+	 *	V' = (WV + w'v - wv) / (W + w' - w)	(2)
+	 *
+	 * by using (1) & (2) we obtain:
+	 *
+	 *	(WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+	 *	==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+	 *	==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+	 *	==>	(V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+	 *
+	 * Since we are doing at !0-lag point which means V != v, we
+	 * can simplify (3):
+	 *
+	 *	==>	W / (W + w' - w) = w / w'
+	 *	==>	Ww' = Ww + ww' - ww
+	 *	==>	W * (w' - w) = w * (w' - w)
+	 *	==>	W = w	(re-weight indicates w' != w)
+	 *
+	 * So the cfs_rq contains only one entity, hence vruntime of
+	 * the entity @v should always equal to the cfs_rq's weighted
+	 * average vruntime @V, which means we will always re-weight
+	 * at 0-lag point, thus breach assumption. Proof completed.
+	 *
+	 *
+	 * COROLLARY #2: Re-weight does NOT affect weighted average
+	 * vruntime of all the entities.
+	 *
+	 * Proof: According to corollary #1, Eq. (1) should be:
+	 *
+	 *	(V - v)*w = (V' - v')*w'
+	 *	==>    v' = V' - (V - v)*w/w'		(4)
+	 *
+	 * According to the weighted average formula, we have:
+	 *
+	 *	V' = (WV - wv + w'v') / (W - w + w')
+	 *	   = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+	 *	   = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+	 *	   = (WV + w'V' - Vw) / (W - w + w')
+	 *
+	 *	==>  V'*(W - w + w') = WV + w'V' - Vw
+	 *	==>	V' * (W - w) = (W - w) * V	(5)
+	 *
+	 * If the entity is the only one in the cfs_rq, then reweight
+	 * always occurs at 0-lag point, so V won't change. Or else
+	 * there are other entities, hence W != w, then Eq. (5) turns
+	 * into V' = V. So V won't change in either case, proof done.
+	 *
+	 *
+	 * So according to corollary #1 & #2, the effect of re-weight
+	 * on vruntime should be:
+	 *
+	 *	v' = V' - (V - v) * w / w'		(4)
+	 *	   = V  - (V - v) * w / w'
+	 *	   = V  - vl * w / w'
+	 *	   = V  - vl'
+	 */
+	if (avruntime != se->vruntime) {
+		vlag = (s64)(avruntime - se->vruntime);
+		vlag = div_s64(vlag * old_weight, weight);
+		se->vruntime = avruntime - vlag;
+	}
+
+	/*
+	 * DEADLINE
+	 * ========
+	 *
+	 * When the weight changes, the virtual time slope changes and
+	 * we should adjust the relative virtual deadline accordingly.
+	 *
+	 *	d' = v' + (d - v)*w/w'
+	 *	   = V' - (V - v)*w/w' + (d - v)*w/w'
+	 *	   = V  - (V - v)*w/w' + (d - v)*w/w'
+	 *	   = V  + (d - V)*w/w'
+	 */
+	vslice = (s64)(se->deadline - avruntime);
+	vslice = div_s64(vslice * old_weight, weight);
+	se->deadline = avruntime + vslice;
+}
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
-	unsigned long old_weight = se->load.weight;
+	bool curr = cfs_rq->curr == se;
 
 	if (se->on_rq) {
 		/* commit outstanding execution time */
-		if (cfs_rq->curr == se)
+		if (curr)
 			update_curr(cfs_rq);
 		else
-			avg_vruntime_sub(cfs_rq, se);
+			__dequeue_entity(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	update_load_set(&se->load, weight);
-
 	if (!se->on_rq) {
 		/*
 		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
 		 * we need to scale se->vlag when w_i changes.
 		 */
-		se->vlag = div_s64(se->vlag * old_weight, weight);
+		se->vlag = div_s64(se->vlag * se->load.weight, weight);
 	} else {
-		s64 deadline = se->deadline - se->vruntime;
-		/*
-		 * When the weight changes, the virtual time slope changes and
-		 * we should adjust the relative virtual deadline accordingly.
-		 */
-		deadline = div_s64(deadline * old_weight, weight);
-		se->deadline = se->vruntime + deadline;
-		if (se != cfs_rq->curr)
-			min_deadline_cb_propagate(&se->run_node, NULL);
+		reweight_eevdf(cfs_rq, se, weight);
 	}
 
+	update_load_set(&se->load, weight);
+
 #ifdef CONFIG_SMP
 	do {
 		u32 divider = get_pelt_divider(&se->avg);
@@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
-		if (cfs_rq->curr != se)
-			avg_vruntime_add(cfs_rq, se);
+		if (!curr) {
+			/*
+			 * The entity's vruntime has been adjusted, so let's check
+			 * whether the rq-wide min_vruntime needs updated too. Since
+			 * the calculations above require stable min_vruntime rather
+			 * than up-to-date one, we do the update at the end of the
+			 * reweight process.
+			 */
+			__enqueue_entity(cfs_rq, se);
+			update_min_vruntime(cfs_rq);
+		}
 	}
 }
 
@@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
 
 #ifndef CONFIG_SMP
 	shares = READ_ONCE(gcfs_rq->tg->shares);
-
-	if (likely(se->load.weight == shares))
-		return;
 #else
-	shares   = calc_group_shares(gcfs_rq);
+	shares = calc_group_shares(gcfs_rq);
 #endif
-
-	reweight_entity(cfs_rq_of(se), se, shares);
+	if (unlikely(se->load.weight != shares))
+		reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
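
For anyone who wants to sanity-check the arithmetic in the reweight
comments without booting a kernel, here is a minimal userspace sketch.
It is not the kernel code: struct toy_entity and the toy_* helpers are
hypothetical names made up for illustration, V is computed as a plain
weighted average rather than through avg_vruntime(), and ordinary
int64_t division stands in for div_s64(). It only applies
v' = V - (V - v)*w/w' and d' = V + (d - V)*w/w' so the effect on V and
on the lag can be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scheduling entity; not struct sched_entity. */
struct toy_entity {
	int64_t weight;
	int64_t vruntime;
	int64_t deadline;
};

/* Weighted average vruntime: V = sum(w_i * v_i) / sum(w_i). */
static int64_t toy_avg_vruntime(const struct toy_entity *e, size_t n)
{
	int64_t sum = 0, load = 0;

	for (size_t i = 0; i < n; i++) {
		sum += e[i].weight * e[i].vruntime;
		load += e[i].weight;
	}
	return load ? sum / load : 0;
}

/* lag = w * (V - v), the quantity a reweight is supposed to preserve. */
static int64_t toy_lag(const struct toy_entity *e, int64_t V)
{
	return e->weight * (V - e->vruntime);
}

/*
 * Reweight one entity around the average V taken before the change:
 *   v' = V - (V - v) * w / w'
 *   d' = V + (d - V) * w / w'
 * Integer division truncates, so results are exact only when w'
 * divides the scaled terms.
 */
static void toy_reweight(struct toy_entity *e, int64_t V, int64_t new_weight)
{
	int64_t vlag, vslice;

	assert(new_weight > 0);

	vlag = (V - e->vruntime) * e->weight / new_weight;
	vslice = (e->deadline - V) * e->weight / new_weight;

	e->vruntime = V - vlag;
	e->deadline = V + vslice;
	e->weight = new_weight;
}
```

With two entities (w=1024, v=1000, d=4000) and (w=2048, v=4000, d=7000),
V is 3000. Reweighting the first one to 2048 gives v'=2000 and d'=3500,
and both V and the lag w*(V - v) = 2048000 come out unchanged, which is
what corollary #2 and Eq. (4) say should happen.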