Received: by 2002:ac2:464d:0:0:0:0:0 with SMTP id s13csp3598123lfo; Mon, 23 May 2022 08:54:14 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwsyJd1vcjTUteGXC0fFRCPQLJByZIVUael0YLylzLEsiq7/36DzP4UrouyTjpBLGEtU5J0 X-Received: by 2002:a17:903:20f:b0:158:d86a:f473 with SMTP id r15-20020a170903020f00b00158d86af473mr23636155plh.92.1653321253835; Mon, 23 May 2022 08:54:13 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1653321253; cv=none; d=google.com; s=arc-20160816; b=najKDsWVmz584QacugKizANbmIpfqYaRmx0Vp9lj/IArLMu/lGChzMBgoQ3gtcCMYT 07DCzT3nkC/CFR8Qy91SlM96lw1w146iPu1Bz3SnVhm9RTRaseLPjczLbv3QyGPmQlDQ zQTU8jzy81Xt9uSQVafG9sf2W64nwvB6bpccjlHe3eIpL8mUNarmnWjGK4gbk1i/Q5EX sHuyPBYyhg/bVMpmMHVEajq2LBJ7RSCZQp/yVXzAS075/zHXy/VhBFxzpneY0dmWOaFp fO7mpQrwYSOeTQPS6BMMx+6jmHKAu4lR8FG+zDX0e/owmsMXnyeOISdOop5yppwdbJR5 bB+g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=tRDi5jTRVbNIM2xVenGdkInBedwqyzv/oBdGL9Foj9E=; b=RkOGsqb20rzj2v1+eXtsoXEOoaztoFXBPiIgYCvixPLUkt5Bu5bBQ3nm7Unb3KUfMf EDlKaOPVHwaknTAuo/UHxEIeZwE9hIg7PitS9U162PddrwQB096aKr6gwn6bxQoGNnik sV+0csIKTOn34y7vSVzwSmH6vjB3cedRlG3jdRl1iccv7ST6Dzak0TVAE6HmrFYFZxYA SbGRMoKN6GZmWzT+m8GY7GyTPU7Il17HA71jSCNvIQlGltzNTqN5HlZGyyFSdo8DgV/1 ht59c8Q7yrKaAYSzVKwfhhRguFSB5oirr0EXKKpbckSPNYy2OEAKeU5pX8S7IPFgp3k3 +2AA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=DiyvxO6s; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id w22-20020a634916000000b003c6750425e7si10901071pga.596.2022.05.23.08.54.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 23 May 2022 08:54:13 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=DiyvxO6s; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 606A93FBC2; Mon, 23 May 2022 08:53:29 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238243AbiEWPxG (ORCPT + 99 others); Mon, 23 May 2022 11:53:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54272 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238280AbiEWPwk (ORCPT ); Mon, 23 May 2022 11:52:40 -0400 Received: from mail-wm1-x349.google.com (mail-wm1-x349.google.com [IPv6:2a00:1450:4864:20::349]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 22C0132073 for ; Mon, 23 May 2022 08:52:37 -0700 (PDT) Received: by mail-wm1-x349.google.com with SMTP id bi5-20020a05600c3d8500b0039489e1d18dso10745970wmb.5 for ; Mon, 23 May 2022 08:52:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=tRDi5jTRVbNIM2xVenGdkInBedwqyzv/oBdGL9Foj9E=; b=DiyvxO6sXyHaWQ6S+EmySQgx0bNyXdIhCGWAie9Sl6/sgq2Fi+2C441A/a2ATDwYOp K3snUHUoYskQuyjkZVApMHDJd6N8X0doGokJXwk1aQvU9EMbr2H5aoM/0acDmkS9xlY6 v/m0wBlGbB/h84XBe/ID9RuFLUCELn0QfaT8zfZrmkzuUMmjtAYNaRfLMXjbirJ2fdqV nJARZaL7Is6VvV7jzS9rLJ59X1pJ2mwfrrHFx/XcNzd36OIvka78sF5dIuH9VXwQeUdR jhamYmc29A7Knn4JXRSKz4a+JpmqaQTscO5FJIw//0dcE4o6+G3/vN3N4uoK/ymCe+/h p2+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=tRDi5jTRVbNIM2xVenGdkInBedwqyzv/oBdGL9Foj9E=; b=mhsicfpy83XFlpakGX4Ngbe/t1M7yubhhOZR5DrOF+r2lgZuW9eb/ERjif5GgqzLGg zit2kCxHUuSiOxwheujAsA8bmS2FhRDbx2lj2Ej3KZX9Yyi6co3PGIkAKH6ugGtxot6f gmdi73mTNnuHL9YecyWW4x7NHmG3uuxzR8q/RlnsBBTcMHRWGuKaZhiJVhEPBz6qxNnz 7Y0P6Vw9D4pddfPMvRXMFEo9PYZeTDj9cY2FKDeV3hRZ+38q5x7YdDf2LDyWBAM+bslH lEiqLBqqIGQ+y4Jn+bNVnxmJVteEaMZjqvY1pr4j9CaEj9RF8vhc86BRzXDeMKZWIbM0 Ef6A== X-Gm-Message-State: AOAM530po1Sv1kABMB7/rTrgYCjMLtAmVinCkUcs7RxIPRK7Jhnb5UT8 j9vl7f1TvKYcY+ecgogQ/Z3K4OxWRUSgfccR X-Received: from vdonnefort.c.googlers.com ([fda3:e722:ac3:cc00:28:9cb1:c0a8:2eea]) (user=vdonnefort job=sendgmr) by 2002:a1c:f415:0:b0:397:2db0:ed73 with SMTP id z21-20020a1cf415000000b003972db0ed73mr19835156wma.19.1653321156402; Mon, 23 May 2022 08:52:36 -0700 (PDT) Date: Mon, 23 May 2022 16:51:35 +0100 In-Reply-To: <20220523155140.2878563-1-vdonnefort@google.com> Message-Id: <20220523155140.2878563-3-vdonnefort@google.com> Mime-Version: 1.0 References: <20220523155140.2878563-1-vdonnefort@google.com> X-Mailer: git-send-email 2.36.1.124.g0e6072fb45-goog Subject: [PATCH v9 2/7] sched/fair: Decay task PELT values during wakeup migration From: Vincent Donnefort To: peterz@infradead.org, mingo@redhat.com, vincent.guittot@linaro.org Cc: linux-kernel@vger.kernel.org, dietmar.eggemann@arm.com, morten.rasmussen@arm.com, chris.redpath@arm.com, qperret@google.com, tao.zhou@linux.dev, kernel-team@android.com, vdonnefort@google.com, Vincent Donnefort Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.5 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE, USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Vincent Donnefort Before being migrated to a new CPU, a task sees its PELT values synchronized with rq last_update_time. Once done, that same task will also have its sched_avg last_update_time reset. This means the time between the migration and the last clock update will not be accounted for in util_avg and a discontinuity will appear. This issue is amplified by the PELT clock scaling. It takes currently one tick after the CPU being idle to let clock_pelt catching up clock_task. This is especially problematic for asymmetric CPU capacity systems which need stable util_avg signals for task placement and energy estimation. Ideally, this problem would be solved by updating the runqueue clocks before the migration. But that would require taking the runqueue lock which is quite expensive [1]. Instead estimate the missing time and update the task util_avg with that value. To that end, we need sched_clock_cpu() but it is a costly function. Limit the usage to the case where the source CPU is idle as we know this is when the clock is having the biggest risk of being outdated. In this such case, let's call it cfs_idle_lag the delta time between the rq_clock_pelt value at rq idle and cfs_rq idle. And rq_idle_lag the delta between "now" and the rq_clock_pelt at rq idle. The estimated PELT clock is then: last_update_time + (the cfs_rq's last_update_time) cfs_idle_lag + (delta between cfs_rq's update and rq's update) rq_idle_lag (delta between rq's update and now) last_update_time = cfs_rq_clock_pelt() = rq_clock_pelt() - cfs->throttled_clock_pelt_time cfs_idle_lag = rq_clock_pelt()@rq_idle - rq_clock_pelt()@cfs_rq_idle rq_idle_lag = sched_clock_cpu() - rq_clock()@rq_idle The rq_clock_pelt() from last_update_time being the same as rq_clock_pelt()@cfs_rq_idle, we can write: estimation = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time + sched_clock_cpu() - rq_clock()@rq_idle The clocks being not accessible without the rq lock taken, some timestamps are created: rq_clock_pelt()@rq_idle is rq->clock_pelt_idle rq_clock()@rq_idle is rq->enter_idle cfs->throttled_clock_pelt_time is cfs_rq->throttled_pelt_idle The rq_idle_lag part of the missing time is however an estimation that doesn't take into account IRQ and Paravirt time. [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/ Signed-off-by: Vincent Donnefort Signed-off-by: Vincent Donnefort diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 05614d9b919c..df5e6e565b4d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3310,6 +3310,29 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags) } #ifdef CONFIG_SMP +static inline bool load_avg_is_decayed(struct sched_avg *sa) +{ + if (sa->load_sum) + return false; + + if (sa->util_sum) + return false; + + if (sa->runnable_sum) + return false; + + /* + * _avg must be null when _sum are null because _avg = _sum / divider + * Make sure that rounding and/or propagation of PELT values never + * break this. + */ + SCHED_WARN_ON(sa->load_avg || + sa->util_avg || + sa->runnable_avg); + + return true; +} + static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq) { return u64_u32_load_copy(cfs_rq->avg.last_update_time, @@ -3347,27 +3370,12 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) if (cfs_rq->load.weight) return false; - if (cfs_rq->avg.load_sum) - return false; - - if (cfs_rq->avg.util_sum) - return false; - - if (cfs_rq->avg.runnable_sum) + if (!load_avg_is_decayed(&cfs_rq->avg)) return false; if (child_cfs_rq_on_list(cfs_rq)) return false; - /* - * _avg must be null when _sum are null because _avg = _sum / divider - * Make sure that rounding and/or propagation of PELT values never - * break this. - */ - SCHED_WARN_ON(cfs_rq->avg.load_avg || - cfs_rq->avg.util_avg || - cfs_rq->avg.runnable_avg); - return true; } @@ -3706,6 +3714,88 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum #endif /* CONFIG_FAIR_GROUP_SCHED */ +#ifdef CONFIG_NO_HZ_COMMON +static inline void migrate_se_pelt_lag(struct sched_entity *se) +{ + u64 throttled = 0, now, lut; + struct cfs_rq *cfs_rq; + struct rq *rq; + bool is_idle; + + if (load_avg_is_decayed(&se->avg)) + return; + + cfs_rq = cfs_rq_of(se); + rq = rq_of(cfs_rq); + + rcu_read_lock(); + is_idle = is_idle_task(rcu_dereference(rq->curr)); + rcu_read_unlock(); + + /* + * The lag estimation comes with a cost we don't want to pay all the + * time. Hence, limiting to the case where the source CPU is idle and + * we know we are at the greatest risk to have an outdated clock. + */ + if (!is_idle) + return; + + /* + * Estimated "now" is: last_update_time + cfs_idle_lag + rq_idle_lag, where: + * + * last_update_time (the cfs_rq's last_update_time) + * = cfs_rq_clock_pelt() + * = rq_clock_pelt() - cfs->throttled_clock_pelt_time + * + * cfs_idle_lag (delta between cfs_rq's update and rq's update) + * = rq_clock_pelt()@rq_idle - rq_clock_pelt()@cfs_rq_idle + * + * rq_idle_lag (delta between rq's update and now) + * = sched_clock_cpu() - rq_clock()@rq_idle + * + * The rq_clock_pelt() from last_update_time being the same as + * rq_clock_pelt()@cfs_rq_idle, we can write: + * + * now = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time + + * sched_clock_cpu() - rq_clock()@rq_idle + * Where: + * rq_clock_pelt()@rq_idle is rq->clock_pelt_idle + * rq_clock()@rq_idle is rq->enter_idle + * cfs->throttled_clock_pelt_time is cfs_rq->throttled_pelt_idle + */ + +#ifdef CONFIG_CFS_BANDWIDTH + throttled = u64_u32_load(cfs_rq->throttled_pelt_idle); + /* The clock has been stopped for throttling */ + if (throttled == U64_MAX) + return; +#endif + now = u64_u32_load(rq->clock_pelt_idle); + /* + * Paired with _update_idle_rq_clock_pelt. It ensures at the worst case + * is observed the old clock_pelt_idle value and the new enter_idle, + * which lead to an understimation. The opposite would lead to an + * overestimation. + */ + smp_rmb(); + lut = cfs_rq_last_update_time(cfs_rq); + + now -= throttled; + if (now < lut) + /* + * cfs_rq->avg.last_update_time is more recent than our + * estimation, let's use it. + */ + now = lut; + else + now += sched_clock_cpu(cpu_of(rq)) - u64_u32_load(rq->enter_idle); + + __update_load_avg_blocked_se(now, se); +} +#else +static void migrate_se_pelt_lag(struct sched_entity *se) {} +#endif + /** * update_cfs_rq_load_avg - update the cfs_rq's load/util averages * @now: current time, as per cfs_rq_clock_pelt() @@ -4437,6 +4527,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) */ if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE) update_min_vruntime(cfs_rq); + + if (cfs_rq->nr_running == 0) + update_idle_cfs_rq_clock_pelt(cfs_rq); } /* @@ -6911,6 +7004,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se); */ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) { + struct sched_entity *se = &p->se; + /* * As blocked tasks retain absolute vruntime the migration needs to * deal with this by subtracting the old and adding the new @@ -6918,7 +7013,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) * the task on the new runqueue. */ if (READ_ONCE(p->__state) == TASK_WAKING) { - struct sched_entity *se = &p->se; struct cfs_rq *cfs_rq = cfs_rq_of(se); se->vruntime -= u64_u32_load(cfs_rq->min_vruntime); @@ -6930,25 +7024,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) * rq->lock and can modify state directly. */ lockdep_assert_rq_held(task_rq(p)); - detach_entity_cfs_rq(&p->se); + detach_entity_cfs_rq(se); } else { + remove_entity_load_avg(se); + /* - * We are supposed to update the task to "current" time, then - * its up to date and ready to go to new CPU/cfs_rq. But we - * have difficulty in getting what current time is, so simply - * throw away the out-of-date time. This will result in the - * wakee task is less decayed, but giving the wakee more load - * sounds not bad. + * Here, the task's PELT values have been updated according to + * the current rq's clock. But if that clock hasn't been + * updated in a while, a substantial idle time will be missed, + * leading to an inflation after wake-up on the new rq. + * + * Estimate the missing time from the cfs_rq last_update_time + * and update sched_avg to improve the PELT continuity after + * migration. */ - remove_entity_load_avg(&p->se); + migrate_se_pelt_lag(se); } /* Tell new CPU we are migrated */ - p->se.avg.last_update_time = 0; + se->avg.last_update_time = 0; /* We have migrated, no longer consider this task hot */ - p->se.exec_start = 0; + se->exec_start = 0; update_scan_period(p, new_cpu); } @@ -8114,6 +8212,10 @@ static bool __update_blocked_fair(struct rq *rq, bool *done) if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) { update_tg_load_avg(cfs_rq); + /* sync clock_pelt_idle with last update */ + if (cfs_rq->nr_running == 0) + update_idle_cfs_rq_clock_pelt(cfs_rq); + if (cfs_rq == &rq->cfs) decayed = true; } diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h index 4ff2ed4f8fa1..647e5fcc041b 100644 --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -61,6 +61,25 @@ static inline void cfs_se_util_change(struct sched_avg *avg) WRITE_ONCE(avg->util_est.enqueued, enqueued); } +static inline u64 rq_clock_pelt(struct rq *rq) +{ + lockdep_assert_rq_held(rq); + assert_clock_updated(rq); + + return rq->clock_pelt - rq->lost_idle_time; +} + +/* The rq is idle, we can sync to clock_task */ +static inline void _update_idle_rq_clock_pelt(struct rq *rq) +{ + rq->clock_pelt = rq_clock_task(rq); + + u64_u32_store(rq->enter_idle, rq_clock(rq)); + /* Paired with smp_rmb in migrate_se_pelt_lag */ + smp_wmb(); + u64_u32_store(rq->clock_pelt_idle, rq_clock_pelt(rq)); +} + /* * The clock_pelt scales the time to reflect the effective amount of * computation done during the running delta time but then sync back to @@ -76,8 +95,7 @@ static inline void cfs_se_util_change(struct sched_avg *avg) static inline void update_rq_clock_pelt(struct rq *rq, s64 delta) { if (unlikely(is_idle_task(rq->curr))) { - /* The rq is idle, we can sync to clock_task */ - rq->clock_pelt = rq_clock_task(rq); + _update_idle_rq_clock_pelt(rq); return; } @@ -130,17 +148,23 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq) */ if (util_sum >= divider) rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; + + _update_idle_rq_clock_pelt(rq); } -static inline u64 rq_clock_pelt(struct rq *rq) +#ifdef CONFIG_CFS_BANDWIDTH +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { - lockdep_assert_rq_held(rq); - assert_clock_updated(rq); + u64 throttled; - return rq->clock_pelt - rq->lost_idle_time; + if (unlikely(cfs_rq->throttle_count)) + throttled = U64_MAX; + else + throttled = cfs_rq->throttled_clock_pelt_time; + + u64_u32_store(cfs_rq->throttled_pelt_idle, throttled); } -#ifdef CONFIG_CFS_BANDWIDTH /* rq->task_clock normalized against any time this cfs_rq has spent throttled */ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { @@ -150,6 +174,7 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time; } #else +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { } static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { return rq_clock_pelt(rq_of(cfs_rq)); @@ -204,6 +229,7 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { } static inline void update_idle_rq_clock_pelt(struct rq *rq) { } +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { } #endif diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bf4a0ec98678..97bc26e5c8af 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -648,6 +648,10 @@ struct cfs_rq { int runtime_enabled; s64 runtime_remaining; + u64 throttled_pelt_idle; +#ifndef CONFIG_64BIT + u64 throttled_pelt_idle_copy; +#endif u64 throttled_clock; u64 throttled_clock_pelt; u64 throttled_clock_pelt_time; @@ -1020,6 +1024,12 @@ struct rq { u64 clock_task ____cacheline_aligned; u64 clock_pelt; unsigned long lost_idle_time; + u64 clock_pelt_idle; + u64 enter_idle; +#ifndef CONFIG_64BIT + u64 clock_pelt_idle_copy; + u64 enter_idle_copy; +#endif atomic_t nr_iowait; -- 2.36.1.124.g0e6072fb45-goog