Subject: Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated
From: Zhang Qiao
To: Vincent Guittot
Date: Thu, 9 Mar 2023 22:23:58 +0800
Message-ID: <6b189026-f33b-6418-3144-5e82dfc712e6@huawei.com>
References: <20230306132418.50389-1-zhangqiao22@huawei.com> <07c692fd-fe59-1bd4-a6d0-e84bee6dbb3b@huawei.com>
List: linux-kernel@vger.kernel.org

On 2023/3/9 18:48, Vincent Guittot wrote:
> On Thu, 9 Mar 2023 at 09:37, Zhang Qiao wrote:
>>
>> On 2023/3/8 20:55, Vincent Guittot wrote:
>>> On Wednesday, 8 March 2023 at 09:01:05 (+0100), Vincent Guittot wrote:
>>>> On Tue, 7 Mar 2023 at 14:41, Zhang Qiao wrote:
>>>>>
>>>>> On 2023/3/7 18:26, Vincent Guittot wrote:
>>>>>> On Mon, 6 Mar 2023 at 14:53, Vincent Guittot wrote:
>>>>>>>
>>>>>>> On Mon, 6 Mar 2023 at 13:57, Zhang Qiao wrote:
>>>>>>>>
>>>>>>>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of
>>>>>>>> entity being placed") fixes an overflow bug, but ignores
>>>>>>>> the case where se->exec_start is reset after a migration.
>>>>>>>>
>>>>>>>> To fix this case, reset the vruntime of a long-sleeping
>>>>>>>> task in migrate_task_rq_fair().
>>>>>>>>
>>>>>>>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>>>>>>>> Suggested-by: Vincent Guittot
>>>>>>>> Signed-off-by: Zhang Qiao
>>>>>>>
>>>>>>> Reviewed-by: Vincent Guittot
>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> v1 -> v2:
>>>>>>>> - fix some typos and update comments
>>>>>>>> - reformat the patch
>>>>>>>>
>>>>>>>> ---
>>>>>>>>  kernel/sched/fair.c | 76 ++++++++++++++++++++++++++++++++-------------
>>>>>>>>  1 file changed, 55 insertions(+), 21 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>> index 7a1b1f855b96..74c9918ffe76 100644
>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>> @@ -4648,11 +4648,45 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>>>>>>  #endif
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static inline bool entity_is_long_sleep(struct sched_entity *se)
>>>>>>>> +{
>>>>>>>> +	struct cfs_rq *cfs_rq;
>>>>>>>> +	u64 sleep_time;
>>>>>>>> +
>>>>>>>> +	if (se->exec_start == 0)
>>>>>>>> +		return false;
>>>>>>>> +
>>>>>>>> +	cfs_rq = cfs_rq_of(se);
>>>>>>>> +	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
>>>>>>>> +	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
>>>>>>>> +		return true;
>>>>>>>> +
>>>>>>>> +	return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline u64 sched_sleeper_credit(struct sched_entity *se)
>>>>>>>> +{
>>>>>>>> +	unsigned long thresh;
>>>>>>>> +
>>>>>>>> +	if (se_is_idle(se))
>>>>>>>> +		thresh = sysctl_sched_min_granularity;
>>>>>>>> +	else
>>>>>>>> +		thresh = sysctl_sched_latency;
>>>>>>>> +
>>>>>>>> +	/*
>>>>>>>> +	 * Halve their sleep time's effect, to allow
>>>>>>>> +	 * for a gentler effect of sleepers:
>>>>>>>> +	 */
>>>>>>>> +	if (sched_feat(GENTLE_FAIR_SLEEPERS))
>>>>>>>> +		thresh >>= 1;
>>>>>>>> +
>>>>>>>> +	return thresh;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static void
>>>>>>>>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>>>>>>>>  {
>>>>>>>>  	u64 vruntime = cfs_rq->min_vruntime;
>>>>>>>> -	u64 sleep_time;
>>>>>>>>
>>>>>>>>  	/*
>>>>>>>>  	 * The 'current' period is already promised to the current tasks,
>>>>>>>> @@ -4664,23 +4698,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>>>>>>>>  		vruntime += sched_vslice(cfs_rq, se);
>>>>>>>>
>>>>>>>>  	/* sleeps up to a single latency don't count. */
>>>>>>>> -	if (!initial) {
>>>>>>>> -		unsigned long thresh;
>>>>>>>> -
>>>>>>>> -		if (se_is_idle(se))
>>>>>>>> -			thresh = sysctl_sched_min_granularity;
>>>>>>>> -		else
>>>>>>>> -			thresh = sysctl_sched_latency;
>>>>>>>> -
>>>>>>>> -		/*
>>>>>>>> -		 * Halve their sleep time's effect, to allow
>>>>>>>> -		 * for a gentler effect of sleepers:
>>>>>>>> -		 */
>>>>>>>> -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
>>>>>>>> -			thresh >>= 1;
>>>>>>>> -
>>>>>>>> -		vruntime -= thresh;
>>>>>>>> -	}
>>>>>>>> +	if (!initial)
>>>>>>>> +		vruntime -= sched_sleeper_credit(se);
>>>>>>>>
>>>>>>>>  	/*
>>>>>>>>  	 * Pull vruntime of the entity being placed to the base level of
>>>>>>>> @@ -4689,8 +4708,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>>>>>>>>  	 * the base as it may be too far off and the comparison may get
>>>>>>>>  	 * inversed due to s64 overflow.
>>>>>>>>  	 */
>>>>>>>> -	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
>>>>>>>> -	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
>>>>>>>> +	if (entity_is_long_sleep(se))
>>>>>>>>  		se->vruntime = vruntime;
>>>>>>>>  	else
>>>>>>>>  		se->vruntime = max_vruntime(se->vruntime, vruntime);
>>>>>>>> @@ -7635,7 +7653,23 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>>>>>>>>  	if (READ_ONCE(p->__state) == TASK_WAKING) {
>>>>>>>>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>>>>>>
>>>>>>>> -		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
>>>>>>>> +		/*
>>>>>>>> +		 * We determine whether a task sleeps for long by checking
>>>>>>>> +		 * se->exec_start, and if it is, we sanitize its vruntime at
>>>>>>>> +		 * place_entity(). However, after a migration, this detection
>>>>>>>> +		 * method fails due to se->exec_start being reset.
>>>>>>>> +		 *
>>>>>>>> +		 * For fixing this case, we add the same check here. For a task
>>>>>>>> +		 * which has slept for a long time, its vruntime should be reset
>>>>>>>> +		 * to cfs_rq->min_vruntime with a sleep credit. Because waking
>>>>>>>> +		 * task's vruntime will be added to cfs_rq->min_vruntime when
>>>>>>>> +		 * enqueue, we only need to reset the se->vruntime of waking task
>>>>>>>> +		 * to a credit here.
>>>>>>>> +		 */
>>>>>>>> +		if (entity_is_long_sleep(se))
>>>>>>
>>>>>> I completely overlooked that we can't use rq_clock_task here. Need to
>>>>>> think a bit more on this.
>>>>>
>>>>> Hi, Vincent,
>>>>>
>>>>> How about using exec_start of the parent sched_entity instead of rq_clock_task()?
>>>>
>>>> How do we handle a sched_entity without a parent sched_entity?
>>>
>>> The change below should do the stuff. Note that we can't use it in
>>> place_entity() because pelt is not enabled in UP.
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 74c9918ffe76..b8b381b0ff20 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7635,6 +7635,32 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>  	return new_cpu;
>>>  }
>>>
>>> +static inline bool migrate_long_sleeper(struct sched_entity *se)
>>> +{
>>> +	struct cfs_rq *cfs_rq;
>>> +	u64 sleep_time;
>>> +
>>> +	if (se->exec_start == 0)
>>
>> How about using `se->avg.last_update_time == 0` here?
>>
>>> +		return false;
>>> +
>>> +	cfs_rq = cfs_rq_of(se);
>>> +	/*
>>> +	 * If the entity slept for a long time, don't even try to normalize its
>>> +	 * vruntime with the base as it may be too far off and might generate a
>>> +	 * wrong decision because of s64 overflow.
>>> +	 * We estimate its sleep duration with the last update of se's pelt.
>>> +	 * The last update happened before sleeping. The cfs' pelt is not
>>> +	 * always updated when cfs is idle, but this is not a problem because
>>> +	 * its min_vruntime is not updated either, so the situation can't get
>>> +	 * worse.
>>> +	 */
>>> +	sleep_time = cfs_rq_last_update_time(cfs_rq) - se->avg.last_update_time;
>>> +	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
>>
>> I tested with this patch and found it makes hackbench slower.
>
> Compared to which version ?
> - v6.2 + revert commit 829c1651e9c4 ?
> - v6.2 ?
>
> v6.2 has a bug because a newly migrated task gets a runtime credit
> whatever its previous vruntime, and hackbench generates a lot of
> migrations.

Sorry, when I was testing I forgot to add the check `se->exec_start == 0`
in place_entity(). After re-testing, the hackbench result is OK.

>>
>>> +		return true;
>>> +
>>> +	return false;
>>> +}
>>> +
>>>  /*
>>>   * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
>>>   * cfs_rq_of(p) references at time of call are still valid and identify the
>>> @@ -7666,7 +7692,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>>>  	 * enqueue, we only need to reset the se->vruntime of waking task
>>>  	 * to a credit here.
>>>  	 */
>>> -	if (entity_is_long_sleep(se))
>>> +	if (migrate_long_sleeper(se))
>>>  		se->vruntime = -sched_sleeper_credit(se);
>>>  	else
>>>  		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
>>>
>>>>
>>>>>>
>>>>>>>> +			se->vruntime = -sched_sleeper_credit(se);
>>>>>>>> +		else
>>>>>>>> +			se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
>>>>>>>>  	}
>>>>>>>>
>>>>>>>>  	if (!task_on_rq_migrating(p)) {
>>>>>>>> --
>>>>>>>> 2.17.1