From: Vincent Guittot
Date: Mon, 3 Jul 2017 09:26:10 +0200
Subject: Re: [RFC][PATCH] sched: attach extra runtime to the right avg
To: josef@toxicpanda.com
Cc: "mingo@redhat.com", Peter Zijlstra, linux-kernel, kernel-team@fb.com, Josef Bacik
In-Reply-To: <1498787766-9593-1-git-send-email-jbacik@fb.com>
References: <1498787766-9593-1-git-send-email-jbacik@fb.com>

Hi Josef,

On 30 June 2017 at 03:56, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
>
> We only track the load avg of a se in 1024 ns chunks, so in order to
> make up for the loss of the < 1024 ns part of a run/sleep delta we only
> add the time we processed to the se->avg.last_update_time. The problem
> is there is no way to know if this extra time was while we were asleep
> or while we were running. Instead keep track of the remainder and apply
> it in the appropriate place. If the remainder was while we were
> running, add it to the delta the next time we update the load avg while
> running, and the same for sleeping. This (coupled with other fixes)
> mostly fixes the regression to my workload introduced by Peter's
> experimental runnable load propagation patches.

IIUC, your workload is sensitive to the fact that the minimum
granularity of the load tracking is 1us?
The contribution seems too small to have a real impact on the load_avg.
Maybe rounding last_update_time to the closest value instead of the
bottom value would be enough? We would then have 512ns precision (see
the untested sketch at the end of this mail).

Have you got details about your use case that needs this
sub-microsecond precision?

>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  include/linux/sched.h |  2 ++
>  kernel/sched/fair.c   | 21 ++++++++++++++++++---
>  2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2b69fc6..ed147d4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -316,6 +316,8 @@ struct sched_avg {
>         u64             load_sum;
>         u32             util_sum;
>         u32             period_contrib;
> +       u32             sleep_remainder;
> +       u32             run_remainder;
>         unsigned long   load_avg;
>         unsigned long   util_avg;
>  };
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c77e4b1..573bdcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -813,6 +813,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
>                  * expected state.
>                  */
>                 se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
> +               se->avg.sleep_remainder = 0;
> +               se->avg.run_remainder = 0;
>                 return;
>         }
>  }
> @@ -2882,14 +2884,19 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
>         u64 delta;
> +       u32 remainder = running ? sa->run_remainder : sa->sleep_remainder;
>
> -       delta = now - sa->last_update_time;
> +       delta = now - sa->last_update_time + remainder;

So this remainder comes from the previous identical phase (the previous
running phase or the previous not-running phase).
But that previous phase can be quite old, and you may need to decay the
remainder before applying it.

>
>         /*
>          * This should only happen when time goes backwards, which it
>          * unfortunately does during sched clock init when we swap over to TSC.
>          */
>         if ((s64)delta < 0) {
>                 sa->last_update_time = now;
> +               if (running)
> +                       sa->run_remainder = 0;
> +               else
> +                       sa->sleep_remainder = 0;
>                 return 0;
>         }
>
> @@ -2897,12 +2904,16 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>          * Use 1024ns as the unit of measurement since it's a reasonable
>          * approximation of 1us and fast to compute.
>          */
> +       remainder = delta & (1023UL);
> +       sa->last_update_time = now;

This reverts bb0bd044e65c ("sched/fair: Increase PELT accuracy for
small tasks").

> +       if (running)
> +               sa->run_remainder = remainder;
> +       else
> +               sa->sleep_remainder = remainder;
>         delta >>= 10;
>         if (!delta)
>                 return 0;
>
> -       sa->last_update_time += delta << 10;
> -
>         /*
>          * Now we know we crossed measurement unit boundaries. The *_avg
>          * accrues by two steps:
> @@ -6102,6 +6113,8 @@ static void migrate_task_rq_fair(struct task_struct *p)
>
>         /* Tell new CPU we are migrated */
>         p->se.avg.last_update_time = 0;
> +       p->se.avg.sleep_remainder = 0;
> +       p->se.avg.run_remainder = 0;
>
>         /* We have migrated, no longer consider this task hot */
>         p->se.exec_start = 0;
> @@ -9259,6 +9272,8 @@ static void task_move_group_fair(struct task_struct *p)
>  #ifdef CONFIG_SMP
>         /* Tell se's cfs_rq has been changed -- migrated */
>         p->se.avg.last_update_time = 0;
> +       p->se.avg.sleep_remainder = 0;
> +       p->se.avg.run_remainder = 0;
>  #endif
>         attach_task_cfs_rq(p);
>  }
> --
> 2.7.4
>
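To make the "round to closest" idea above a bit more concrete, here is a
small, untested userspace sketch of just the arithmetic (the helper
names are made up for illustration; this is not kernel code and not a
drop-in for ___update_load_avg()). Truncating the delta loses up to
1023ns per update, while rounding to the nearest 1024ns period bounds
the per-update error to 512ns:

#include <stdio.h>
#include <stdint.h>

/* current policy: truncate the delta to whole 1024ns periods */
static uint64_t periods_floor(uint64_t delta_ns)
{
        return delta_ns >> 10;
}

/* suggested policy: round to the nearest 1024ns period (512ns precision) */
static uint64_t periods_round(uint64_t delta_ns)
{
        return (delta_ns + 512) >> 10;
}

int main(void)
{
        const uint64_t deltas[] = { 100, 511, 512, 1023, 1024, 1535, 1536, 2047 };
        unsigned int i;

        for (i = 0; i < sizeof(deltas) / sizeof(deltas[0]); i++)
                printf("delta=%4llu ns  floor=%llu periods  round=%llu periods\n",
                       (unsigned long long)deltas[i],
                       (unsigned long long)periods_floor(deltas[i]),
                       (unsigned long long)periods_round(deltas[i]));

        return 0;
}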